So those guys (BF, CoD, maybe even Overwatch) are super-huge. They don’t really have problems with scaling. Or, maybe, in some cases, they do?
Here’s something I saw while playing Titanfall 2. First off, and I think this is a significant part of the problem, they have a TON of data centers, like around 8 or 10 within 100ms of my location. So I’m on at, like, 1 AM Pacific. There are a couple hundred people active on my closest data center (Salt Lake City). There are a couple hundred on in Oregon (GCE1?). A couple hundred more in Oregon-2 (GCE2). But we’re having matching problems, it’s taking longer than it should to start games, and I keep seeing the same people. Logically, I can see that during peak times, with a crazy-high number of players logged in, you want everyone to be as close as they can to their own data center, and you want as many damned data centers as you can throw money at. But off-peak? We’re splitting too many players across too many data centers, all for the sake of the difference between 45ms and 65ms. As it got later and later (I am a bit of a night owl), I started to check out other servers, farther away. In many cases, the player counts were as low as zero. Ugh. There’s a real concern that you might have some people logging in, seeing just a few users, and giving up. That’s enormously dangerous.
This is a really hard problem. Crazy hard. There are a lot of different ways to approach it, and as I was thinking about it, I came up with a decent algorithm that might work. This is more of a thought experiment, me armchair-quarterbacking a problem that I’m sure the actual devs are already hard at work fixing. But anyways, here’s my idea:
- First off, instead of connecting to your closest data center, you connect to one Grand Unified Centralized registration system. First come, first served. And, yes, if you live 500ms away, your registration will land 500ms later than someone else’s. Too bad, that’s life. A “registration” consists of just an IP/serial/whatever, a version number, and the number of milliseconds of delay to every data center that exists.
- Now the server has a list of registrations, in time order. First, grab the oldest registration. (Note the time, for keeping track of worst-case wait times.) Grab that registration’s closest data center (shortest ping) from the list of data centers. Walk through the registrations in time order, trying to find however many users you need for a game for whom that is also their closest data center.
- Did that work? Then great. Shunt those ‘N’ players off to a game hosted in that data center. Then repeat with the next oldest registration.
- Let’s say that didn’t work. Now, things get interesting. Of all the registrations you have, find the least-popular first-choice data center. This, by the way, is the same method as Instant Runoff Voting. For anyone who had that data center as their first choice, bump them to their second choice. That may include our candidate registration itself.
- Now, repeat the attempt to match up our candidate. Keep eliminating the least-popular data center until you can find a match. (There’s a rough code sketch of this loop right after the list.)
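Just to make that concrete, here’s a rough Python sketch of the matching loop I’m describing. Everything in it is made up for illustration (the `Registration` shape, `PLAYERS_PER_GAME`, the in-memory queue); it’s a thought-experiment sketch, not anything I imagine the real servers run, and it ignores skill, parties, game modes, and a hundred other real-world concerns.

```python
from dataclasses import dataclass, field
import time

PLAYERS_PER_GAME = 12  # made-up lobby size; use whatever a full game needs


@dataclass
class Registration:
    player_id: str
    version: str
    latencies: dict[str, int]  # ms of delay to every data center, e.g. {"slc": 45, "gce1": 65}
    registered_at: float = field(default_factory=time.time)

    def first_choice(self, eliminated: set[str]) -> str | None:
        """The closest data center that hasn't been eliminated yet."""
        remaining = {dc: ms for dc, ms in self.latencies.items() if dc not in eliminated}
        return min(remaining, key=remaining.get) if remaining else None


def match_once(queue: list[Registration]):
    """Try to build one game around the oldest registration in the queue.

    Returns (data_center, players) on success, or None if no game can be
    formed for that registration yet.
    """
    if not queue:
        return None
    candidate = queue[0]
    eliminated: set[str] = set()
    while True:
        target = candidate.first_choice(eliminated)
        if target is None:
            return None  # the candidate has run out of data centers entirely
        # Walk the queue in time order, collecting players whose current
        # first choice is the same data center as the candidate's.
        party = [r for r in queue if r.first_choice(eliminated) == target]
        if len(party) >= PLAYERS_PER_GAME:
            return target, party[:PLAYERS_PER_GAME]
        # Not enough players: tally everyone's current first choice and
        # eliminate the least-popular data center (the instant-runoff step),
        # bumping its voters (possibly including the candidate) to their
        # next-closest choice.
        tallies: dict[str, int] = {}
        for r in queue:
            dc = r.first_choice(eliminated)
            if dc is not None:
                tallies[dc] = tallies.get(dc, 0) + 1
        eliminated.add(min(tallies, key=tallies.get))
```

The outer loop would just keep calling `match_once`, shunting each matched party off to a game in the chosen data center, removing them from the queue, and repeating with the next-oldest registration.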
So that’s it. Here are some of the practical advantages and disadvantages of this system, and some weird side-effects:
- If there’s, like, a 10-minute wait (which there shouldn’t ever be), some people could beat that wait by arriving at the end of the queue but being in a region that needs players. I think that’s OK.
- I _think_ this might be hard to parallelize. You’d have high contention for the oldest items in the queue; you’d have to have some kind of locking mechanism, or something. This algorithm works best running in series. Maybe there are some sneaky ways to do it otherwise, but I don’t know how off the top of my head. Maybe grab batches of 1,000 registrations and work through them that way?
- Super-duper single-point-of-failure. You lose the centralized registration server, game over. Maybe you fall back to the ‘old’ way? Or maybe you’re just down, and that’s it.
- I worry about what might happen if you start getting to 10,000,000 players all online at the same time. Being unable to burn through those folks fast enough could become a problem. Maybe once a region gets more than 1,000 users on it, you let those players just hit their local region directly? I don’t know.
- The centralized registration system has one simple function: match players and bounce them over to their regions. So it has exactly one job, and I can’t imagine a single ‘matching’ taking more than a few hundred milliseconds or so.
- In terms of scaling and stuff, I would say that you should add capacity to a region when there are no free game servers (or ‘game slots’) available. If new servers take a while to spin up, I’d be more liberal than that, maybe using a rule of “if there are fewer than ‘n’ slots available, then add a server” (there’s a tiny sketch of that after this list).
- You can go completely nuts spinning up data centers. Until you get into the, like, thousands of them, this algorithm ought to keep working pretty well.
- When there are only ‘n’ people online, total – and they’re from all around the globe – they may end up having a bad time; high latency. That is life, too bad.
- There may be a point where you say, “Sorry, person, but you’re fucked. There’s no one close enough to you to play a game.” That would be a terrible result, but it could happen. It depends on where the barrier to ‘too many milliseconds’ is.
- I’ve thought about ways to globalize the registration system with some kind of distributed database or whatever. I don’t think the costs are worth it, but maybe. I don’t know.
- The registration data is pretty dang ephemeral. You could maybe do this with something like Redis (there’s a sketch of that after this list too), though some of the queries we’d be talking about doing might not work there.
- I’d probably have it set up so that if the centralized service blows up and loses all its data, the various game clients will all try to reconnect. Especially if the registration service uses an ephemeral data store, this could happen.
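On the scaling rule from a few bullets up: the check itself is about as simple as it sounds. In this sketch, `free_slots()` and `provision_server()` are hypothetical stand-ins for whatever the real orchestration layer exposes, and the threshold is pulled out of thin air.

```python
# Hypothetical stand-ins for the real orchestration layer's API.
def free_slots(region: str) -> int:
    return 0  # stub: how many open game slots the region currently has


def provision_server(region: str) -> None:
    print(f"spinning up another game server in {region}")  # stub


MIN_FREE_SLOTS = 4  # arbitrary cushion, since new servers take a while to spin up


def maybe_scale_up(region: str) -> None:
    """Add a game server when a region's free-slot cushion gets thin."""
    if free_slots(region) < MIN_FREE_SLOTS:
        provision_server(region)
```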
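And on the Redis idea: the ephemeral registration data maps pretty naturally onto a sorted set scored by arrival time, plus a hash per player for the details. This assumes the redis-py client and skips the hard part (the “current first choice per data center” queries), so take it as a sketch of the data layout, nothing more.

```python
import json
import time

import redis  # assumes the redis-py client; purely illustrative

r = redis.Redis(decode_responses=True)


def register(player_id: str, version: str, latencies: dict[str, int]) -> None:
    """Store one registration: a sorted set keeps time order, a hash keeps the details."""
    now_ms = int(time.time() * 1000)
    r.zadd("registrations", {player_id: now_ms})  # score = arrival time
    r.hset(f"reg:{player_id}", mapping={
        "version": version,
        "latencies": json.dumps(latencies),  # ms of delay to each data center
    })


def oldest_batch(n: int = 1000) -> list[str]:
    """Pull the n oldest registrations for the matcher to chew through."""
    return r.zrange("registrations", 0, n - 1)
```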
Of course, there are so many ‘gotchas’ and caveats and potential failure modes, and all kinds of problems with networking and latency and who-knows-what. I don’t know whether this system is harder to implement than what exists today once you account for all of those, or not.
And I’m just some guy, not a brilliant algorithms person or some distributed programming guru. This could all be a horrible, disastrous mistake of a system. I doubt most armchair quarterbacks actually call spectacular plays when they’re watching their various football games. (Look, sports reference! Yay!)