How to make a multiplayer FPS that isn’t Battlefield or Call of Duty

So those guys (BF, CoD, maybe even Overwatch) are super-huge. They don’t really have problems with scaling. Or, maybe, in some cases, they do?

Here’s something I saw while playing Titanfall 2. First off, and I think this is a significant problem, they have a TON of datacenters, like around 8 or 10 within 100ms of my location. So I’m on at, like, 1AM Pacific. There are a couple hundred people active on my closest-location data center (Salt Lake City). There are a couple hundred in Oregon (GCE1?). A couple hundred more in Oregon-2 (GCE2). But we’re having matching problems, and it’s taking longer than it should to start games, and I keep seeing the same people. Logically, I can see that during peak times, with a crazy-high number of players logged-in, you want everyone to be as close as they can be to their own data center, and you want as many damned data centers as you can throw money at to get. But when not? We’re splitting too many players across too many data centers, all for the sake of the difference between 45ms and 65ms. As it got later and later (I am a bit of a night owl) I started to check out other servers, farther away. In many cases, the player counts were as low as zero. Ugh. There’s a real concern that you might have some people logging in, seeing just a few users, and giving up. That’s enormously dangerous.

This is a really hard problem. Crazy hard. There are a lot of different ways to approach it, and as I was thinking about it, I thought I might’ve come up with a decent algorithm that might work. This is more of a thought experiment, me armchair-quarterbacking a problem that I’m sure the actual devs are already hard-at-work fixing. But anyways, here’s my idea:

  • First off, instead of connecting to your closest data-center, you instead connect to one Grand Unified Centralized registration system. First come, first served. And, yes, if you live 500ms away, your registration will land 500ms later than someone else’s. Too bad, that’s life. A “registration” consists of just an IP/serial/whatever, a version number, and the number of milliseconds of delay to every datacenter that exists.
  • Now the server has a list of registrations, in time order. First, you grab the oldest registration. (Note the time, for keeping track of worst-case registration times.) Grab the lowest-latency datacenter from that registration’s list of datacenters. Walk through the registrations in time order, trying to find (however many users you need for a game) users for whom that is also their closest datacenter.
    • Did that work? Then great. Shunt those ‘N’ players off to a game hosted in that data center. Then repeat with the next oldest registration.
    • Let’s say that didn’t work. Now, things get interesting. Of all the registrations you have, find the first-choice datacenter that the fewest registrations picked. This, by the way, is the same method as Instant Runoff Voting. For anyone who had chosen it as their first choice, bump them to their second choice. That may include our candidate registration itself.
    • Now, repeat the attempt to match up our candidate. Keep eliminating the least-popular data center until you can find a match. (A code sketch of this loop follows below.)
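
Here’s that loop sketched out in JavaScript. Everything here is made up by me for illustration – the shape of a ‘registration’, the 12-players-per-game number, the function names – and it assumes the registrations arrive pre-sorted, oldest first:

// Hypothetical sketch of the matching loop described above.
// A 'registration' is { id: ..., latencies: { datacenterName: ms, ... } },
// and 'registrations' is assumed to already be sorted oldest-first.

var PLAYERS_PER_GAME = 12; // made-up number of players a game needs

function closestDatacenter(reg, eliminated) {
  // The lowest-latency datacenter that hasn't been eliminated yet
  return Object.keys(reg.latencies)
    .filter(function (dc) { return !eliminated.has(dc); })
    .sort(function (a, b) { return reg.latencies[a] - reg.latencies[b]; })[0];
}

function matchOldest(registrations) {
  var eliminated = new Set();
  while (true) {
    var candidate = registrations[0];
    var target = closestDatacenter(candidate, eliminated);
    if (!target) {
      return null; // the candidate has run out of datacenters; no match possible
    }

    // Walk the queue in time order, collecting everyone whose (remaining)
    // first choice is the same datacenter.
    var party = registrations.filter(function (reg) {
      return closestDatacenter(reg, eliminated) === target;
    }).slice(0, PLAYERS_PER_GAME);

    if (party.length === PLAYERS_PER_GAME) {
      // The caller would shunt these players off to a game in 'target',
      // remove them from the queue, and call matchOldest() again.
      return { datacenter: target, players: party };
    }

    // Not enough players: eliminate the least-popular first choice
    // (the Instant Runoff Voting step), then try again.
    var counts = {};
    registrations.forEach(function (reg) {
      var dc = closestDatacenter(reg, eliminated);
      if (dc) { counts[dc] = (counts[dc] || 0) + 1; }
    });
    var leastPopular = Object.keys(counts).sort(function (a, b) {
      return counts[a] - counts[b];
    })[0];
    eliminated.add(leastPopular);
  }
}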

So that’s it. Here are some of the practical advantages and disadvantages of this system, and some weird side-effects:

  • If there’s, like, a 10 minute wait (which there oughtn’t ever to be), some people could beat that time by being at the end of the queue, but being in a region that needs players. I think that’s OK.
  • I _think_ this might be hard to parallelize. You’d have high contention for the oldest items in the queue; you’d have to have some kind of locking mechanism, or something. This algorithm works best if run in series. Maybe there are some sneaky ways to do it otherwise, but I don’t know them off the top of my head. Maybe grab batches of 1000 registrations and work through them that way?
  • Super-duper single-point-of-failure. You lose the centralized registration server, game over. Maybe you fall back to the ‘old’ way? Or maybe you’re just down, and that’s it.
  • I worry what might happen if you start getting to 10,000,000 players all online at the same time? Being unable to burn through those folks fast enough might become a problem. Maybe once your region gets >1000 users on it, you let them just hit their local region? I don’t know?
  • The centralized registration system does have one simple function: match players and bounce them over to their regions. So it has exactly-one job, and I can’t imagine a single ‘matching’ taking more than a few hundred milliseconds or so?
  • In terms of scaling and stuff, I would say that you should add capacity to a region when there are no free game servers available (or ‘game slots’). If new servers take a while to spin up, I’d be more liberal than that (maybe using an algorithm of “if there are less than ‘n’ slots available, then add a server”).
  • You can go completely nuts spinning up data centers. Until you get into the, like 1000’s or so, this algorithm ought to keep working pretty well.
  • When there are only ‘n’ people online, total – and they’re from all around the globe – they may end up having a bad time; high latency. That is life, too bad.
  • There may be a point where you say, “Sorry, person, but you’re fucked. There’s no one close enough to you to play a game.” That would be a terrible result, but it could happen. It depends on where the barrier to ‘too many milliseconds’ is.
  • I’ve thought about ways to globalize the registration system, with some kinds of distributed databases or whatever. I don’t think the costs are worth it, but, maybe, you could do some kind of distributed database-ey thing. I don’t know.
  • The registration data is pretty dang ephemeral. You could maybe do this with something like Redis, though some of the queries we’d be talking about doing might not work there.
  • I’d probably have it set up so that if the centralized service blows up and loses all its data, the various game clients will all try and reconnect. Especially if the registration service uses an ephemeral data-store, this could happen.

Of course, there are so many ‘gotchas’ and caveats and potential failure modes, and all kinds of problems with networking and latency and who-knows-what. I don’t know how much harder those caveats would make this system to implement.

And I’m just some guy, not a brilliant algorithms person or some distributed programming guru. This could all be a horrible, disastrous mistake of a system. I doubt most armchair quarterbacks actually call spectacular plays when they’re watching their various football games. (Look, sports reference! Yay!)

JavaScript: Callbacks, EventEmitters, and Promises – which one to use?!?

Short version

If you have something that’s simple and always synchronous, don’t use any of them. Just write a dumb function.

If you have a function that’s simple and only needs one asynchronous response – and there are no other potential responses – then a callback is fine.

If you have some kind of object that could have several different potential asynchronous responses – at various points in its lifecycle – and you might want to listen to none, one, or several of them? Then use EventEmitters.

And finally, use Promises when:

  • You have a collection of asynchronous functions, and you need to respond only when all of them have returned, or when any one of them has returned.
  • You’re doing mostly ‘imperative’ functions and don’t need to pass a lot of values around; you just need to chain together some callbacks in sequence.
  • You have some functions that might be synchronous, and some that might not be, and you’re not sure which until runtime.
  • You have a collection of asynchronous events that are all firing, but the order that they must complete in is dependent upon some value determined only at runtime.
  • (weaker argument) You are falling victim to callback-hell, and your code is steadily creeping rightward.

Long version

Functions


function foo(param1,param2,param3) {
  return "something";
}

//usage:
var a=foo(1,2,3);

If you can do it this way your life will be better. Do it this way if at all possible.

Simple Callbacks


function foo(param1,param2,param3,callback) {
  process.nextTick(function () {
    callback("something");
  });
}

//usage:
foo(1,2,3,function (result) {
  console.warn("Yay, we got result! "+result);
});

If you find you’re passing “callback1, callback2, callback3”, definitely don’t do this. But for small, simple asynchronous functions, with not much else going on, this is still fine. Still pretty easy to reason about. As functions grow larger, and nested callbacks grow deeper, it gets harder and harder to reason about, though. The invocations of your little function probably ought not to be more than just a few lines; if they are, you should consider the next option…

EventEmitters

I think 95% of the EventEmitters I create end up being ‘classes’ that extend the EventEmitter class, and I think that’s probably a good way to do it.


const EventEmitter = require('events').EventEmitter;

// (a made-up example class, to show the shape of the thing)
class Task extends EventEmitter {
  run() {
    var success = false;
    this.emit("begin","something");
    /* ..... do the actual work here, setting `success` ..... */
    if (success) {
      this.emit("success","something");
    } else {
      this.emit("failure","something");
    }
    this.emit("complete","something",success === true);
  }
}

What’s great about this model is that someone who’s consuming this object might only care about one particular event – in which case they can listen for just that one. I believe it’s ok, and actually good, to emit liberally, even events that are similar but not the same (in my example “success”/”failure” as well as “complete”).

Another nice side-effect is that all of your various listen events (.on(foo)) help document what the callback is actually for. E.g. –


on("complete",function (param) {
  /* see? Now we know this event handler fires when things are complete! */
});

If you’re not careful, you can absolutely slide into callback-hell here. But this is my personal favorite pattern to use. It’s pretty extensible.

Never do synchronous callbacks; ever. If you want to do something ‘immediate’ at least wrap it in a process.nextTick(function () {/* blah */}); block; that’ll effectively be immediate but allow for someone to use it in the way most EventEmitters are used.

Never throw errors; just emit “error” instead.

Promises

These are massively over-hyped as the solution to everything. While they are actually very, very cool, they definitely have some real drawbacks:

  • They can get hard to debug
  • They can be confusing
  • Missing something like a return – which is super-easy to do – can silently break your code instead of producing any kind of error.
  • Propagating data forward from previously-resolved promises into later promises looks and acts weird.
  • You lose a lot of the benefits of ‘closing-over’ variables

But, when used properly, they can turn something nasty like this:


foo.on("bar",function (baz) {
  bif.on("blorgh",function (bling) {
    bloop.on("gloob",function (fweep) {
      /* .... */
    });
  });
});

Into something much prettier like this:


foo.bar().then(function (baz) {
  return "thing";
}).then(function(bling) {
  return "other_thing";
}).then(function (fweep) {
  return "last_thing";
});

Which, especially if you end up with a super-long list, can be helpful.

You can also use .catch() to grab any error in your list of actions – and that can be enormously useful.

Also, if you have an array of promises, you can do something like –


Promise.all(my_array_of_promises).then(function (results) {
  /* do something */
});

Which can be very, very handy.

Where it starts to get ugly is when, in that foo.bar()… example I gave above, you need to treat the error condition for each of those steps slightly differently. You can throw various .catch statements in after each .then statement, but I can imagine trying to visually read through that being a nightmare.
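
To make that concrete, here’s roughly what I mean – re-using the made-up foo.bar() chain from above – with a .catch() wedged in after each step. It works, but you can see how it starts to get noisy:

foo.bar().catch(function (err) {
  /* handle a failure from bar() specifically */
  throw err; // re-throw if the rest of the chain shouldn't run
}).then(function (baz) {
  return "thing";
}).catch(function (err) {
  /* handle a failure from that step specifically */
}).then(function (bling) {
  return "other_thing";
});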

The other huge thing here is that some promises can be fully synchronous – e.g. Promise.resolve(7) – that’s a promise that will resolve to the number 7. And some promises (well, probably most of them) are asynchronous. This is great, and the ability to unify these two modes together can be very helpful.

So, absolutely use them when they make sense. But my current thinking (which might change) is that you should use the simplest asynchronous mechanism that expresses what you need, without adding complexity. Step up the complexities of your tech as you need to, but not before.

Use the right tool for the job.

Anarchism

So there are a few people I’ve recently met who are anarchists, and I’ve told them all that I disagree with them. But I wanted to lay down my explanation as to why.

Let’s not talk about the moral underpinnings – because the morals behind any socio-political-economic system are always super-duper good and just. (e.g. socialism’s “From each according to ability, to each according to need”). But the devil’s always in the details. So let’s get into some details.

There are examples in actual history we can look at. The best modern example is probably Somalia, which basically has had no functioning central government for decades now. It, by most accounts, is not a very nice place. It is ruled by warlords. It is crippled by poverty and food shortages. If anarchy were so great, why isn’t Somalia a great place for anarchists to live?

We know the way that power tends to aggregate. See organized crime, or large multinational corporations (or perhaps I repeat myself – ZING). Though a great Libertarian/Anarchist argument against the organized crime part is that organized crime got the biggest boost in power during Prohibition. And the Mexican drug cartels that are currently dominating Mexico are being substantially weakened by the legalization of marijuana here in the US. And it’s a very good point, but the truth is that organized crime existed before and after prohibition, and will still exist even after we legalize pot. And large corporations existed before anti-trust legislation came about in the late 1800’s, early 1900’s, and afterwards. (Again, another Anarchist argument might be that large corporations would not have as much power without some kind of government intervention – if so I’d love to hear more about that; I think it was true of the Dutch East India Company, but more examples would be even better)

Let’s talk about a small town operating under anarchism. We’ll completely ignore the problems inherent in a large city – like my own New York – and just start focusing on my example small town. It’s got 100 residents, let’s say. Large cities will be probably even more problematic but I think I can explain my issues with my small town.

Problem #0 – I can walk down the street and just shoot someone in the face. There is no ‘legal’ ramification to that. If the person I do that to is not well-regarded, people might even cheer me on! Of course, if I do that to some beloved town local, I would assume that someone might come back and shoot me in the face. And I don’t want that. Of course the trick is to kill someone when no one else is looking.

Problem #1 – just about everyone has to own a gun. Some people might not, but in general, you just need to own one, primarily as a deterrent. With no formal social safety net, (plenty of informal ones, mind you! But nothing that’s guaranteed to catch people who are down on their luck) – there will be very desperate, very poor people who need things; or very depraved and lawless people who will take what they want. Some people may have an issue with having to own a firearm – and a system that practically forces them to do so seems unfair to those people. So a system that does not approve of force is now inherently, due to its structure, forcing people’s behavior.

So eventually due to the Organized Crime/Large Corporation problem, you will have to step up from the everyone-is-armed-at-home problem, and in to the Defense problem. E.g. instead of one down-on-their-luck person trying to take your possessions, killing you in the process – you now have the potential for a gang to come roaming through your town and ransack the place. You need some kind of defense; an army. So you hire one – and this is Problem #2. Well, it’s problem 2, 3, 4, 5 and 6. First off, you have to find an army that’s willing to defend your town – and we have a perfectly free market, so there will be a lot of competition, right? Maybe. In fact, your roving-gangs-of-ransackers are just as likely to be the ‘army’ that you’d hire. Or be somehow in cahoots. So how do we pay these people? We have 100 people in town, and we need to have them all band together to pay the army. But Old Man Caruthers doesn’t want to pay! Well, we can’t force him – Non-aggression principle. Now we have problem #3. So then we have to increase the price that everyone else pays to cover his share – and now all sorts of other people are going to start balking at the prices. So eventually you have to say, “either you pay, or you can get out of town.” That sounds like force. Or maybe you make a deal with the army – mark the houses that have paid, and they get protection, and the ones that don’t, don’t. Sounds like a mess. And things like securing the town’s borders won’t work in that way.

And how did we manage to select which army we got to defend us? A vote? A vote where only consensus is allowed? At 100 people consensus will be hard. At 1000 it will start to become impossible. As soon as we start having a ‘majority’ – then we’re coercing people, and breaking our own rules. Problem #4. (What about payola; the guys in Army Group #1 slipping $100 each to the people who are ‘on the fence’ to secure their vote?)

Problem #5 – who is to keep our ‘army’ in check? Let’s say I’ve got a roving band of raiders. Why don’t I meet up with the person in charge of the army-for-hire, we sit down and have a nice lunch, and I offer them a huge cash payout to stand down on such-and-such a day? Well, certainly, that would erode the trust one might have in such an army – if word ever got out. But why would it? My raiders would just go and kill everyone.

Problem #6 – how do you fire your army? Ideally, with another army, and the first ones just leave. But what if you decide you just don’t want an army at all, then what? And what if “your” army decides they don’t want to be fired?

And we haven’t even gotten into policing yet – which would probably end up being problem numbers 7 through 15…

And we haven’t even figured out what currency any of this stuff is bought or sold in. More problems.


So I think the real, fundamental economic problem here is this:

A market with no regulation at all is not at all free.

Not everyone has perfect information to make perfect economic choices. Certain goods and services exist in certain locations, and cannot be quickly or cheaply transported to wherever they are needed. Monopolies, cartels, and collusion happen and drive prices up. Gluts happen and drive prices down. There is inherent friction in every economic transaction.

And the political problem is this:

The Tragedy of the commons.

Without the ability to coerce people, and without the ability to form majority rule instead of consensus, you aren’t going to be able to do anything as a society. “The Commons” doesn’t have to be a physical thing; like a stream or a pond or grazing grass – it can be like our ‘how do we pay for the army’ problem above. Private property is not a solution. Private ownership of a common good like the water supply runs you into problems with inelastic demand – everyone needs water, so why not jack up the price for access to it? Still more thorny problems.

The Social problem is this:

This system completely and totally shafts the poor, and rewards the rich.

Can’t afford to pay for the army? Get out of town – or get treated however Mr. Caruthers got treated, above. Down on your luck? Hope for some handouts from private individuals. Still starving? Die. How much does this society help lift up the poor? How much does this society prevent the rich from just becoming more and more massively super-rich generation by generation; just sitting idle, reaping the rewards of actions done generations ago, or reaping the rewards of simple dumb luck?


PS:

Some interesting pro-anarchy thoughts I had while writing this up: What if you were to view this government as exactly the final result of having one of the armies in problem #2 defending you? E.g. the army won’t let you choose another army, it forces you to pay it. Though, to be nice, they charge a lower percentage of income to poor people and a higher one to rich people. What if that is, in effect, the government we have now, and modern taxation?

The other interesting one is how mafias and black markets tend to disappear when everything is permitted. Organized crime was at its most powerful here in the US in the middle of Prohibition. Right now, it deals in drugs and other ‘sinful’ things. If all of those things were permitted, would organizations such as these disappear? What purpose would they serve? <CAVEAT – ARGUMENTUM AD MOVIE-UM> – in The Godfather Part II, we see a little glimpse of the early Sicilian Mob – and, while they were certainly murderers, thieves, and extortionists, they were also community-builders, who helped their communities when the government would not. Maybe organizations/groups/towns/whatever might end up acting like that?

I was also going to use the metaphor of prison for what happens when you ‘have no rules’. But prison has tons of rules! Yes, but the guards are really just keeping “animals in cages” – and may not necessarily care what the “animals” do. So that might-makes-right, everyone-grouping-into-‘tribes’ environment might be what you end up with. But what if that’s what the US *is* – the ‘rules’ the government puts on us are the prison guard’s rules, and today’s society is the same as that prison – tribalism, might-makes-right, what-have-you? I think the metaphor breaks down, but I still think it’s interesting.

Some actual things you could do for gun control stuff

Most people would probably agree to some formulation of the following:

Someone ought to be able to own a firearm to protect their home and family. And all of these horrible shootings that keep happening are awful, and we should try to prevent them from happening. We can’t stop them all, but we can at least try to make it more difficult for them to happen.

And if you do agree with that, here’s my proposal:

Background-checks. Yes, even for private sales. In an era of $100 smart-phones, there’s a way to do it. When you sell a car, someone has to fill in or file a registration. It’s not unreasonable.

One-way database. In the same way you can have a ‘hash function’ which can map from source data to a hash value (but *not* backwards!), you should be able to map from serial numbers to people – but not the other way around. If you really need to see if someone has a firearm, you can get a warrant to search their house. But if a gun is used in a crime, and the serial number can be read off it, we need to be able to figure out who that gun belongs to. I would want to appoint some kind of privacy advocate to protect this data as well. The spectacle of cops running around after Hurricane Katrina confiscating civilians’ legal firearms is something that should be prevented from ever happening again, and made even more terribly illegal than it was already.

Providing a firearm to someone who then commits a crime means you are an accessory. And should be criminally charged. Improper storage or securing of one’s firearm(s), which are then used in a crime, means you are negligent, or possibly an accessory.

And yes, that means if you ‘lose’ or have your firearm stolen, you need to report it. And that means if you didn’t secure it you can be charged. And that means you need to check in on your firearm every, say, 6 months or so – no saying “Oh, I forgot I had it! Haven’t looked at where I keep it in a while…”

That also means when you’re doing tearful press conferences about how no one knew that your kid could go shoot up a school, it’s likely you’ll be wearing prison-orange. Because you probably weren’t properly storing your firearms. Because if you were, maybe your kid would’ve had a harder time shooting up that school?

As for the definitions of what these things are? (How would a ‘household firearm’ work? What does ‘properly stored’ entail? Etc.) I don’t know and I think that’s probably an important place for us to get to. But let’s start a conversation here.

What about firearm types? I don’t care about that. AR-15 or AK-47 or simple 9mm Glock. Honestly, that’s the wrong road to go down. The right road to go down is stopping people who shouldn’t be able to buy guns from buying or acquiring them somehow. And making gun-owners responsible for secure and safe storage of their firearms. That, honestly, is not unreasonable.

What I learned from the Defcon CTF

So if anyone follows me on Twitter, they might’ve caught that I tried the Defcon CTF challenge a week or so ago. I didn’t place on the finalist list; most of that stuff is waaaaaaay out of my league.

But the one category I thought was pretty interesting – and I ought to do well in – was the Web one. So I tried to get at least one point in that category, so I could prove to myself that I could do this type of stuff. I’m not a security guy; I’m a web developer.

The result? I got all five questions 🙂 The last one I got with just an hour to spare.

And it really got me to thinking – as a developer, ‘engineer’ or architect or whatever I am – about some of the security things that I haven’t really thought that deeply about before.

Here’s what I came up with:

#1) Don’t ever use dictionary passwords. Not even with your cl3v3r subst1tut10ns of punctuation or numbers for letters.

Why? Because I tried to brute force a password that was very strongly hashed (SHA-512). I was grinding through 3 and 4 character passwords with a custom-built script I put together. It had been running overnight and I got nothing from it.

But when the lovely and talented Snipeyhead pointed me over towards a password cracker tool, I decided to give that a shot.

The tool spat out the password I needed in probably about 60 seconds.

And the tool had a ruleset – built in to it – that allowed it to automatically test out numeric and punctuation substitutions. So your clever password that’s based on a dictionary word might get cracked – and maybe not with today’s ruleset, but definitely with tomorrow’s.

The password length is actually *less* of a big deal. Of course, if you try and brute-force a password (as I did), a longer one will take longer to force than a shorter one. But if your super duper long password is just a dictionary word – then, no, you’re still fucked.

If I were building something from scratch? I would definitely use a very strong hashing method (SHA-512? Bcrypt?) for password storage, but I would play around with different types of password requirements. If the user wants a super duper short password? Maybe it has to have lots of different types of characters. One that has just letters in it? Better be pretty damned long. Who knows? Maybe I’d just stick with what we’re doing now.
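
For what it’s worth, here’s a minimal sketch of what I mean by ‘strong’ password storage, using Node’s built-in crypto module. PBKDF2 is just one option (bcrypt would be fine too), and the iteration count is a number I made up – tune it for your own hardware:

var crypto = require('crypto');

function hashPassword(password) {
  var salt = crypto.randomBytes(16);
  // Deliberately slow: 100,000 rounds of PBKDF2-SHA512, to make brute-forcing painful
  var hash = crypto.pbkdf2Sync(password, salt, 100000, 64, 'sha512');
  return salt.toString('hex') + ':' + hash.toString('hex');
}

function verifyPassword(password, stored) {
  var parts = stored.split(':');
  var salt = Buffer.from(parts[0], 'hex');
  var hash = crypto.pbkdf2Sync(password, salt, 100000, 64, 'sha512');
  // Constant-time comparison, so the check itself doesn't leak anything
  return crypto.timingSafeEqual(hash, Buffer.from(parts[1], 'hex'));
}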

But regardless of that – if your password can be cracked with a dictionary, then you can’t use it. End of story.

(edit) And try not to expose usernames, maybe?

If you don’t know what username to dictionary-attack (or brute-force attack) – you don’t know what you’re going after. You can guess – but if you’re at least not exposing “valid username, but invalid password!” (and, holy crap, I hope you aren’t!) then you make their job just a liiiiittle bit harder. And that’s worth doing, if you can.

#2) Hashing (to prevent message alteration or tampering) and crypto (confidentiality) are completely orthogonal concepts.

If you want to hide the contents of the message, encrypt it. But someone clever can still mess with the contents a little bit. In fact, that’s just what I did in challenge number 5 🙂

If you want to ensure that a message is not altered in any way, present a hash of it (salted with a secret salty thing). This way, if you change one tiny bit anywhere in the message, the resulting hash will change substantially.

So if you need both? DO BOTH. The two functions have nothing to do with each other.
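
Here’s a quick sketch of ‘doing both’ in Node. The key handling is hand-waved (in real life those keys come out of a secret store, not random bytes generated inline), and the particular algorithms are just examples:

var crypto = require('crypto');

// Hypothetical keys - these would come from your secret store
var encKey = crypto.randomBytes(32);
var macKey = crypto.randomBytes(32);

function protectMessage(message) {
  // Confidentiality: encrypt the message
  var iv = crypto.randomBytes(16);
  var cipher = crypto.createCipheriv('aes-256-cbc', encKey, iv);
  var ciphertext = Buffer.concat([cipher.update(message, 'utf8'), cipher.final()]);

  // Integrity: a keyed hash (HMAC) over everything, so tampering is detectable
  var mac = crypto.createHmac('sha512', macKey)
    .update(Buffer.concat([iv, ciphertext]))
    .digest('hex');

  return { iv: iv.toString('hex'), ciphertext: ciphertext.toString('hex'), mac: mac };
}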

#3) Error Messages in production

This one is probably the most embarrassing for me. I have, on plenty of occasions, allowed apps to go into production with full error messaging intact.

This is usually because I write shitty software that breaks. So when it does, I like to be able to quickly see what happened. So my heart’s in the right place, even if my code isn’t.

But this is a ridiculously fucking terrible fucking idea.

More than half of the challenges I was going after started to give me hints on how to break in once I was able to fingerprint what type of app they were (this year? Lots of Sinatra). I feel bad for those guys – like they got ‘picked on’ – but I guess, considering some of their blasé attitudes about security (calling hashed stuff ‘encrypted’?), maybe it’s warranted. Maybe it’s a wakeup call. Or maybe it’s just because Sinatra is fun to write? Who knows.

But, seriously – trying to figure out what type of thing I was going after really took a lot of time away from trying to figure out how to break in. So the harder you can make those first couple of steps for an attacker (like me), the more likely it is that he’ll just go look at somebody else instead. And that’s worth doing, if you can.

Amazon cloud-init – customizing EBS-backed Amazon Linux AMI’s

EDIT – No, not even this works. I feel like I’m losing my mind.

EDIT 2 – Oh, apparently you *have* to specify the boot kernel. Have to. Can’t use “use default” as I have been for, like, ever. Ugh. Angry.

I just blew a horrible amount of time on this. I’ve burned many an AMI – based on ephemeral store and EBS-backed volumes. But trying to do it ‘right’ – with programmable private keys and whatnot – seemed to be out of my grasp, at least when using Amazon’s own Linux distro.

If you try to customize Amazon Linux you will find that some things that are normally done by cloud-init don’t seem to work on your image. Namely, setting ssh keys. It works fine when you first boot the pristine Amazon image, but when you try to burn your own it won’t seem to set the ssh keys properly.

To set them, make sure you blow out the contents of /var/lib/cloud/ – and both /root/.ssh/authorized_keys as well as /home/ec2-user/.ssh/authorized_keys. They’ll get reset on next boot.

This isn’t documented anywhere and I basically had to dick around with strace and flipping through all of the python code to figure out that there’s a semaphore file in /var/lib/cloud/sem that gets set and then the ssh-setting-script at boot will never run again. It makes me angry – but maybe that’s Amazon’s point; they don’t want you to customize their image so they can save on EBS volume space. I don’t know. Pisses me off and wastes my time for sure though.

You would think that at least when I try to run stuff by hand it would say “Oh, hey, there’s a semaphore file right here – make sure to yank it if you really want to run your scripts again.” Not this silent no-message bullshit.

ARGH.

-B.

Follow up on Amazon Elastic Load Balancers and multi-AZ configuration

I got a really good comment on my blog a day or so ago from a guy by the name of Mark Rose (that’s the only link I have for him, sorry!) He mentioned that AWS multi-AZ load-balancing happens via DNS – which intrigued me – so I thought I’d mess with my test load balancer and see.

He explained that each AZ gets its own DNS entry when you look up the load balancer – and that meshes exactly with what I’m getting. I do the DNS lookup for the LB, and get two IP addresses right now – and I’m assuming that each one corresponds to one of the AZs.

But Amazon does some interesting DNS stuff – for instance, if you look up one of your ‘public DNS names’ of your instances from the _outside_, you get the instance’s outside IP address. But if you look it up from the _inside_, you get the inside IP. I use this for configuration settings, when I want an instance to have a relatively-static internal IP. Instead of trying to pin down the IP, I set up an elastic IP for the instance, and use the permanent public DNS name for that IP as the _internal_ hostname for the instance. This way, if the instance reboots, I just have to make sure that the elastic IP address I’ve configured is still associated with it, and everything still works internally.

I assume that traffic to the inside IP address is faster than bouncing back outside to the public address, then going back inside. I definitely know that it is cheaper – you don’t pay for internal network traffic, only external.

So my question is – what does it look like when you try to resolve the load balancer’s DNS name from the _inside_ of Amazon AWS? Do you get the same outside IP addresses, or do you get internal ones instead? Since it seemed like AWS traffic ‘tends’ to be directed back to the same AZ it originated from, I expect to get different answers.

So here’s what I did. I set up an ELB spanning two AZs – us-east-1a and us-east-1e – with an instance in each, and installed and launched Apache on both. As soon as the ELB registered the instances as ‘up’, I did a DNS lookup from the outside to see what it resolved to.

I got exactly two addresses – I’m assuming one points to one AZ, one to another.

Then, I tried to resolve the same ELB DNS name from the _inside_. Weirdly enough, I *still* get both (outside) IP addresses! I didn’t expect that.
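
In case anyone wants to repeat the experiment, the lookup itself is trivial – here it is in Node, with a made-up ELB hostname. Run it once from outside AWS and once from an instance on the inside, and compare what you get:

var dns = require('dns');

// Hypothetical ELB hostname - substitute your own
dns.resolve4('my-test-elb-1234567890.us-east-1.elb.amazonaws.com', function (err, addresses) {
  if (err) { throw err; }
  console.log(addresses); // e.g. one A record per AZ the ELB spans
});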

So now, I wonder, is there anything to ‘bias’ the traffic to one AZ or another? Or is it just the vagaries of DNS round-robin that have been affecting me?

I changed the home pages on both apaches to report which AZ they’re in. I then browsed-to, and curl’ed, the ELB name. The results were surprisingly ‘sticky’ – on the browser, I kept seeming to hit ‘1-a’. On curl, I seemed to keep hitting 1-e.

What if I specifically direct my connections to one IP or another? Let’s see.

Since the ELB IP addresses seem to correspond, one-to-one, with AZ’s, I thought I would be able to curl each one. I did, and consistently got the same AZ for each IP. One seems to be strongly associated to 1-a, and one to 1-e.

So it seems the coarseness of the multi-AZ ELB load-balancing can be fully explained by the coarseness of using round-robin DNS to implement it.

Something else to note – it seems like the DNS entries *only* have 60 second lifetimes. With well-behaved DNS clients (of which I will bet there are depressingly few), you should at *least* be able to end up changing the AZ you’re banging into every 60 seconds. However, in my testing – brief though it may be – it seems to stay pretty ‘sticky’.

So what does this mean? I dunno – I feel like I want to do multi-AZ setups in AWS even less now. Round-robin DNS is old-school, but at large enough scales it does generally work. Though I wonder whether heavily-hit web-services APIs like the ones my company provides fit well enough into that framework? I’m not sure.

Session stickiness and multi-AZ setups

Another question – how does this affect ‘stickiness’? You can set an LB to provide sticky session support – but with this IP address shenaniganry, how can that possibly work?

Well, weirdly enough – it actually does.

I set an Amazon load-balancer-provided stickiness policy on my test LB. I curl’ed the name, and got the cookie. I then curl’ed the individual IP addresses for the load balancer, with that cookie set. And now, no matter which IP I hit, I keep getting the results from the same back-end server. So session-stickiness *does* break-through the load-balancer’s IP-address-to-AZ associations, to always keep hitting the same back-end server.

I wonder, what does the AWS-provided cookie actually look like? It seems like Hex, so let me see if I can decipher it.

Since I don’t know if anything scary is encoded therein, I won’t post my cookie here, but when I tried to decipher it, I just got a bunch of binary gobbledygook. It stayed consistent from request to request (maybe modulo time, not sure), so it probably just encodes an AZ and/or an instance ID (and maybe a timestamp).

Implications

So since AWS exposes some of the internal implementation details of your load-balancer setups, what does this mean? It certainly does imply that you can lower the bar for DoS’ing a website that’s ELB-hosted by just picking one of the ELB IP’s and slamming it. For a two-AZ example – as opposed to having to generate 2x traffic to overwhelm a site, you can just pick one IP and hit that one with 1x and have the site go half-down from it.

Considering the issues I’ve actually run into from having autoscaling groups that won’t scale because only one AZ is overwhelmed, I wonder if it makes sense to only have autoscaling groups that span a single AZ?

And it also seems to imply that you can DoS an individual server by hitting it with a session-cookie that requires it to always hit the same back-end server. So perhaps, for high-performance environments, it makes sense to stick with shared-nothing architectures and *not* do any kind of session-based stickiness?

RightScale-to-Native Amazon Web Services (AWS) Name Synchronizer

At my company, we use RightScale for a lot of our Amazon Web Services management. It’s a pretty neat service – sort of “training wheels” for the cloud. Still provides us a lot of value.

But sometimes I like to log directly into the AWS console. Especially to find out when Amazon has scheduled reboots of our servers. Before I wrote this script, I would log in to find a whole bunch of instances running with no names. Then I’d have to go look them up in RightScale. Why can’t RightScale just name your Amazon instances with the right names?!

Well, I finally took matters into my own hands and built the following script. It walks through all of your RightScale servers, and finds the associated Amazon instances and sets their name attributes to the RightScale “nicknames.”

And I got permission from my job to make it available to the public – so here it is:

https://github.com/uberbrady/RightScaleNameSynchronizer

Yes, it is not the prettiest code I have ever written, but it does the trick. If someone wants to make it prettier I am definitely open to pull requests.

One thing I have noticed is that when you ‘relaunch’ a RightScale instance, the new instance will come up without an AWS name. If you re-run the script that will fix that. Also, if you use any RightScale arrays, the same thing can happen during scale-up/scale-down events.

ucspi-tcp and stupid errno.h (CentOS and ucspi-tcp)

I keep running into this and doing my standard google-up-the-answer-routine didn’t seem to be working.

In short, ucspi-tcp doesn’t compile on CentOS boxes (or RedHat boxes). Cuz DJB doesn’t “believe in” RedHat’s “you must have an errno.h” thing. Hey, I love DJB, and his software, but I also think he’s impractical and a nutjob sometimes. This would be one of those times.

Lots of folks had patch-related ways of fixing the problem, I thought those seemed rather laborious. I just stole The Internet’s method for another DJB package.

Just append -include /usr/include/errno.h at the end of the first line of conf-cc so it looks like this:

gcc -O2 -include /usr/include/errno.h

This will be used to compile .c files.

Boom, everything works now.

Even Mo’ Math…

So Beckley got a hold of the MetroCard Math site and built on top of David’s fantastic work to build even more prettiness, neat-workingness, and general niftitude into the site.

We also put in a thingee – well, by ‘we’ I mean ‘he’ – he put in a thingee that lets you see how the new price changes will affect you. For me, I definitely will be sticking with the pay-per-ride.

And another thing – I actually tested the new (divisible-by-a-nickel) magic number, and it *does* work. My MetroCard has an exactly even number of rides on it. Cool. Now I just have to do something with all these MetroCards that have 10 or 20 cents on them – perhaps a new part of the site that lets you put in how much money is on your cards, and then it tells you how much more to put on to get it ‘even’? Not a bad idea…

Gory Details: so, talk to any computer sciencey person and they will always tell you that Floating Point Math is Hard. I have only rarely run into this, but the rounding algorithms are very specific when you buy stuff, and if you’re off by a penny, then, well, you’re off by a penny, and things stop working. We found a couple of minor (off-by-one) bugs here and there, and every time it seems like I fixed one, the rest of the results would start to go haywire. The real problem is that I am trying to ‘move’ the rounding around the formula:

round_for_money($x * 1.15) = n * $2.25

Now solve for ‘x’, and let ‘n’ be any integer – well, that pesky ’round()’ is in the way, and if you just try to move it to the other side, or round at some random and/or inopportune time, then when you get back to the original equation, sometimes the numbers don’t work out anymore. It sucks.

So I racked and racked my brain trying to figure out a way to do my simple solve-for-x routine. I really just want to try different integers for ‘n’ until I find an answer that’s “acceptable.” But that doesn’t work. At all. Or at least, I don’t know what mathematical operation I can do to move that round() function off the left side so I can try to have a formula that points to ‘x’.

What did I do finally? I gave up. I left the formula as it is above, and just run ‘x’ from 0 to “a lot” (a thousand bucks or a hundred bucks I think?). The answer I get is going to be completely accurate, but it wastes computing power. Well, too bad, your browser has to do a little bit of multiplication in a loop. My condolences. But! The result is, I’m pretty convinced my answers are to-the-penny accurate now. We’ll see when the big price change kicks in.
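
For the curious, here’s roughly what that brute-force loop looks like. The function names and the $100 cap are made up, I’m assuming round_for_money just rounds to the nearest penny, and the 15% bonus and $2.25 fare come straight from the formula above:

// Round to the nearest penny (one guess at what round_for_money does)
function round_for_money(amount) {
  return Math.round(amount * 100) / 100;
}

// Try every refill amount x, in penny increments, up to maxDollars, and keep
// the ones where the bonus-included credit is an exact multiple of the $2.25 fare.
function find_even_refills(maxDollars) {
  var results = [];
  for (var cents = 1; cents <= maxDollars * 100; cents++) {
    var x = cents / 100;
    var credit = round_for_money(x * 1.15);
    if (Math.round(credit * 100) % 225 === 0) {
      results.push(x);
    }
  }
  return results;
}

//usage:
console.log(find_even_refills(100));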

Thanks again to David Dominguez for the initial switch to jQuery-powered MetroCard Math, and thanks to Beckley for the full re-skinning he pulled off.