Follow up on Amazon Elastic Load Balancers and multi-AZ configuration

I got a really good comment on my blog a day or so ago from a guy by the name of Mark Rose (that’s the only link I have for him, sorry!). He mentioned that AWS multi-AZ load-balancing happens via DNS – which intrigued me – so I thought I’d mess with my test load balancer and see.

He explained that each AZ gets its own DNS entry when you look up the load balancer – and that meshes exactly with what I’m getting. I do the DNS lookup for the LB, and get two IP addresses right now – and I’m assuming that each one corresponds to one of the two AZs.

But Amazon does some interesting DNS stuff – for instance, if you look up one of your instances’ ‘public DNS names’ from the _outside_, you get the instance’s outside IP address. But if you look it up from the _inside_, you get the inside IP. I use this for configuration settings when I want an instance to have a relatively static internal IP. Instead of trying to pin down the IP, I set up an elastic IP for the instance, and use the permanent public DNS name for that elastic IP as the _internal_ hostname for the instance. This way, if the instance reboots, I just have to make sure that the elastic IP address I’ve configured is still associated with it, and everything still works internally.
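If you want to see that split-horizon behavior for yourself, here’s a minimal Python sketch. The hostname below is a made-up placeholder, so substitute the public DNS name of one of your own elastic IPs:

```python
import socket

# Hypothetical public DNS name for an elastic IP; substitute your own.
PUBLIC_NAME = "ec2-203-0-113-10.compute-1.amazonaws.com"

# From outside AWS this resolves to the public (elastic) IP; run the same
# lookup on an instance inside EC2 and Amazon's resolvers hand back the
# instance's private 10.x.x.x address instead.
print(socket.gethostbyname(PUBLIC_NAME))
```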

I assume that traffic to the inside IP address is faster than bouncing back outside to the public address, then going back inside. I definitely know that it is cheaper – you don’t pay for internal network traffic, only external.

So my question is – what does it look like when you try to resolve the load balancer’s DNS name from the _inside_ of Amazon AWS? Do you get the same outside IP addresses, or do you get internal ones instead? Since it seemed like AWS traffic ‘tends’ to be directed back to the same AZ it originated from, I expect to get different answers.

So here’s what I did. I set up an ELB spanning two AZs – us-east-1a and us-east-1e – with an instance in each, and installed and started Apache on both. As soon as the ELB registered the instances as ‘up’, I did a DNS lookup from the outside to see what the LB’s name resolved to.

I got exactly two addresses – I’m assuming one points to one AZ, one to another.

Then, I tried to resolve the same ELB DNS name from the _inside_. Weirdly enough, I *still* got both (outside) IP addresses! I didn’t expect that.
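If you want to reproduce the lookup, here’s a quick Python sketch (the ELB name is a placeholder for your own). Run it from your laptop, then from an instance inside EC2; I got the same two public addresses both ways:

```python
import socket

# Hypothetical name; substitute your own ELB's DNS name.
ELB_NAME = "my-test-lb-1234567890.us-east-1.elb.amazonaws.com"

# getaddrinfo returns one tuple per A record; collect the distinct IPs.
infos = socket.getaddrinfo(ELB_NAME, 80, socket.AF_INET, socket.SOCK_STREAM)
print(sorted({info[4][0] for info in infos}))
```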

So now, I wonder, is there anything to ‘bias’ the traffic to one AZ or another? Or is it just the vagaries of DNS round-robin that have been affecting me?

I changed the home pages on both Apache instances to report which AZ they’re in. I then browsed to, and curl’ed, the ELB name. The results were surprisingly ‘sticky’ – in the browser, I kept seeming to hit 1a; with curl, I seemed to keep hitting 1e.
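For the curious: each instance can learn its own AZ from the instance metadata service, so the ‘report your AZ’ home page amounts to something like this, run once on each instance. (The Apache docroot path is an assumption; it varies by distro.)

```python
import urllib.request

# The EC2 instance metadata service lives at this well-known address.
URL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"
az = urllib.request.urlopen(URL, timeout=2).read().decode()

# Write the AZ out as Apache's home page (docroot path varies by distro).
with open("/var/www/html/index.html", "w") as f:
    f.write(az + "\n")
```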

What if I specifically direct my connections to one IP or another? Let’s see.

Since the ELB IP addresses seem to correspond, one-to-one, with AZs, I thought I would be able to curl each one. I did, and consistently got the same AZ for each IP. One seems to be strongly associated with 1a, and one with 1e.
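curl with an explicit IP is the quickest way to do this, but the same test in Python looks roughly like the following. Both the ELB name and the IPs are placeholders; use whatever your own lookup returned:

```python
import http.client

ELB_NAME = "my-test-lb-1234567890.us-east-1.elb.amazonaws.com"  # placeholder
ELB_IPS = ["203.0.113.20", "203.0.113.21"]  # the two addresses you looked up

for ip in ELB_IPS:
    # Connect to one specific ELB node, but present the ELB's hostname
    # so the request looks like a normal one.
    conn = http.client.HTTPConnection(ip, 80, timeout=5)
    conn.request("GET", "/", headers={"Host": ELB_NAME})
    print(ip, "->", conn.getresponse().read().decode().strip())
    conn.close()
```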

So it seems the coarseness of the multi-AZ ELB load-balancing can be fully explained by the coarseness of using round-robin DNS to implement it.

Something else to note – the DNS entries have *only* 60-second TTLs. With well-behaved DNS clients (of which I will bet there are depressingly few), the AZ you’re banging on should be able to change at least every 60 seconds. However, in my testing – brief though it may be – it stayed pretty ‘sticky’.
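The stdlib resolver won’t show you TTLs, but the dnspython package will. A quick check looks something like this (the ELB name is a placeholder again):

```python
import dns.resolver  # third-party: pip install dnspython

ELB_NAME = "my-test-lb-1234567890.us-east-1.elb.amazonaws.com"  # placeholder

# resolve() is the dnspython 2.x spelling; older versions call it query().
answer = dns.resolver.resolve(ELB_NAME, "A")
print("TTL:", answer.rrset.ttl)  # 60 on my test ELB
for record in answer:
    print(record.address)
```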

So what does this mean? I dunno – I feel like I want to do multi-AZ setups in AWS even less now. Round-robin DNS is old-school, but at large enough scales it does generally work. Though I wonder whether heavily-hit web-services APIs like the ones my company provides fit well enough into that framework. I’m not sure.

Session stickiness and multi-AZ setups

Another question – how does this affect ‘stickiness’? You can configure an ELB to provide sticky session support – but with these IP address shenanigans, how can that possibly work?

Well, weirdly enough – it actually does.

I set an Amazon load-balancer-provided stickiness policy on my test LB. I curl’ed the name, and got the cookie. I then curl’ed the individual IP addresses for the load balancer, with that cookie set. And now, no matter which IP I hit, I kept getting results from the same back-end server. So session stickiness *does* break through the load balancer’s IP-address-to-AZ associations, to always keep hitting the same back-end server.
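For anyone who wants to replay that test, here’s roughly what it looks like in Python. The name and IPs are placeholders; note that ELB-generated stickiness uses a cookie called AWSELB:

```python
import http.client

ELB_NAME = "my-test-lb-1234567890.us-east-1.elb.amazonaws.com"  # placeholder
ELB_IPS = ["203.0.113.20", "203.0.113.21"]  # placeholders

# First request: hit the ELB by name and capture the stickiness cookie.
# (With an ELB-generated policy, this is the AWSELB cookie.)
conn = http.client.HTTPConnection(ELB_NAME, 80, timeout=5)
conn.request("GET", "/")
resp = conn.getresponse()
cookie = resp.getheader("Set-Cookie").split(";")[0]  # e.g. "AWSELB=<hex>"
resp.read()
conn.close()

# Replay against each individual ELB IP with the cookie set; every
# response should now come from the same back-end instance.
for ip in ELB_IPS:
    conn = http.client.HTTPConnection(ip, 80, timeout=5)
    conn.request("GET", "/", headers={"Host": ELB_NAME, "Cookie": cookie})
    print(ip, "->", conn.getresponse().read().decode().strip())
    conn.close()
```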

I wonder, what does the AWS-provided cookie actually look like? It seems like hex, so let me see if I can decipher it.

Since I don’t know if anything scary is encoded therein, I won’t post my cookie here, but when I tried to decipher it, I just got a bunch of binary gobbledygook. It stayed consistent from request to request (maybe modulo time, not sure), so it probably just encodes an AZ and/or an instance ID (and maybe a timestamp).
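The deciphering attempt itself is a couple of lines of Python. The value below is made up, so paste in your own cookie value:

```python
# A made-up stand-in for the cookie value; paste your own AWSELB value here.
cookie_hex = "a1b2c3d4e5f60718293a4b5c6d7e8f90"

raw = bytes.fromhex(cookie_hex)
print(raw)                             # binary gobbledygook, as expected
print(raw.decode("ascii", "replace"))  # nothing readable falls out of mine
```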

Implications

So since AWS exposes some of the internal implementation details of your load-balancer setups, what does this mean? It certainly implies that you can lower the bar for DoS’ing an ELB-hosted website by just picking one of the ELB IPs and slamming it. For a two-AZ example – instead of having to generate enough traffic to overwhelm both AZs, you can aim half that much at a single IP and knock out half the site’s capacity.

Considering the issues I’ve actually run into with autoscaling groups that won’t scale because only one AZ is overwhelmed, I wonder if it makes sense to only have autoscaling groups that span a single AZ.

And it also seems to imply that you can DoS an individual back-end server by replaying a session cookie that forces the load balancer to route every request to that same server. So perhaps, for high-performance environments, it makes sense to stick with shared-nothing architectures and *not* do any kind of session-based stickiness.