Running Riak in Classic-EC2 (in a non-moronic way)

So I’ve been playing around with Riak lately – and it is a spectacular piece of software. It has a lot of the cool clustering goodness that you might get from something like Cassandra – but is a lot less of a pain in the ass to administer and deal with. Interacting with the database in an ad-hoc fashion is actually delightful – you can use lovely little REST URL’s like: http://my-riak-lb.something.com/riak/bucketname/key to fetch one particular key. Riak doesn’t care what you store there – you could certainly throw a JSON blob in there, but you can throw whatever else you might like too. Want a secondary index? Make sure to use a 2i-capable backend (Leveldb, or others), then declare what values you want indexed when you write your data. Riak doesn’t load-balance over your cluster instances, but there are a ton of perfectly good load balancer solutions that already exist that work great with it. And if you need something tighter/faster than HTTP REST URL’s, you can use their protocol buffers interface for something that’s more socket-ey. If you’re into that kind of stuff, check it out. I’m really excited about it.

But be wary if you’re running it in Classic AWS (though their advice on running in VPC seems solid). The advice they give on their documentation website is terrible and wrong-headed. They actually expect you to manually go in and reconfigure the nodes on your cluster if one of your instances has to reboot. I worked out a better way. Of course, I offered it back to them, but they have yet to take me up on it.

You should basically run it the same way you’d run any other long-lived server in AWS. Allocate an Elastic IP address for each node. Look up the cool “public DNS name” for each of your elastic IP’s (should look like: “ec2-1-2-3-4.compute-1.amazonaws.com”), and use either that or a nice CNAME that points to that as your host identifier. That way, when your instances inside AWS resolve the name, they get the inside IP addresses. If your instance has to reboot, just reassociate your EIP to the instance. And that’s it. Oh, and the bits in the config where you’re supposed to put your internal IP address as something to listen to? Just put 0.0.0.0, it works fine (though there are allegedly some possible problems with Riak Control and HTTPS that way; that’s what Basho tells me but I don’t really know anything about that).

And you should of course protect your Riak port the same way you’d protect any other port for any other internal-facing system. There. That’s it. I have shut down and restarted my entire cluster (forcing it to get brand-new IP addresses), and using this method it seemed to work just fine.

The method Basho proposes to handle a reboot is as follows: Stop the node. Mark it ‘down’ using another node. Rename the node to whatever your current private IP address is. Delete the ring directory. Start the node. Rejoin the cluster. Invoke the magic command “riak-admin cluster force-replace ” (Oh, I hope you remembered to keep track of what the old name was when you renamed it!) (Oh, and one more thing! Up until a few months ago that was misdocumented as “riak-admin cluster replace …” which would toss your node’s data). Plan, then commit the changes. If you like to do it that way, you are some kind of weird masochist. And if you think this is remotely a good idea, then you are not good at technology.

I got into a long back-and-forth with one of their engineers about their crazy recommendations, but he refused to budge. I even submitted them a pull request with a fixed version of the documentation – and they still haven’t merged it. Why? I have no idea. The general impression I got is that the guy I was talking to was just being obstinate. We were literally talking in circles. “Why don’t you just use VPC!?” I’m sorry dude, I can’t, because all my shit is on Classic and that’s where I’m stuck, for now. “But if you give it an elastic IP, now it’s accessible to the world!” No more or less so than if it just picks its own IP, buddy. “But you’ll be trying to talk to the outside IP!” No, those funny names resolve to inside IP’s when you’re inside of Amazon. “Well, you should just use VPC!” And so on. Literally, this dude told me to use VPC like 5 times in the course of our email exchange. When I was explaining how to use Riak in Classic-EC2.

So, yeah. Good database. Had a really nice leisurely chat with one of their sales guys, too – he seemed really cool and knowledgable and low-pressure. But this ‘evangelist’ fellow certainly makes that group seem pretty dysfunctional to me.

Amazon Elastic Load Balancer (ELB) performance characteristics

So at my new job, I get to use AWS stuff a lot. We have many, many servers, usually sitting behind a load-balancer of some sort. Amazon’s documentation on these things isn’t very clear, so I’m trying to figure out what the damned things are doing.

First off – a good thing. These Load Balancers are really easy to use. Adding a new instance is a few clicks away using the Console.

Another good thing – you can use multiple availability zones as a way to avoid trouble when an entire Availability Zone goes down (as happened less than a year ago). And here’s where it gets ugly.

It seems to evenly split traffic across zones – even if you don’t have a balanced number of instances in each zone. And it seems to determine which zone to hit based on some hash of the source address. So, if you have 2 instances in an Autoscale group, in two availability zones, and you hit your array from one IP really hard – bad things ensue. The one AZ you’re hitting will max out the CPU, and the AZ you’re not hitting will be nearly 100% idle. That adds up to 50% utilization on average – not enough to cause a scaling event (with my thresholds, anyway).

So, in short, if you have fojillions of people from all over hitting your services indiscriminately, I’m sure it’ll be fine. But, if like me, you have 1000’s of people (or so) from all over, some really hard from one particular IP – it may not be a good idea to try to spin up more than one AZ. And spinning up 4 AZ’s seems silly – you’ll definitely have more of a chance of at least one of them going bad, and until your load-balancer figures that out, you’ll have one out of ‘n’ requests failing.

Another thing I’ve noticed – the ELB’s seem to have a strict timeout on accessing the back-end service. If it doesn’t get a response within 30 seconds, it’s going to drop the connection and hit the service again. I had a service that was nearly getting DOS’ed by the load balancer that way. Make sure you have sane timeouts.

So the next thing I was curious about was whether or not the ELB would do any batching-together of any of the returned data – would it act like some kind of reverse-proxy? I wrote a tiny Node server which spat out a little data, waited 10 seconds, spat out some more, waited another 10 seconds, then finished. Here it is:


#!/usr/local/bin/node
"use strict";


var http=require('http');


http.createServer(function (req,res) {
  console.warn("CONNECTION RECEVIED - waiting 10 seconds: "+req.url);
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write("initial output...n");
  setTimeout(function () {
    console.warn("first stuff written for "+req.url);
    res.write("Here is some stuff before an additional delayn");
    setTimeout(function () {
      console.warn("second stuff written for "+req.url);
      res.end('Hello Worldn');
    },10000);
  },10000);


}).listen(80);

So – I couldn’t tell any difference between hitting the server directly, or hitting the load balancer (if I telnetted straight to port 80). It acted the same way. There was definitely no batching.
What about if I have a slow-client, or something that’s trying to upload something – will it get batched together on the POST side?
I modified my code to do a ‘slow POST’ – and it worked similarly – I couldn’t actually tell the difference between running that on the load-balancer or running it on the instance directly.
I also wrote code to generate a large (1MB) response dynamically on the server-side, then wrote a client that would receive a little, pause the stream, resume it after a few seconds, pause it again, and so on. The one thing I noticed different between accessing the server directly versus the load balancer was that the server tended to give me data in 1024 byte chunks, whereas the load-balancer was giving me blocks closer to 1500 bytes. Weird! Well, maybe not – I *do* know for sure that the LB is reterminating the connection at the TCP level – the source IP address changes. I was writing the data in blocks of 1k, so maybe each write turned into exactly one packet of 1024 bytes? But in the LB side, the LB, when re-streaming my TCP data, was sending larger segments. Or so it would seem.
So it looks like I can’t get rid of the reverse-proxy that sits on top of some of our Ruby code. Oh well.