So I’ve been playing around with Riak lately – and it is a spectacular piece of software. It has a lot of the cool clustering goodness that you might get from something like Cassandra – but is a lot less of a pain in the ass to administer and deal with. Interacting with the database in an ad-hoc fashion is actually delightful – you can use lovely little REST URL’s like: http://my-riak-lb.something.com/riak/bucketname/key to fetch one particular key. Riak doesn’t care what you store there – you could certainly throw a JSON blob in there, but you can throw whatever else you might like too. Want a secondary index? Make sure to use a 2i-capable backend (Leveldb, or others), then declare what values you want indexed when you write your data. Riak doesn’t load-balance over your cluster instances, but there are a ton of perfectly good load balancer solutions that already exist that work great with it. And if you need something tighter/faster than HTTP REST URL’s, you can use their protocol buffers interface for something that’s more socket-ey. If you’re into that kind of stuff, check it out. I’m really excited about it.
But be wary if you’re running it in Classic AWS (though their advice on running in VPC seems solid). The advice they give on their documentation website is terrible and wrong-headed. They actually expect you to manually go in and reconfigure the nodes on your cluster if one of your instances has to reboot. I worked out a better way. Of course, I offered it back to them, but they have yet to take me up on it.
You should basically run it the same way you’d run any other long-lived server in AWS. Allocate an Elastic IP address for each node. Look up the cool “public DNS name” for each of your elastic IP’s (should look like: “ec2-1-2-3-4.compute-1.amazonaws.com”), and use either that or a nice CNAME that points to that as your host identifier. That way, when your instances inside AWS resolve the name, they get the inside IP addresses. If your instance has to reboot, just reassociate your EIP to the instance. And that’s it. Oh, and the bits in the config where you’re supposed to put your internal IP address as something to listen to? Just put 0.0.0.0, it works fine (though there are allegedly some possible problems with Riak Control and HTTPS that way; that’s what Basho tells me but I don’t really know anything about that).
And you should of course protect your Riak port the same way you’d protect any other port for any other internal-facing system. There. That’s it. I have shut down and restarted my entire cluster (forcing it to get brand-new IP addresses), and using this method it seemed to work just fine.
The method Basho proposes to handle a reboot is as follows: Stop the node. Mark it ‘down’ using another node. Rename the node to whatever your current private IP address is. Delete the ring directory. Start the node. Rejoin the cluster. Invoke the magic command “riak-admin cluster force-replace
I got into a long back-and-forth with one of their engineers about their crazy recommendations, but he refused to budge. I even submitted them a pull request with a fixed version of the documentation – and they still haven’t merged it. Why? I have no idea. The general impression I got is that the guy I was talking to was just being obstinate. We were literally talking in circles. “Why don’t you just use VPC!?” I’m sorry dude, I can’t, because all my shit is on Classic and that’s where I’m stuck, for now. “But if you give it an elastic IP, now it’s accessible to the world!” No more or less so than if it just picks its own IP, buddy. “But you’ll be trying to talk to the outside IP!” No, those funny names resolve to inside IP’s when you’re inside of Amazon. “Well, you should just use VPC!” And so on. Literally, this dude told me to use VPC like 5 times in the course of our email exchange. When I was explaining how to use Riak in Classic-EC2.
So, yeah. Good database. Had a really nice leisurely chat with one of their sales guys, too – he seemed really cool and knowledgable and low-pressure. But this ‘evangelist’ fellow certainly makes that group seem pretty dysfunctional to me.