Creating a site-to-site VPN in AWS between regions

I spent crazy amounts of hours – days, really – doing this. I figured I might at least try and save someone else some time.

The solution I went with was a simple software-based VPN using Amazon Linux instances, one in each region. I went with IPSec as my encryption/tunneling mechanism, and IKE (ISAKMP) as my method of exchanging keys. I selected Libreswan as my VPN software. I evaluated and discarded several other potential solutions, but this is what I actually got to work for me.

Prerequisites:

  1. You need to know some intermediate networking. If you don’t, I don’t think I can help you.
  2. You need to have non-overlapping network ranges in the two regions. If you don’t, you’ll need to evaluate some other kind of solution instead.
  3. This isn’t a walkthrough; this is more of an explanation of my process, and I’m going to try and help give troubleshooting steps along the way.
  4. You should probably be pretty good at AWS in general already. If not, you’re gonna have a bad time.
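
The non-overlap requirement in prerequisite 2 is easy to check by hand, but here is a minimal sketch in plain POSIX shell in case you'd rather script it. The function names and the example ranges are made up for illustration; plug in your own VPC CIDRs.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Print "overlap" or "ok" for two CIDR blocks.
# Two blocks overlap exactly when they agree on the shorter of the two prefixes.
cidr_overlap() {
  net1=${1%/*}; len1=${1#*/}
  net2=${2%/*}; len2=${2#*/}
  len=$len1
  if [ "$len2" -lt "$len" ]; then len=$len2; fi
  if [ "$len" -eq 0 ]; then
    mask=0
  else
    mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
  fi
  a=$(( $(ip_to_int "$net1") & $mask ))
  b=$(( $(ip_to_int "$net2") & $mask ))
  if [ "$a" -eq "$b" ]; then echo overlap; else echo ok; fi
}

cidr_overlap 10.0.0.0/16 10.1.0.0/16    # two separate /16s
cidr_overlap 10.0.0.0/16 10.0.32.0/20   # the /20 sits inside the /16
```

If this prints "overlap" for your two regions, the routing in Stage 3 can't work as described, because each side would consider the other side's addresses local.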

Instructions:

Stage 1: Information gathering

You should have a ‘here’ network and a ‘there’ network. Launch an instance on each side, on a public-facing subnet, and allocate an EIP for each one (you wouldn’t want the IP addresses to change). Make sure you can ssh into them. Make sure their security groups let them speak IKE (UDP port 500) and IPSec NAT-T (UDP port 4500) to each other.

Take note of the private network and netmask on the ‘here’ side, and the private network and netmask on the ‘there’ side. Take note of your EIPs for both sides too.
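
If you prefer the AWS CLI to clicking around the console, the Stage 1 setup might look roughly like this. It's a sketch, not what I actually ran: the security group ID and the peer address are placeholders, and by default the script only prints the commands (set DRY_RUN=0 to really run them).

```shell
# Placeholder values: substitute your own.
GATEWAY_SG="sg-0123456789abcdef0"   # security group on this side's VPN box
PEER_EIP="203.0.113.10"             # the OTHER side's Elastic IP

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Allow IKE (UDP 500) and NAT-T (UDP 4500) in from the peer's EIP only.
open_vpn_ports() {
  sg=$1; peer=$2
  for port in 500 4500; do
    run aws ec2 authorize-security-group-ingress \
      --group-id "$sg" --protocol udp --port "$port" --cidr "$peer/32"
  done
}

run aws ec2 allocate-address --domain vpc
open_vpn_ports "$GATEWAY_SG" "$PEER_EIP"
```

Run the mirror image of this in the other region, with the other security group and this side's EIP as the peer.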

Stage 2: ISAKMP IKE

yum install libreswan

on both boxes. Then, on each one, go to /etc/ipsec.d

Create a file with the name of the other side of the network, with a .conf extension. Put this in there:

conn name-of-connection-that-you-like
  left=my.internal.ip.address
  right=their.EXTERNAL.ip.address
  rightid=their.INTERNAL.ip.address
  leftsubnet=my_entire_private_subnet/CIDR-netmask (e.g. 10.0.0.0/20)
  rightsubnet=their_entire_private_subnet/CIDR-netmask
  authby=secret
  auto=start

Do something similar (but not the same, as you can see) for the other side.
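
To make "similar but not the same" concrete, here is a small sketch that prints the config for both ends from one template. All the addresses are invented placeholders (10.0.0.0/16 ‘here’, 10.1.0.0/16 ‘there’), and render_conn is just a helper name I made up.

```shell
# Print a Libreswan conn block.
# args: name my_internal_ip their_eip their_internal_ip my_subnet their_subnet
render_conn() {
  cat <<EOF
conn $1
  left=$2
  right=$3
  rightid=$4
  leftsubnet=$5
  rightsubnet=$6
  authby=secret
  auto=start
EOF
}

# 'here' box: private 10.0.0.10, EIP 198.51.100.1
render_conn to-there 10.0.0.10 203.0.113.10 10.1.0.10 10.0.0.0/16 10.1.0.0/16
echo
# 'there' box: private 10.1.0.10, EIP 203.0.113.10
render_conn to-here 10.1.0.10 198.51.100.1 10.0.0.10 10.1.0.0/16 10.0.0.0/16
```

Notice that left is always the box's own private address, right is the other box's EIP, and rightid is the other box's private address. On the real boxes you would redirect each block into its own file under /etc/ipsec.d.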

On both sides, create a file in that same directory called name-of-connection-that-you-like.secrets (yes, that’s plural). Contents should be:

my.internal.ip.address their.internal.ip.address : PSK ""

In between the quotes put an extremely long random string, and make sure it matches on both sides of the network. Here’s an example way to get a nice long string for a PSK:

dd if=/dev/urandom bs=1 count=33 2>/dev/null | base64

The reason I don’t show the file with a string already filled in is that I don’t want someone to copy-paste this and end up with some stupid password already embedded.
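
Here is a rough sketch of generating the key and the secrets line in one go. The helper names and IP addresses are placeholders of mine. It reads /dev/urandom rather than /dev/random, so it won't stall on a freshly booted instance that is short on entropy.

```shell
# 33 random bytes encode to exactly 44 base64 characters, with no '=' padding.
make_psk() {
  head -c 33 /dev/urandom | base64
}

# Print one secrets-file line.
# args: my_internal_ip their_internal_ip psk
secrets_line() {
  printf '%s %s : PSK "%s"\n' "$1" "$2" "$3"
}

psk=$(make_psk)
secrets_line 10.0.0.10 10.1.0.10 "$psk"
```

Generate the key once and carry the same value to both boxes. It's also worth making the .secrets file readable by root only (chmod 600), since anyone who can read it can impersonate either end.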

On both sides, make sure to do:

/sbin/chkconfig ipsec on

Okay, now it’s time to see if we screwed something up. Don’t worry, you almost certainly did.

service ipsec start

(Do that on both sides). So, now we should see if we can at least get the tunnel up and running. Check /var/log/secure for successful-looking things. It should say something about ‘connection established’ or similar. Do a ping from one of your VPN boxes to the inside IP address of the other box. Then do the same from the other side. Don’t expect to be able to ping further into the network either way yet. We haven’t done that bit.
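
If you want something quicker than eyeballing the log, a grep along these lines can help. Treat the pattern as an assumption: the exact wording of the "established" messages differs between Libreswan versions, so adjust it to whatever your log actually says. The IP in the comment is a placeholder.

```shell
# Succeeds if the log contains an "SA established" style message.
# arg: path to the log file (defaults to /var/log/secure)
tunnel_established() {
  grep -Eq 'IPsec SA established|ISAKMP SA established' "${1:-/var/log/secure}"
}

if tunnel_established /var/log/secure 2>/dev/null; then
  echo "log shows an established SA"
else
  echo "no established SA in the log yet (or the log is not readable)"
fi

# Then, from the 'here' box, ping the other box's PRIVATE address:
#   ping -c 3 10.1.0.10
```

A match in the log only tells you the negotiation finished; the ping is still the real test.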

If you can ping across both ways, then you have gotten over the worst hump. The rest is much easier. You can skip to stage 2z. But if it still doesn’t work, read on.

Stage 2a: Troubleshooting 🙁

So your pings across didn’t work. That sucks, and I probably spent around 75% of my time on this solution stuck in that sucky state. It’s hard, because you don’t have very good troubleshooting tools.

First off, /var/log/secure – this was the one that clued me in to the rightid setting that I settled on, above. I was getting messages saying “No configuration found for (other-sides-private-ip).” Maybe if you’re lucky, you’ll see something glaring in that log.

Any time you make changes to your config or your secrets file, be on the safe side and do service ipsec restart

I tried a lot of ipsec status as well. That had a lot of gobbledygook, but sometimes gave me some decent clues.

Some other handy tools: ip xfrm help was quite nice, and it led me to ip xfrm monitor, which gives you a real-time-ish view of what the various IPSec things are doing.
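
To save some typing while you're stuck here, this sketch dumps the state I kept looking at in one go. The function name is mine, and the eth0 interface in the tcpdump comment is an assumption about your instance.

```shell
# Run each diagnostic command if its tool exists, with a header per command.
collect_diag() {
  for cmd in "ipsec status" "ip xfrm state" "ip xfrm policy"; do
    echo "== $cmd =="
    tool=${cmd%% *}
    if command -v "$tool" >/dev/null 2>&1; then
      $cmd 2>&1 || echo "(command failed; are you root?)"
    else
      echo "($tool not installed)"
    fi
  done
}

collect_diag

# Live view of SA and policy changes (Ctrl-C to stop):
#   ip xfrm monitor
# Watch the IKE and NAT-T packets themselves:
#   tcpdump -n -i eth0 'udp port 500 or udp port 4500'
```

One warning: when run as root, ip xfrm state prints the actual session keys, so don't paste that output into a forum post.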

And remember, all we should be able to do is ping from ‘here’ to the private IP address of ‘there’. Nothing more (yet).

Stage 2z: Your Future Self will thank you

Okay, now comes a sneaky, but important part: reboot both ends. Once they’re back up, make sure you can ping across just as before.

I caught that I had forgotten my chkconfig on both sides by doing this. I also learned about the auto=start setting that I recommend above. It’s better to do this NOW rather than have someone call you up at 5am and have you scrambling to remember how you set up all this VPN stuff months and months ago.

Stage 3: Routing

Okay, you can ping from your gateway across your tunnel, but not any further. Let’s see if we can get some routing going to make your VPN actually useful.

Stage 3a: One-and-a-half hops

From the gateway on this side of the tunnel, we want to ping across the tunnel and to some device on their side of the network.

First, in the AWS Console, there is a setting in EC2 that you will need to change on the far side gateway instance. It’s called “Enable src/dest checks”. Disable those checks; they make sense for normal servers but they won’t work on a router.
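
The same change can be made from the AWS CLI. This is a sketch with a placeholder instance ID, and it only prints the command unless you set DRY_RUN=0.

```shell
GATEWAY_INSTANCE="i-0123456789abcdef0"   # placeholder: your far-side gateway

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Turn off the source/destination check so the instance may forward
# packets that are neither from nor to its own address.
disable_src_dest_check() {
  run aws ec2 modify-instance-attribute --instance-id "$1" --no-source-dest-check
}

disable_src_dest_check "$GATEWAY_INSTANCE"

# To confirm afterwards:
#   aws ec2 describe-instance-attribute --instance-id "$GATEWAY_INSTANCE" \
#     --attribute sourceDestCheck
```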

Next, on your far-side instance, go into /etc/sysctl.conf and enable routing:

net.ipv4.ip_forward = 1

And also turn this weird setting off:

net.ipv4.conf.default.rp_filter = 0

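I didn’t know what that second one meant at the time, but something told me to turn it off (ipsec verify maybe?) so I did. It’s reverse-path filtering: the kernel drops packets that arrive on an interface it wouldn’t use to route back to their source, which can quietly eat tunneled traffic. Safest bet is to reboot the instance after that, but you can also reload sysctl.conf without a reboot by running sysctl -p as root.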
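
If you'd rather script the sysctl.conf edit than open an editor, here is one way it might be done. set_sysctl is a helper I made up; it replaces any existing line for the key, so running it twice doesn't leave duplicates. The demo runs against a scratch file so nothing on your system is touched.

```shell
# Set "key = value" in a sysctl-style file, replacing any existing entry.
# args: file key value
set_sysctl() {
  file=$1; key=$2; value=$3
  touch "$file"
  # Keep every line whose key (text before '=', spaces removed) differs.
  awk -v k="$key" '{ n = $0; sub(/=.*/, "", n); gsub(/[ \t]/, "", n); if (n != k) print }' \
    "$file" > "$file.tmp"
  echo "$key = $value" >> "$file.tmp"
  cat "$file.tmp" > "$file"    # overwrite in place to keep the file's permissions
  rm -f "$file.tmp"
}

scratch=$(mktemp)
set_sysctl "$scratch" net.ipv4.ip_forward 1
set_sysctl "$scratch" net.ipv4.conf.default.rp_filter 0
cat "$scratch"
rm -f "$scratch"

# On the real gateway, as root, point it at /etc/sysctl.conf instead and
# then apply the file with:
#   sysctl -p
```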

Next, let’s check that you have a reasonable security group set on your far-side gateway. It should already be allowing the various VPN protocols across – and that’s nice. But it also needs to have the ability to talk to whatever far-side instance you want to talk to. Test by ssh’ing into your far-side gateway, and do a ping or a curl or whatever to your far-side instance. It should work. If it does, your security group situation is probably OK.

Last couple of steps now. In the VPC settings in the AWS Console for the far-side region, add a route for the near network, via the far-side gateway. Repeat this for any different routing tables you might have (I have private and public routing tables, for example).
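
Since it's easy to miss one of the routing tables (I did), a loop helps. This is a sketch with placeholder IDs and CIDR, printing the commands by default; set DRY_RUN=0 to run them for real.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add a route to the other side's network, via the local VPN gateway
# instance, in every routing table given.
# args: destination_cidr gateway_instance_id route_table_id...
add_vpn_routes() {
  dest=$1; gw=$2; shift 2
  for rtb in "$@"; do
    run aws ec2 create-route --route-table-id "$rtb" \
      --destination-cidr-block "$dest" --instance-id "$gw"
  done
}

# In the far-side region: send traffic for the near network (10.0.0.0/16 here)
# to the far-side gateway, in both the public and private tables.
add_vpn_routes 10.0.0.0/16 i-0123456789abcdef0 \
  rtb-0aaaaaaaaaaaaaaaa rtb-0bbbbbbbbbbbbbbbb
```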

Now, ping across to the far instance. If that works, congrats! It took me a little fiddling – I had missed a security group, and I had missed one of my routing tables. But I eventually got there. Hopefully, you will too. There just aren’t as many moving parts to deal with here, luckily.

Things to keep in mind: if you can ping across the tunnel, and if your far-gateway can ping to your far-instance, then the issue almost has to be routing-related. Either you messed something up in one of your routing tables, or you messed up the far gateway and it’s not forwarding correctly.

Also, don’t try and jam everything all into one security group; I think that gets messy. I use one security group which allows the VPN protocols, and another security group that grants access to inside-ey instances. A little neater that way, IMHO.

Stage 3b: One and a half hops the other way

Now repeat the same process starting from the far gateway, to a near instance. Same rules apply, including the src/dest check and the sysctl settings, this time on the near gateway. You should be able to ping, or curl, or use whatever other protocol you want across the VPN. You’ll be checking security groups and adding routes and so on as before, but this time it’s routes for the far network in the near VPC’s routing tables.

Stage 3c: Two Full Hops

This one is mostly to get you to feel better. Try pinging from your near-instance to your far-instance. It should work. And vice-versa.

CONGRATS, YOU DID IT!

Stage ∞: Advanced

Resiliency

Some stuff to think about: you have an entire VPN hinging on two crappy little AWS boxes. You sure you’re ok with that? I don’t think I am. Consider maybe spinning up another pair of boxes, and having redundant routes in your routing tables. Hopefully the AWS magic will let the remaining route take over for the broken one, if there is a break. But I would test that. Two redundant routes are near-useless if every other packet goes over a dead link.
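
One caveat on the redundant routes idea: as far as I know, a VPC routing table holds only one route per destination CIDR, so in practice "redundant" means something has to swap the route over when the primary gateway dies. Here is a rough sketch of that swap, meant to run from cron on some other box in the same VPC. Everything in it is an assumption: the IDs are placeholders, the ping flags are the Linux ones, and the probe only notices a dead gateway instance, not a dead tunnel.

```shell
# Health probe; override PROBE to exercise the logic without a network.
PROBE=${PROBE:-"ping -c 3 -W 2"}

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pick the primary gateway if it answers the probe, else the backup.
# args: primary_gateway_private_ip primary_instance_id backup_instance_id
choose_gateway() {
  if $PROBE "$1" >/dev/null 2>&1; then echo "$2"; else echo "$3"; fi
}

# Point the VPN route at whichever gateway is currently healthy.
# args: primary_ip primary_id backup_id dest_cidr route_table_id
failover_once() {
  gw=$(choose_gateway "$1" "$2" "$3")
  run aws ec2 replace-route --route-table-id "$5" \
    --destination-cidr-block "$4" --instance-id "$gw"
}

# Example cron-driven call (placeholders throughout):
#   failover_once 10.0.0.10 i-0aaaaaaaaaaaaaaaa i-0bbbbbbbbbbbbbbbb \
#     10.1.0.0/16 rtb-0ccccccccccccccccc
```

The probe targets the primary gateway's private address in the local VPC on purpose, so its result doesn't depend on which gateway the route currently points at.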

Routing

Once you’ve set up enough point-to-point connections, trying to maintain routing tables everywhere starts to become a pain in the ass. Consider implementing a dynamic routing protocol. I would probably stay away from RIP, because I would fear its routing tables might not converge fast enough. But it’s also really easy, so maybe I would at least give it a try. OSPF is another option, but does have some complexity to it. IS-IS makes you sound like you’re on a watchlist, but is another interior routing protocol. BGP is the big daddy here, but my impression of it was that it was better at large routing tables (like, uh, the Internet).

But I suspect, if you were able to get any of these working, any of them would be OK-ish. One concern I might have is that you can’t influence the AWS Routing tables from your instance; I don’t know exactly how that would work. Would you have to route all traffic down to your own internal router instance? That would suck.

VPC VPN Gateway Service from AWS

Amazon has a VPN Gateway service, but it is really designed to hook the VPN device you have in your office to your Amazon network. Why they would be so strangely short-sighted, I have no idea. But a lot of my experimentation was trying to get this service set up in my ‘spoke’ networks, and then have a central network that is a pseudo-“Customer Gateway”. They support BGP and other stuff there, too. In trying to mess around with this, I had trouble because they really wanted some crazy 169.254.x.y/30 network to be either side of the gateway, and they want two tunnels per VPN Gateway device. Which is weird, because you only get one tunnel back as the ‘customer gateway’ but, whatever. Anyways, there might be something here – and if you go with routing this might be the only way for you to go.

Conclusion

The AWS Marketplace was really, really, really embarrassingly bad when it came to this. The Linux documentation on what tools to use here is a total shambles. There are apparently two competing sets of userland tools: the blah-swan stuff and ipsec-tools. I basically went with the one that had more examples written and better documentation. There was lots of weird stuff on Amazon Linux (which is RHEL/CentOS-flavored rather than actually CentOS). Like, if you followed their (quite lovely) documentation, you would get some bizarre hard failures, and googling didn’t get me anywhere.

If this is anything like the last time I did all kinds of crazy networking work in AWS, I will end up finding out that Amazon releases “Amazon EZ VPN Interconnect Service” in a few weeks for like 2¢ a month or something. Oh well. Until then, we have the stuff I listed above.

Hope I at least saved you a bit of time, if nothing else.

One thought on “Creating a site-to-site VPN in AWS between regions”

  1. Sigh. Of course, AWS announced a new “inter-region VPC-connect” which obviates pretty much this entire article.

    Oh well. Still, I definitely was able to up my troubleshooting game by trying to set up the VPN as I mentioned in this article.
