Daily AWS Wtf: An endless stream of thoughts, wtf's and fixes for the latter on Amazon Web Services.
August 12, 2009 at 4:15pm

Creating EC2 AMIs With Fresh SSH Host Keys

Having a specialized EC2 AMI for your application is very useful and allows you to have one common platform to test and deploy updates. Creating and maintaining EC2 AMIs is not difficult. You can use the EC2 AMI tools by hand or the EC2 console.

When you create or update an AMI from a running EC2 instance (bundling), there are a couple of things to think of:

  • Clear log files
  • Clear AWS keys
  • Clear shell history
  • Do not bundle /tmp
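
The checklist above can be scripted so it is not forgotten before bundling. What follows is only a rough sketch: the paths (`var/log`, `root/.bash_history`) and the `secrets` parameter are assumptions for illustration, so adjust them to wherever your logs, history files and AWS keys actually live. Excluding /tmp is left to the bundling tool itself.

```python
import os
import tempfile


def truncate_logs(log_dir):
    """Empty every regular file below log_dir, keeping the files themselves."""
    truncated = []
    for dirpath, _dirnames, filenames in os.walk(log_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                open(path, "w").close()  # opening for write truncates to zero bytes
                truncated.append(path)
    return sorted(truncated)


def remove_files(paths):
    """Delete the given files if present; return the ones actually removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


def prepare_for_bundling(root="/", secrets=()):
    """Clear logs, shell history and key files below root (paths are assumptions)."""
    truncate_logs(os.path.join(root, "var/log"))
    history = [os.path.join(root, "root/.bash_history")]
    return remove_files(history + list(secrets))
```

Try it against a scratch directory first by passing a different `root`; on the real instance you would pass the paths of your AWS key files as `secrets`.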

If you want to know more, Eric Hammond, who maintains the wonderful Alestic Ubuntu AMIs, recently gave an informative presentation on creating AMIs.

A small hint that I want to share with you is creating fresh SSH host keys on your AMIs. The Alestic Ubuntu images create a new key pair for the SSH daemon at boot time. If you use those as a base, you usually go on to install a couple of packages and make your changes. Then you re-bundle and distribute your AMI. The problem is that the SSH host keys have already been generated by then, and every instance started from the AMI (either yours or other people's, if you share your AMI) uses those very same keys. This is bad from a security viewpoint, as it makes man-in-the-middle attacks possible.

So what you want to do is to re-generate the SSH keys before you bundle the AMI. Deleting the keys alone will not work, as sshd will not function without the keys present. So you need to generate new ones at the first boot, but only at the first boot. On subsequent boots the keys should stay.

The Alestic AMIs solve this nicely by having a /etc/init.d/ec2-ssh-host-key-gen script that will delete the existing keys and generate new ones. At the end of the script, it makes itself un-executable. This will prevent future runs of the script.

So if you want to re-bundle your AMI with fresh SSH keys, just make this script executable again and then bundle the AMI. This will create AMIs that will generate a new key pair on boot and keep it.

# chmod a+x /etc/init.d/ec2-ssh-host-key-gen
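
To make the run-once idea concrete, here is a minimal sketch of the pattern. It is not the actual Alestic script: the function names, the default key type and the `/etc/ssh` file naming are assumptions, and only the `ssh-keygen` flags (`-q`, `-t`, `-N`, `-f`) are the real ones.

```python
import os
import stat
import subprocess
import tempfile


def keygen_commands(ssh_dir="/etc/ssh", key_types=("rsa",)):
    """Build one ssh-keygen call per host key type (empty passphrase)."""
    return [
        ["ssh-keygen", "-q", "-t", key_type, "-N", "",
         "-f", os.path.join(ssh_dir, "ssh_host_%s_key" % key_type)]
        for key_type in key_types
    ]


def disable_script(path):
    """Strip all execute bits so the script does not run on the next boot."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def regenerate_host_keys(script_path, ssh_dir="/etc/ssh",
                         run=subprocess.check_call):
    """Delete the old host keys, create new ones, then disable the script."""
    for command in keygen_commands(ssh_dir):
        key_file = command[-1]
        for stale in (key_file, key_file + ".pub"):
            if os.path.exists(stale):
                os.remove(stale)
        run(command)
    disable_script(script_path)
```

The chmod above is then simply the inverse of `disable_script`: it re-arms the script for exactly one more run.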

@jweiss

August 11, 2009 at 10:52pm

You Could Have It So Much Better With Nanite And Redis

Nanite is a perfect match for a computing cloud (you know I’m talking about EC2, right?). Its major feature is that you can assemble a cluster of daemons dynamically. In the cloud, where resources come and go, that’s just what you want. It scales pretty well, and whenever you fire up new agents to share an increasing workload, they register themselves in the cluster and are immediately available to handle requests.

The work is distributed by another tier consisting of mappers. They delegate the work to the agents they know, based on an algorithm of your choosing. For that to work, all mappers must know as many of the agents as possible. Knowing all of them would obviously be best.

Agents send out heartbeats which are by default picked up by all mappers. They update their internal status map of the agents they know. The status is removed as soon as an agent terminates or times out.

Requests (meaning someone wants the cluster to get off their lazy butts and do something) usually go through a mapper, that’s where you normally send them from. However, you can also send requests from one agent to another. By default, those requests again go through all mappers and are then sent to the appropriate agents.

Hold on! Picked up by all mappers, and sent out again? Yes, that’s right, and usually not something you want.

Here’s another scenario: One of your mappers goes down, or even several of them, or you need to bring up new mappers because your existing ones have too much work to do. When they go down, all their knowledge of agents is lost, so when they come back up, it takes a while until they pick up the heartbeats of all the agents in the cluster. You’re right thinking that sucks. Because it does.

I’ve contemplated whether that’s a bug, but think about it. When you don’t have shared state amongst all the mappers, how’s an agent to know if all the mappers know of all the other agents in the cluster? What if the one mapper that picks up the request currently has no knowledge of the agent that handles the service you requested?

But wait! The solution is at hand. Nanite can use redis to store the agent’s status. Wait, it gets better. When you use Nanite with redis, only one and exactly one mapper picks up the messages from both the heartbeat and the request queue. There’s also a third queue involved, which is used for registration. That too will only end up at one mapper.

Now that’s neat, because the one mapper picking up the message will update the status in redis, and just like that all the other mappers know about new and terminated agents. You don’t lose anything. If a mapper goes down the knowledge about the agents is still in the cluster, and if it comes back up again, it immediately knows about all the agents and their services. redis ensures that your work will be evenly spread out across your cluster, so do yourself a favor and use it. Not exclusively with Nanite of course, but it’d be a start.

But what if redis goes down? It’s not really a big deal is it? If you take the appropriate measures and secure undeliverable messages in an offline queue (yeah, Nanite has support for that), you can just bring it back up and your cluster will pretty much come back up all by itself. If you want to ensure that can’t happen, you can replicate redis. Or rely on the redis Ruby library and its ring-hashing. Did I mention that redis can store the data on disk as well? Yeah, that’s nice too. Not 100% reliable as the data is written asynchronously, but still good enough.

That’s why you want to use redis with Nanite. As soon as you have several mappers running in your Nanite cluster, you’re doing its health and scalability a big favor.

After reading all this you totally deserve a pro tip: When you work with redis, be sure to keep your mappers’ timeouts longer than the ping time of the agents. Why? Because I tore my hair out figuring out why my cluster forgot its agents every now and then. If your agents’ ping times are longer than or as long as the timeouts on the mappers, chances are that an agent will time out on one of them, which will in turn remove it from the cluster until its next ping. We’re not talking about a once-in-a-million situation. It’ll happen, trust me.
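
If you want to see why, here is a toy model of the problem. It is not Nanite code (Nanite is Ruby, and none of these names exist in it); it just assumes a registry that stores the time of each agent’s last ping and drops agents whose last ping is at least `timeout` seconds old.

```python
class AgentRegistry:
    """Toy stand-in for the shared agent status map (hypothetical names)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, agent, now):
        # Record the time of the agent's latest ping.
        self.last_seen[agent] = now

    def prune(self, now):
        # Remove and return every agent whose last ping is too old.
        expired = [agent for agent, seen in self.last_seen.items()
                   if now - seen >= self.timeout]
        for agent in expired:
            del self.last_seen[agent]
        return expired


def count_drops(ping_interval, timeout, duration=60):
    """Tick once per second; a mapper prunes just before the ping arrives."""
    registry = AgentRegistry(timeout)
    registry.heartbeat("agent-1", 0)
    drops = 0
    for now in range(1, duration + 1):
        if registry.prune(now):
            drops += 1
        if now % ping_interval == 0:
            registry.heartbeat("agent-1", now)
    return drops
```

In this model `count_drops(15, 15)` is greater than zero, because the agent gets removed right before each ping brings it back, while `count_drops(15, 30)` stays at zero.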

@roidrage

August 10, 2009 at 3:12pm

Identify an EBS Volume's Filesystem Type

When you manage cloud infrastructure automatically (with Chef maybe?), it’s not a given that you always know if an EBS volume has been properly initialized with a filesystem. Before you start parsing output from parted or fdisk, you should be aware of the awesome tool blkid. Its purpose is just what we need, to output a device’s filesystem, and to give a proper return code if it doesn’t have any filesystem, or there’s no device at all. We’re speaking in terms of Linux in this case, your mileage may vary with other operating systems.

The default output is to display some variables and their values. Nice to parse and all, but we really only want the filesystem type, so we use the following command:

blkid -s TYPE -o value /dev/sdi

Sweet! On a properly initialized volume it’ll just print out the filesystem type, and otherwise it won’t print anything and return 2. I’m sure you’ll figure out how to use that in your automated infrastructure.
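
If your automation happens to be written in Python, a thin wrapper could look like the sketch below. The function name and the injectable `run` parameter are my own additions for illustration; the `blkid` call itself is exactly the command shown above.

```python
import subprocess


def filesystem_type(device, run=subprocess.run):
    """Return the filesystem type of device, or None if blkid finds none."""
    result = run(
        ["blkid", "-s", "TYPE", "-o", "value", device],
        capture_output=True,
        text=True,
    )
    fs_type = result.stdout.strip()
    # A non-zero exit code or empty output means: no filesystem, or no device.
    if result.returncode != 0 or not fs_type:
        return None
    return fs_type
```

A caller can then decide to create a filesystem only when `filesystem_type("/dev/sdi")` returns `None`.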

11:05am

Announcing the Amazon Web Services User Group Berlin

We’re happy to announce the first meetup of the Berlin Amazon Web Services User Group. It’ll be on September 28, starting at 8 pm. You can RSVP on Amiando or on Xing. Location is still to-be-announced. We’ll keep you posted.

Special guest on this occasion will be Martin Buhr, European Business Directory - Amazon Web Services. So if there’s anything you ever wanted to know, drop by and say hi! Of course, if you have your own wtf’s with AWS, feel free to take the opportunity to talk about them.

August 6, 2009 at 2:55pm

Elastic Load Balancer and EC2 instance bandwidth

So we are working on a caching-related project on EC2. In this scenario high performance is very important.

We set up a Varnish cluster on EC2 and evaluated whether it can replace an existing caching infrastructure in terms of costs and requests per second. Our benchmarks yielded some interesting results. It seemed that for our caching scenario the limiting factor is bandwidth. Varnish is very humble with CPU/RAM consumption. We could easily deliver 500 to 600 requests per second with a small instance and have the box idle around 95% (uncompressed content).

It turns out we are limited by bandwidth and not by CPU.

In our benchmarks we were only able to push 35 MB/s on small instances. So the actual requests per second depended on the size of the objects we were pushing. The limit was always ~35 MB/s. Our typical HTML pages were around 50 to 70 KB, so we couldn’t reach the desired requests per second, as our instance was at its bandwidth limit.
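
The arithmetic behind that ceiling is simple enough to write down. This is a back-of-the-envelope sketch that assumes decimal units (1 MB = 1000 KB) and ignores HTTP and TCP overhead; the function name is made up for illustration.

```python
def max_requests_per_second(bandwidth_mb_per_s, object_size_kb):
    """Upper bound on requests/s when bandwidth is the only limit.

    Assumes 1 MB = 1000 KB and no protocol overhead.
    """
    if object_size_kb <= 0:
        raise ValueError("object size must be positive")
    return bandwidth_mb_per_s * 1000 / object_size_kb
```

At 35 MB/s, a 70 KB page caps out at 500 requests per second and a 50 KB page at 700, no matter how idle the CPU is.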

Usually when one instance hits its resource limits, you load balance across multiple ones. HAProxy is a fine example of a very robust TCP/HTTP load balancer. The problem, though, is that it will not increase your bandwidth, as all your traffic has to go through this one HAProxy instance. So even when you load balance multiple instances, each capable of pushing ~35 MB/s (so roughly 350 MB/s in total with 10 small instances), the bottleneck will still be at ~35 MB/s (aka the load balancer).

So any load balancing that is driven by an EC2 instance will limit the bandwidth. The maximum you can get is the load balancer’s bandwidth. If you want/need more than that there is only network level load balancing left. Some more advanced load balancing solutions (ARP/router level) offer features like this.

The question was, can Amazon’s Elastic Load Balancer do this?

After setting up an Elastic Load Balancer and configuring multiple instances as the backends, the answer was: No, it can’t.

It seems like the Elastic Load Balancer is also limited to a single Ethernet connection of at most 1 Gbit (maybe it is also just a small EC2 instance?) and thus cannot increase the bandwidth beyond 35 MB/s. This is even more critical if you use larger instances, as it actually decreases your bandwidth. More in a minute.

So with Elastic Load Balancer out of the question, the only available solution on EC2 is DNS Round Robin. There, you’ve heard it. Yes, the old and ugly DNS Round Robin.

DNS Round Robin will allow you to increase your bandwidth with every entry/instance you add. The only problem is that it is a bit inflexible and you can’t route the traffic yourself as DNS clients are picking each target/instance as they like. For a small number of instances (2-4 maybe) it is tolerable and solves our bandwidth problem.
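
To put numbers on the difference, here is a small sketch comparing the two setups. It assumes that DNS clients spread themselves evenly across the entries, which real clients only approximate, and the function names are made up for illustration.

```python
def behind_single_balancer(instance_mb_per_s, instances, balancer_mb_per_s):
    """Aggregate bandwidth when all traffic passes through one balancer."""
    return min(balancer_mb_per_s, instance_mb_per_s * instances)


def with_dns_round_robin(instance_mb_per_s, instances):
    """Aggregate bandwidth when clients connect to the instances directly.

    Assumes an even spread of clients across all DNS entries.
    """
    return instance_mb_per_s * instances
```

With small instances at ~35 MB/s, ten of them behind one balancer that itself tops out at 35 MB/s still deliver 35 MB/s, while three of them in DNS Round Robin can deliver up to 105 MB/s.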

Further, it seems the bigger the instance, the more bandwidth you have. Amazon does not guarantee any bandwidth, but on the XL instances we guess that you have a physical server to yourself, so you have this box’s bandwidth to yourself. On the smaller instances the bandwidth is shared and thus can also be worse than in our benchmarks.

So our solution is to use DNS Round Robin for two to three HighCPU medium instances. This proved to be very cost-effective, and the HighCPU medium instances push out more bytes per second than the small instances.

A follow-up post will show the exact number for each instance.

@jweiss

August 5, 2009 at 12:03pm

Running RabbitMQ on EC2

So running RabbitMQ on EC2 is pretty straight-forward, no surprises there. Get the correct packages from their website, and install them. We did just that, and bundled our own images which are fueled by some user data to configure the initial vhosts and users with their permissions.

From time to time that would fail, and the running RabbitMQ was busted. Killing it and restarting brought the cure, but hey, we’re in the cloud, things should work automagically.

When I checked the running instance, it showed that its node name was “rabbit@(node).” Everyone I asked who knew Erlang agreed that it shouldn’t say “(node)” but instead have a real host name in there.

So my guess was that RabbitMQ is just booting too early. Its start level number, at least on Ubuntu, is 20, which is quite early for a service like it. We’re using the Alestic images as a base, and I had a look at the custom start script it’s using to fetch user data from Amazon’s web service. Turns out these scripts use the following line to wait until the network is up:

perl -MIO::Socket::INET -e '
until(new IO::Socket::INET("169.254.169.254:80")) {
  print"Waiting for network...\n";sleep 1
}'

So I put this in the RabbitMQ init script, just before the commands that start the server, and I also bumped up the start level, and rebundled the image. Lo and behold, no problems since then.
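
For those who would rather not shell out to Perl, the same wait loop can be sketched in Python. This is my own translation, not something taken from the Alestic scripts; the `max_attempts`, `connect` and `sleep` parameters are additions that make the loop bounded and easy to try out.

```python
import socket
import time


def wait_for_network(host="169.254.169.254", port=80, delay=1.0,
                     max_attempts=None, connect=socket.create_connection,
                     sleep=time.sleep):
    """Block until a TCP connection to host:port succeeds.

    Returns True on success, or False once max_attempts is used up
    (by default it retries forever, like the Perl one-liner).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            connection = connect((host, port), timeout=2)
        except OSError:
            if max_attempts is not None and attempts >= max_attempts:
                return False
            print("Waiting for network...")
            sleep(delay)
        else:
            connection.close()
            return True
```

Called without arguments, it polls the same metadata address as the Perl version once per second.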

Another issue: Mnesia (used by RabbitMQ to store routing information and some more data) apparently saves the current IP address or host name. So if you want to bundle your own RabbitMQ image, be sure to delete /var/lib/rabbitmq/mnesia before doing so. When starting up, RabbitMQ will recreate that directory.

August 4, 2009 at 7:34pm

Permissions for Private EC2 AMI Images on S3

We had a weird issue with EC2 not being able to read our AMI images stored on S3. It reported an error 403 with the suggestion to check the ACL permissions. That’s not very helpful all by itself, if the permissions seem fine with regards to its owner.

Turns out EC2 uses a special user id for fetching the images, and that user needs read access to the files in your bucket. Behold the user’s name, it’s za-team. Obviously you won’t be able to just go ahead and use S3Fox to set permissions for that user with just the name, you need the hashed id in its full glory. Let me save you the trouble, here it is: 6aa5a366c34c1cbe25dc49211496e913e0351eb0e8c37aa3477e40942ec6b97c.

Neat, huh?

Also, beware of using S3Fox to recursively set permissions on “folders.” It might just remove the user za-team without you noticing it.

7:23pm

Disassociating an Elastic IP

When you disassociate an Elastic IP from an instance, it takes several minutes for EC2 to pick that up and automatically assign the instance a new IP from the pool. Unless of course you assign it another IP from your own pool of Elastic IPs.

7:03pm

Wtf?

Working with the Amazon Web Services every day, you’re bound to come across oddities and bits that aren’t really documented anywhere. That’s what this is about. Bloopers, little bits, code samples, it’ll all go in here.
