<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>An endless stream of thoughts, wtf’s and fixes for the latter on Amazon Web Services.</description><title>Daily AWS Wtf</title><generator>Tumblr (3.0; @dailyawswtf)</generator><link>http://dailyawswtf.com/</link><item><title>A Word on EC2 Instance Proximity</title><description>&lt;p&gt;This morning I had an incident that wasn’t a first for me, so I investigated a bit further. I got the feeling that two instances I launched ended up being on the same physical machine.&lt;/p&gt;

&lt;p&gt;Disclaimer: This is in no way a fully forensic investigation, I’m merely putting some data together. The only true way to find out where the instances are located would be to walk into Amazon’s data center and check, but I’m not too sure that’s gonna be possible any time soon. Feel free to add additional information in the comments. If I’m totally off with my guess, please let me know.&lt;/p&gt;

&lt;p&gt;I tested in the EU regions, so your mileage may vary in others. I’m not posting this to make EC2 look bad. These are musings and findings, and I find it quite interesting to look into stuff like that, if only to find out why the instance’s booting was so gosh darn slow.&lt;/p&gt;

&lt;p&gt;I fired up two small EC2 instances in the same availability zone while doing some testing for &lt;a href="http://scalarium.com"&gt;Scalarium&lt;/a&gt;. Our workflow involves initial provisioning right after boot which usually takes quite a while on small instances.&lt;/p&gt;

&lt;p&gt;This time it was different, because the provisioning part took a lot longer than usual, on both instances, at least twice as long. This has happened before and it reminded me of this paper on &lt;a href="http://cseweb.ucsd.edu/~hovav/dist/cloudsec.pdf"&gt;“Exploring Information Leakage in Third-Party Compute Clouds”&lt;/a&gt;. Recommended read if you’re spending quality time in the cloud and want to know if and how others are able to guess the physical location of your instance.&lt;/p&gt;

&lt;p&gt;Guessing just from the load is a long shot, but it’s an indication. The instances were launched almost simultaneously, and according to the paper EC2 launches instances across machines in order. So two instances of the same type, launched in the same availability zone have a good chance of being on the same physical machine.&lt;/p&gt;

&lt;p&gt;The paper mentions some other indicators, so I investigated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Close IP proximity. The last digit in the first hop’s IP address was 2 on the first, and 3 on the second instance. Same subnet, IP increased by one. Supposedly that hop is the Dom0 address, which goes back to the hypervisor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Traceroute. Doing a traceroute from one instance to the other had only one hop. I tracerouted another one I launched some time later: Four hops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Short packet round trip. The round trip was short, usually less than 0.5ms, but the same was true for the third instance. That’s throwing off the whole initial theory a bit, because maybe that means the two initial instances didn’t run on the same machine. You could argue that the pings on the first two instances were a bit faster, but not all the time, we’re talking about a scale of less than 0.05ms. I’m not sure if that counts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
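&lt;p&gt;To make the first indicator a bit more concrete, here’s a rough Ruby sketch of the “same subnet, IP increased by one” check. The addresses are made up, the /24 prefix is my assumption, and &lt;code&gt;same_subnet_neighbours?&lt;/code&gt; is a hypothetical helper, not something from the paper. A match is a hint at best, not proof.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require "ipaddr"

# Hypothetical helper: true when both first-hop addresses share a subnet
# (assumed to be a /24 here) and differ by exactly one.
def same_subnet_neighbours?(first_hop_a, first_hop_b, prefix = 24)
  a = IPAddr.new(first_hop_a)
  b = IPAddr.new(first_hop_b)
  same_subnet = a.mask(prefix) == b.mask(prefix)
  adjacent = (a.to_i - b.to_i).abs == 1
  [same_subnet, adjacent].all?
end

# Made-up first-hop addresses, only the last digits (2 and 3) mirror what I saw.
suspicious = same_subnet_neighbours?("10.0.1.2", "10.0.1.3")
unrelated  = same_subnet_neighbours?("10.0.1.2", "10.0.7.3")
puts "suspicious: #{suspicious}, unrelated: #{unrelated}"
&lt;/code&gt;&lt;/pre&gt;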
&lt;p&gt;The paper mentioned a warm-up phase that happened before the first ping or traceroute hop. I could reproduce that. Also, every 10 to 15 pings one would take a little bit longer than the others. The phase happened on all machines on all pings and traceroutes.&lt;/p&gt;

&lt;p&gt;There are other probes you can run, but they’re a bit out of my forensic league. The indicators above made me suspicious, even just the first one alone. Maybe they’re not on the same physical host, but chances look pretty good to me. If you have more suggestions on things I could try, the instances are still running. Let me know.&lt;/p&gt;

&lt;p&gt;Do yourself a favor and read the paper, hard to swallow at times, but interesting stuff. Some people say that Amazon has already changed some internal details in EC2, making some of the issues obsolete, but it sure doesn’t seem to be the case here.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/334512725</link><guid>http://dailyawswtf.com/post/334512725</guid><pubDate>Thu, 14 Jan 2010 20:55:42 +0100</pubDate></item><item><title>SimpleDB Gotcha - A Follow-Up</title><description>&lt;p&gt;Two weeks ago we posted something about the &lt;a href="http://dailyawswtf.com/post/225746572/simpledb-gotcha"&gt;terms and conditions of SimpleDB&lt;/a&gt; and how they simply stated that data could be deleted after six months.&lt;/p&gt;

&lt;p&gt;Thankfully Amazon didn’t leave that rather ambiguous claim as it is and recently updated their customer agreement appropriately:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;5.8.2. […] If during the previous six (6) months you have incurred no fees for SimpleDB and have registered no usage of Your Amazon SimpleDB Content, we may delete, without liability of any kind, Your Amazon SimpleDB Content upon thirty (30) days prior notice to you.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Still not great to know your data might get deleted, but now you know a lot better under what conditions Amazon reserves the right to clean up after you, and they’ll notify you early enough before they would actually do so.&lt;/p&gt;

&lt;p&gt;Thanks Amazon, I’m glad you cleared that up!&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/241567795</link><guid>http://dailyawswtf.com/post/241567795</guid><pubDate>Thu, 12 Nov 2009 16:56:38 +0100</pubDate></item><item><title>SimpleDB gotcha</title><description>&lt;p&gt;While carefully reading through the &lt;a href="http://aws.amazon.com/agreement/"&gt;AWS Customer Agreement&lt;/a&gt; we found this interesting paragraph:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;5.8.2 […] We may delete, without liability of any kind, any of your Amazon SimpleDB Content that has not been accessed in the previous 6 months.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ouch!&lt;/p&gt;

&lt;p&gt;While SimpleDB keeps surprising us, for our EC2 cluster management platform &lt;a href="http://www.scalarium.com"&gt;Scalarium&lt;/a&gt; we switched to CouchDB and &lt;a href="http://www.paperplanes.de/2009/10/27/theres_something_about_redis.html"&gt;Redis&lt;/a&gt; some time ago. Turns out SimpleDB is sometimes too simple.&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/jweiss"&gt;@jweiss&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/225746572</link><guid>http://dailyawswtf.com/post/225746572</guid><pubDate>Wed, 28 Oct 2009 09:58:00 +0100</pubDate></item><item><title>Amazon Announces New EC2 And AWS Features</title><description>&lt;p&gt;Today started with a bang, there’s no doubt about that. With three simple announcements Amazon introduced two new features for their Elastic Compute Cloud: &lt;a href="http://aws.typepad.com/aws/2009/10/introducing-rds-the-amazon-relational-database-service-.html"&gt;Relational Database Service&lt;/a&gt; and &lt;a href="http://aws.typepad.com/aws/2009/10/two-new-ec2-instance-types-additional-memory.html"&gt;new high memory instance types&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Until now the most memory you could get on EC2 was 15.5GB. While that’s a lot, it’s not enough for some applications where databases have outgrown that amount of memory. If that was your excuse so far to not give EC2 a spin, you’re finally out of excuses. Today Amazon introduced two new instance types, one with 34GB and one with 68GB of memory. Clocking in at $1.20 and $2.40 per hour respectively, they aren’t cheap, but they’re now officially an option.&lt;/p&gt;

&lt;p&gt;The bigger deal though is the &lt;a href="http://aws.amazon.com/rds/"&gt;Relational Database Service&lt;/a&gt;, inarguably targeted as a competitor to Microsoft’s Azure. While I first thought of it as something similar to SimpleDB, it’s in fact a managed MySQL instance, with automated backup, maintenance and everything. You tell Amazon the time window for backups and maintenance, and you’re good to go. The prices are slightly higher than normal EC2 instances, but consider for a moment what you get in return. Managed MySQL is no picnic price-wise, and I’d consider this a really good value. Amazon uses the maintenance window to patch the database, upgrade your storage, and whatnot.&lt;/p&gt;

&lt;p&gt;Of course you have to factor in the potential downtime in your application. There’s no way of knowing if Amazon will use the full maintenance window of four hours, or just a small part of it, or not at all for any particular week.&lt;/p&gt;

&lt;p&gt;What’s interesting is the list of features they’re planning on adding in the future, in particular the automated replication across different availability zones. The mind boggles how they’re going to implement that, because MySQL’s replication sucks balls. Given the fact that you have no way of looking deeper into MySQL’s log files without proper access to the machines I’m curious how they’re going to reliably solve that problem.&lt;/p&gt;

&lt;p&gt;Which brings me to the downsides of the new RDS instances. There’s apparently no way to log into them from the outside. At least no obvious way, because the API doesn’t allow you to specify SSH keys like the EC2 API does. Also, user data doesn’t seem to be supported either. It sort of makes sense because they’re fully managed, but it’s still a bit of a bummer in my opinion.&lt;/p&gt;

&lt;p&gt;While managed MySQL in itself is awesome, a deeper look at the &lt;a href="http://docs.amazonwebservices.com/AmazonRDS/latest/APIReference/"&gt;API&lt;/a&gt; reveals some interesting details. You can modify a running RDS instance. You can increase the available storage at runtime, and you can change the instance type at runtime. Doing so usually results in an outage, but you can also make use of the maintenance window when you upgrade storage, which is the default.&lt;/p&gt;

&lt;p&gt;These are the kinds of features I find particularly interesting, and I’m wondering if and when they’ll make it to EC2. Sure, you can automate all that stuff, but it’d still be a nice feature.&lt;/p&gt;

&lt;p&gt;Questions I’m left with are: Is the backup fully reliable? If it is, do they lock the database for it? Same for snapshots I can do via the API. As always I’d wish for the API documentation to reflect potential side effects. And as usual, why only for the US? While the new instance types are available in both the US and in the EU, us EU customers are once again left waiting for the Relational Database Service. In terms of supported engines, what does the future have in store for us? Are we talking different database systems, e.g. Postgres, or different MySQL engines, like XtraDB (which would be sweet)?&lt;/p&gt;

&lt;p&gt;Oh yes, and &lt;a href="http://aws.typepad.com/aws/2009/10/amazon-ec2-now-an-even-better-value.html"&gt;EC2 pricing got cheaper&lt;/a&gt;, quite notably for larger instances, but that doesn’t excite me as much as the other two announcements of the day.&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/224765091</link><guid>http://dailyawswtf.com/post/224765091</guid><pubDate>Tue, 27 Oct 2009 11:40:31 +0100</pubDate></item><item><title>How About Reserved S3 Storage?</title><description>&lt;p&gt;During lunch with the technical leads from &lt;a href="http://www.soundcloud.com"&gt;SoundCloud&lt;/a&gt; they floated a simple yet genius idea, especially for customers with very large storage requirements on S3 like they are: Reserved Storage. Reserve so many terabytes per year for a certain amount of time, pay an upfront amount for that, and get greatly reduced monthly storage costs. Sounds like an excellent idea to me, and would be a very logical addition to Reserved Instances. They asked their local AWS evangelist about it already, so who knows? Maybe it’s somewhere on the horizon.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/214615906</link><guid>http://dailyawswtf.com/post/214615906</guid><pubDate>Fri, 16 Oct 2009 13:28:00 +0200</pubDate></item><item><title>Does Greatly Increased Network Traffic on EC2 Instances Decrease EBS Performance?</title><description>&lt;p&gt;I’m sort of throwing this out there. &lt;a href="http://www.bitbucket.org"&gt;BitBucket&lt;/a&gt;, a source code hosting service based on Mercurial and running on EC2 and off EBS volumes, had a &lt;a href="http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/"&gt;long downtime recently&lt;/a&gt;, and it was apparently caused by a DDoS attack flooding their servers with spoofed UDP traffic.&lt;/p&gt;

&lt;p&gt;Putting aside the initial oddness with Amazon’s support and that having such an insanely long downtime sucks big time, one thing is quite interesting to consider.&lt;/p&gt;

&lt;p&gt;Access to their EBS volumes was horribly slow while the network traffic peaked. The increase in network traffic seemed to correlate directly with the decrease in EBS performance, and EBS volumes are in turn read from and written to over the network. The facts at hand lead me to assume that EBS volume access happens over the same network interface, be it virtual or physical, as the normal network access to the EC2 instance.&lt;/p&gt;

&lt;p&gt;Sure, we’re talking about an extraordinary peak in traffic, but I’m quite curious how the balance works out when you have say daily peaks during the evening hours, which involve heavy I/O on your EBS volumes, for whatever reasons.&lt;/p&gt;

&lt;p&gt;You could and you should argue that this shouldn’t happen a lot, and that BitBucket is quite a special case. You should keep as much data as possible on external storage like S3, but services like BitBucket can’t rely on that. They need the data on disk, and the same is true for databases.&lt;/p&gt;

&lt;p&gt;It’s hard to think of a simple and universal solution for this, as is always the case for DDoS attacks. The traffic needs to be capped above the level of the instances, which is exactly what Amazon did in this case.&lt;/p&gt;

&lt;p&gt;In the end I hope I’m wrong with the assumption ventured in the subject, or at least that it will be fixed in the near future.&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/206582085</link><guid>http://dailyawswtf.com/post/206582085</guid><pubDate>Wed, 07 Oct 2009 11:24:59 +0200</pubDate></item><item><title>Amazon Web Services and the EU</title><description>&lt;p&gt;Just recently Amazon gave us European EC2 customers a well-deserved feature update. We got CloudWatch, Elastic Load Balancing and Auto Scaling. As an added bonus we also finally got SimpleDB access, which means less lag and no traffic costs when using SimpleDB from a European EC2 instance. Let me take that hook to talk about Amazon Web Services and the EU.&lt;/p&gt;

&lt;p&gt;That’s right, today’s Wtf is not technical, at least not on the surface. It’s about the simple fact, that us Europeans are stuck waiting for new services and new features until Amazon feels like they’re ready for us to use after US customers gave them a thorough beating.&lt;/p&gt;

&lt;p&gt;Wouldn’t it be nice to be able to use EU-based EC2 instances together with SimpleDB without having to pay traffic on both ends, because SimpleDB is only reachable on the cheap from within the US network of EC2? It sure would, and it finally is now, more than two years after the launch of SimpleDB.&lt;/p&gt;

&lt;p&gt;Wouldn’t it be awesome if we could utilize Elastic Load Balancing or the awesome CloudWatch? Oh yes, it would. But we’re sort of left in an infinite loop, without estimates on when new features will launch on the old continent. Can we do that now? Check!&lt;/p&gt;

&lt;p&gt;But in general all we can do is wait and hope for the best, and maybe a couple of months later, we’re served as well. Every time Amazon announces a new feature for one of their services, we’re sure to be left out. Maybe next time, instead of beta-testing new features in the US, why not let us Europeans have a go at it?&lt;/p&gt;

&lt;p&gt;It’s been bugging me for a while now, and while I understand that there’s technical things to consider when new features are introduced, it’s still quite frustrating not being able to use them, just because we prefer using Amazon’s Web Services over here, on the old continent.&lt;/p&gt;

&lt;p&gt;If you, Amazon, could do us one favor, it would be to not wait months or even years to introduce new features on our end of the planet. We’d very much appreciate that. Also, maybe you could try to give us a ballpark on when you’re planning on releasing new features in the EU. I’m not talking about an enterprisey roadmap, just a rough figure will do.&lt;/p&gt;

&lt;p&gt;Sincerely, some of your (still happy) European customers.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/196514644</link><guid>http://dailyawswtf.com/post/196514644</guid><pubDate>Fri, 25 Sep 2009 11:55:28 +0200</pubDate></item><item><title>Beware Of EventMachine Periodic Timers</title><description>&lt;p&gt;Sounds dramatic, right? I’m not saying you should stay away from them. There’s just one thing you need to know about them when you’re using &lt;a href="http://rubyeventmachine.com"&gt;EventMachine&lt;/a&gt; to spread out work across multiple threads.&lt;/p&gt;

&lt;p&gt;EventMachine uses the concept of a reactor thread to handle distributing the work across a pool of worker threads. You’d normally use something like &lt;code&gt;defer&lt;/code&gt; to tell EventMachine to take a thread from the pool and have it run that block.&lt;/p&gt;

&lt;p&gt;When you declare a periodic timer, you also hand it over a block:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;EM.add_periodic_timer(1) do
  # some longer running task
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But here’s the kicker: when that block is executed, EventMachine will call it in the reactor thread, even when you set up your periodic timer in a block that was called from &lt;a href="http://eventmachine.rubyforge.org/#M000212"&gt;&lt;code&gt;defer&lt;/code&gt;&lt;/a&gt;, and even though you’d somehow expect EM to just do it.&lt;/p&gt;

&lt;p&gt;Now, say you have multiple timers running, and one of them executes a longer running task. That task will block all the other timers from timing. Unless of course you’re using something like &lt;a href="http://www.espace.com.eg/neverblock"&gt;NeverBlock&lt;/a&gt;. If you rely on your timers being fired, that’s not great.&lt;/p&gt;

&lt;p&gt;The solution is thankfully rather simple, you just have to defer again inside your periodic timer’s block. It’s not great, but it works. That way, the block will again be distributed to one of the threads from the pool.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;EM.add_periodic_timer(1) do
  EM.defer do
    # some longer running task
  end
end
&lt;/code&gt;&lt;/pre&gt;
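
&lt;p&gt;If you want to see the effect without installing anything, here’s a toy model in plain Ruby. It is not EventMachine: &lt;code&gt;run_tick&lt;/code&gt; is a made-up stand-in for one pass of the reactor, and I’m spawning a fresh thread where EM would hand the block to its pool. It only shows why a slow block in the loop holds up everything behind it, and why handing the work off doesn’t.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy model, not EventMachine: one loop runs every timer callback in turn.
def run_tick(callbacks)
  started = Time.now
  callbacks.each { |callback| callback.call }
  Time.now - started
end

workers = []
quick_timer     = lambda { :tick }
# Runs the slow task right in the loop, like a plain periodic timer block.
blocking_timer  = lambda { sleep 0.2 }
# Hands the slow task to another thread (stand-in for EM.defer) and returns.
deferring_timer = lambda { workers.push(Thread.new { sleep 0.2 }) }

blocked_for  = run_tick([blocking_timer, quick_timer])
deferred_for = run_tick([deferring_timer, quick_timer])
workers.each { |worker| worker.join }

puts format("blocking: %.2fs, deferring: %.2fs", blocked_for, deferred_for)
&lt;/code&gt;&lt;/pre&gt;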

&lt;p&gt;For me that really doesn’t make any sense, and it’s not documented anywhere. Many props to &lt;a href="http://methodmissing.com"&gt;Lourens&lt;/a&gt; for helping me to find out the real problem. I know it’s not exactly related to AWS, but it suddenly becomes an issue when those long-running tasks are interactions with Amazon’s Web Services, especially when you have some calls that might take minutes, which happens occasionally.&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/168125804</link><guid>http://dailyawswtf.com/post/168125804</guid><pubDate>Fri, 21 Aug 2009 14:31:16 +0200</pubDate></item><item><title>How To Work Around SimpleDB Limitations</title><description>&lt;p&gt;SimpleDB’s feature set is a testament to its name. It’s simple, and comes with rather strict limitations. The limit most likely to affect you when you’re storing arbitrary text is the 1024 bytes you can store in one attribute. There are a couple of ways to work around that, each coming with its own problems, or imposing new limitations on your implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chunk your data horizontally&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Obviously your first choice is to split up text attributes into several SimpleDB attributes, each containing a maximum of 1024 bytes. You could do that by adding a numbered suffix to the attributes. That way you can reassemble them when you refetch the data from SimpleDB. The data you get back from it is usually unordered, so you can’t rely on the chunks being returned in the order you put them in.&lt;/p&gt;

&lt;p&gt;There’s an obvious downside to this. The limit of attributes per record in SimpleDB is 256. Multiplied by 1024 that adds up to 262144 bytes you can store in one record. Each domain can carry up to a maximum of 10 gigabytes of data. Should you max out every record to this maximum you can only have 40960 records in one domain. Not great.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chunk your data vertically&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m talking about vertically in the sense of SimpleDB’s records, to chunk data across several records. But there’s two kickers here. First, you’ll have an even smaller number of records at your disposal. Second, each put on a record to SimpleDB can store exactly one record. So you need to store them independently. Since SimpleDB is eventually consistent there’s no guarantee you’ll be able to fetch both records on a subsequent select.&lt;/p&gt;

&lt;p&gt;Don’t get me started on chunking across several domains, that’s just nasty.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think you’ll agree that the above options, well, they suck. You’ll hit other limitations either way, and pretty fast depending on the frequency you’re collecting your data in.&lt;/p&gt;

&lt;p&gt;So the most obvious thing to do is put the bigger data on S3. Store them with a nice key that’s derived from your record’s identity and type. You don’t need to partition anything, you’ll just put the file, and you’re done.&lt;/p&gt;

&lt;p&gt;There’s a similar problem to the solution above: You’ll have two separate operations that are eventually consistent on their own. When you created the SimpleDB record there’s no guarantee that the data on S3 will already be available. I guess in general that’s something you can live with though, and it seems to be the most viable option.&lt;/p&gt;

&lt;p&gt;Thanks to the nature of eventual consistency S3 is only a viable option if you have non-streaming data to store. If your data arrives in chunks, you’ll run into synchronization problems pretty quickly. If the data comes in asynchronously there’s a good chance that one process will overwrite changes from another.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
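&lt;p&gt;For the horizontal variant, here’s a minimal Ruby sketch of the numbered-suffix idea, plus the arithmetic from above. It never talks to SimpleDB, and &lt;code&gt;chunk_attributes&lt;/code&gt; and &lt;code&gt;reassemble&lt;/code&gt; are names I made up for illustration. It also splits on raw bytes, so with multi-byte characters a single chunk may not be valid text on its own. For a real put you’d probably want to split on character boundaries or encode the chunks first.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;MAX_ATTRIBUTE_BYTES = 1024
MAX_ATTRIBUTES      = 256
DOMAIN_BYTES        = 10 * 1024**3

# Hypothetical helper: split text into attributes named body_000, body_001, ...
def chunk_attributes(name, text, limit = MAX_ATTRIBUTE_BYTES)
  attributes = {}
  text.bytes.each_slice(limit).each_with_index do |slice, index|
    attributes[format("%s_%03d", name, index)] = slice.pack("C*")
  end
  attributes
end

# Hypothetical helper: sort by suffix, since the order coming back is not guaranteed.
def reassemble(name, attributes)
  keys = attributes.keys.select { |key| key.start_with?("#{name}_") }.sort
  keys.map { |key| attributes[key] }.join.force_encoding(Encoding::UTF_8)
end

record_bytes = MAX_ATTRIBUTES * MAX_ATTRIBUTE_BYTES   # 262144 bytes per record
max_records  = DOMAIN_BYTES / record_bytes            # 40960 maxed-out records per domain

text     = "lorem ipsum " * 300
chunks   = chunk_attributes("body", text)
shuffled = chunks.to_a.shuffle.to_h   # simulate attributes coming back unordered
restored = reassemble("body", shuffled)

puts "#{chunks.size} chunks, restored intact: #{restored == text}"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The zero-padded suffix is what makes a plain string sort put the chunks back in order. Three digits are enough here because a record can’t hold more than 256 attributes anyway.&lt;/p&gt;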
&lt;p&gt;In general I’m not entirely happy with all the limitations SimpleDB imposes, I just scratched the surface here on the ones we’re currently struggling with. Sure, it’s close to being always on, available everywhere and pretty fast, but still, it leaves a bit of a foul taste with me.&lt;/p&gt;

&lt;p&gt;Also, if you’re using SimpleDB from an EC2 EU instance, you should be aware that you’re not getting the traffic for free. You pay for both the traffic from and to the EC2 instance, and the SimpleDB traffic. It’s not great.&lt;/p&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/164808141</link><guid>http://dailyawswtf.com/post/164808141</guid><pubDate>Mon, 17 Aug 2009 13:00:03 +0200</pubDate></item><item><title>Cloud Links</title><description>&lt;p&gt;Lots of great material out there dealing with Amazon Web Services, let’s spread some link love.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://alestic.com/2009/08/ec2-talk"&gt;Presentation: Building Custom Linux Images for Amazon EC2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://clouddevelopertips.blogspot.com/2009/07/boot-ec2-instances-from-ebs.html"&gt;Boot EC2 Instances from EBS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://clouddevelopertips.blogspot.com/2009/07/ec2-instance-life-cycle.html"&gt;The EC2 Instance Life Cycle&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://alestic.com/2009/08/ec2-mysql-slave-snapshot"&gt;EBS Snapshots of a MySQL Slave Database on EC2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/"&gt;EC2/EBS single and RAID volumes IO benchmark&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.mysqlperformanceblog.com/2009/08/07/dissection-of-ec2-ebs-volume/"&gt;Dissection of EC2 / EBS volume&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And, if the cloud feels like something ungraspable to you, have a look at some &lt;a href="http://societyofclouds.blogspot.com/2009/08/dying-art.html"&gt;Polaroids of clouds&lt;/a&gt;, maybe that helps.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/162240315</link><guid>http://dailyawswtf.com/post/162240315</guid><pubDate>Thu, 13 Aug 2009 21:33:06 +0200</pubDate></item><item><title>Creating EC2 AMIs With Fresh SSH Host Keys </title><description>&lt;p&gt;Having a specialized EC2 AMI for your application is very useful and allows you to have one common platform to test and deploy updates. Creating and maintaining EC2 AMIs is not difficult. You can use the EC2 AMI tools by hand or the EC2 console.&lt;/p&gt;

&lt;p&gt;When you create or update an AMI from a running EC2 instance (bundling), there are a couple of things to think of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear log files&lt;/li&gt;
&lt;li&gt;Clear AWS keys&lt;/li&gt;
&lt;li&gt;Clear shell history&lt;/li&gt;
&lt;li&gt;Do not bundle /tmp&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to know more, Eric Hammond, who maintains the wonderful Alestic Ubuntu AMIs, recently gave an &lt;a href="http://alestic.com/2009/08/ec2-talk"&gt;informative presentation on creating AMIs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A small hint that I want to share with you is creating fresh/new SSH host keys on your AMIs. The Alestic Ubuntu images create a new key pair for the SSH daemon at boot time. If you use those as a base, usually you go on and install a couple of packages and make your changes. Then you re-bundle and distribute your AMI. The problem is that the SSH host keys have now been generated and every instance (either yours or other people’s if you share your AMI) uses those very same keys. This is bad from a security viewpoint (Man-in-the-middle attack).&lt;/p&gt;

&lt;p&gt;So what you want to do is to re-generate the SSH keys before you bundle the AMI. Deleting the keys alone will not work as sshd will not function without the keys present. So you need to generate some at the first boot. But only at the first boot, on subsequent boots the key should stay.&lt;/p&gt;

&lt;p&gt;The Alestic AMIs solve this nicely by having a /etc/init.d/ec2-ssh-host-key-gen script that will delete the existing keys and generate new ones. At the end of the script, it makes itself un-executable. This will prevent future runs of the script.&lt;/p&gt;

&lt;p&gt;So if you want to re-bundle your AMI with fresh SSH keys, just make this script executable again and then bundle the AMI. This will create AMIs that will generate a new key pair on boot and keep it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# chmod a+x /etc/init.d/ec2-ssh-host-key-gen
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;— &lt;a href="http://twitter.com/jweiss"&gt;@jweiss&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/161291679</link><guid>http://dailyawswtf.com/post/161291679</guid><pubDate>Wed, 12 Aug 2009 16:15:00 +0200</pubDate></item><item><title>You Could Have It So Much Better With Nanite And Redis</title><description>&lt;p&gt;&lt;a href="http://github.com/ezmobius/nanite/tree/master"&gt;Nanite&lt;/a&gt; is a perfect match for a computing cloud (you know I’m talking about EC2, right?), its major feature is that you can assemble a cluster of daemons dynamically. In the cloud, where resources come and go, that’s just what you want. It scales pretty well, and whenever you fire up new agents to share increasing work load, they register themselves in the cluster, and are immediately available to handle requests.&lt;/p&gt;

&lt;p&gt;The work is distributed by another tier consisting of mappers. They delegate the work to the agents they know based on an algorithm of your choosing. For that to work all mappers must know as many of the agents as possible, all of them would obviously be the best.&lt;/p&gt;

&lt;p&gt;Agents send out heartbeats which are by default picked up by all mappers. They update their internal status map of the agents they know. The status is removed as soon as an agent terminates or times out.&lt;/p&gt;

&lt;p&gt;Requests (meaning someone wants the cluster to get off their lazy butts and do something) usually go through a mapper, that’s where you normally send them from. However, you can also send requests from one agent to another. Those requests by default again go through all mappers and are then sent to the appropriate agents.&lt;/p&gt;

&lt;p&gt;Hold on! Picked up by all mappers, and sent out again? Yes, that’s right, and usually not something you want.&lt;/p&gt;

&lt;p&gt;Here’s another scenario: One of your mappers goes down, or even several of them, or you need to bring up new mappers because your existing ones have too much work to do. When they go down, all their knowledge of agents is lost, so when they come back up, it takes a while until they pick up the heartbeats of all the agents in the cluster. You’re right thinking that sucks. Because it does.&lt;/p&gt;

&lt;p&gt;I’ve contemplated whether that’s a bug, but think about it. When you don’t have shared state amongst all the mappers, how’s an agent to know if all the mappers know of all the other agents in the cluster? What if the one mapper that picks it up currently has no knowledge of the agent that handles the service you requested?&lt;/p&gt;

&lt;p&gt;But wait! The solution is at hand. Nanite can use &lt;a href="http://code.google.com/p/redis/"&gt;redis&lt;/a&gt; to store the agent’s status. Wait, it gets better. When you use Nanite with redis, only one and exactly one mapper picks up the messages from both the heartbeat and the request queue. There’s also a third queue involved, which is used for registration. That too will only end up at one mapper.&lt;/p&gt;

&lt;p&gt;Now that’s neat, because the one mapper picking up the message will update the status in redis, and just like that all the other mappers know about new and terminated agents. You don’t lose anything. If a mapper goes down the knowledge about the agents is still in the cluster, and if it comes back up again, it immediately knows about all the agents and their services. redis ensures that your work will be evenly spread out across your cluster, so do yourself a favor and use it. Not exclusively with Nanite of course, but it’d be a start.&lt;/p&gt;

&lt;p&gt;But what if redis goes down? It’s not really a big deal is it? If you take the appropriate measures and secure undeliverable messages in an offline queue (yeah, Nanite has support for that), you can just bring it back up and your cluster will pretty much come back up all by itself. If you want to ensure that can’t happen, you can replicate redis. Or rely on the redis Ruby library and its ring-hashing. Did I mention that redis can store the data on disk as well? Yeah, that’s nice too. Not 100% reliable as the data is written asynchronously, but still good enough.&lt;/p&gt;

&lt;p&gt;That’s why you want to use redis with Nanite. As soon as you have several mappers running in your Nanite cluster, you’re doing its health and scalability a big favor.&lt;/p&gt;

&lt;p&gt;After reading all this you totally deserve a pro tip: When you work with redis, be sure to keep your mappers’ timeouts longer than the ping time of the agents. Why? Because I tore my hair out figuring out why my cluster forgot its agents every now and then. If you keep your agents’ ping times longer than or as long as the timeouts on the mappers, chances are that an agent will time out on one of them, which will in turn remove it from the cluster until its next ping. We’re not talking about a once-in-a-million situation, it’ll happen, trust me.&lt;/p&gt;
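
&lt;p&gt;To make the timing concrete, here’s a minimal sketch in Python. It is not Nanite code: the function name and the numbers are made up for illustration, and it assumes a mapper drops an agent once the silence since its last ping reaches the timeout. It computes how much slack is left each time a ping arrives:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def timeout_slack(ping_arrivals_ms, timeout_ms):
    """Return the slack (timeout minus gap) for each gap between pings.

    Hypothetical model: zero or negative slack means the mapper may
    already have timed the agent out before the next ping arrived.
    """
    gaps = [later - earlier
            for earlier, later in zip(ping_arrivals_ms, ping_arrivals_ms[1:])]
    return [timeout_ms - gap for gap in gaps]


# An agent pings every 15 s, but delivery jitters by a few hundred ms.
arrivals = [0, 15000, 30400, 45100]

print(timeout_slack(arrivals, 15000))  # [0, -400, 300]
print(timeout_slack(arrivals, 30000))  # [15000, 14600, 15300]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With the timeout equal to the ping interval, a little jitter is enough to push the slack to zero or below. Doubling the timeout leaves plenty of room.&lt;/p&gt;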

&lt;p&gt;— &lt;a href="http://twitter.com/roidrage"&gt;@roidrage&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/160760527</link><guid>http://dailyawswtf.com/post/160760527</guid><pubDate>Tue, 11 Aug 2009 22:52:06 +0200</pubDate></item><item><title>Identify an EBS Volume's Filesystem Type</title><description>&lt;p&gt;When you manage cloud infrastructure automatically (with &lt;a href="http://wiki.opscode.com/display/chef/Home"&gt;Chef&lt;/a&gt; maybe?), it’s not a given that you always know if an EBS volume has been properly initialized with a filesystem. Before you start parsing output from parted or fdisk, you should be aware of the awesome tool &lt;code&gt;blkid&lt;/code&gt;. Its purpose is just what we need, to output a device’s filesystem, and to give a proper return code if it doesn’t have any filesystem, or there’s no device at all. We’re speaking in terms of Linux in this case, your mileage may vary with other operating systems.&lt;/p&gt;

&lt;p&gt;The default output is to display some variables and their values. Nice to parse and all, but we really only want the filesystem type, so we use the following command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;blkid -s TYPE -o value /dev/sdi
&lt;/code&gt;&lt;/pre&gt;
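
&lt;p&gt;If you’d rather call that from a script than from the shell, a wrapper might look like the following minimal sketch in Python. The function name and the &lt;code&gt;run&lt;/code&gt; parameter are made up for this example. Any non-zero exit status from blkid is treated as “no filesystem found.”&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import subprocess


def filesystem_type(device, run=subprocess.run):
    """Return the filesystem type on device, or None if blkid finds none.

    The run parameter exists only so the call can be swapped out in tests.
    """
    result = run(
        ["blkid", "-s", "TYPE", "-o", "value", device],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # blkid exits with 2 when there is no filesystem or no such device.
        return None
    return result.stdout.strip() or None
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;filesystem_type("/dev/sdi")&lt;/code&gt; then gives you either the type as a string or &lt;code&gt;None&lt;/code&gt;, which is easy to branch on before formatting a volume.&lt;/p&gt;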

&lt;p&gt;Sweet! On a properly initialized volume it’ll just print out the filesystem type, and otherwise it won’t print anything and return 2. I’m sure you’ll figure out how to use that in your automated infrastructure.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/159789510</link><guid>http://dailyawswtf.com/post/159789510</guid><pubDate>Mon, 10 Aug 2009 15:12:10 +0200</pubDate></item><item><title>Announcing the Amazon Web Services User Group Berlin</title><description>&lt;p&gt;We’re happy to announce the first meetup of the Berlin Amazon Web Services User Group. It’ll be on September 28, starting at 8 pm. You can RSVP on &lt;a href="http://www.amiando.com/AWS-Berlin.html"&gt;Amiando&lt;/a&gt; or on &lt;a href="https://www.xing.com/events/first-amazon-web-services-user-group-berlin-379605"&gt;Xing&lt;/a&gt;. Location is still to-be-announced. We’ll keep you posted.&lt;/p&gt;

&lt;p&gt;Special guest on this occasion will be Martin Buhr, European Business Directory - Amazon Web Services. So if there’s anything you ever wanted to know, drop by and say hi! Of course, if you have your own wtf’s with AWS, feel free to take the opportunity to talk about them.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/159697386</link><guid>http://dailyawswtf.com/post/159697386</guid><pubDate>Mon, 10 Aug 2009 11:05:48 +0200</pubDate></item><item><title>Elastic Load Balancer and EC2 instance bandwidth</title><description>&lt;p&gt;So &lt;a href="http://www.peritor.com"&gt;we&lt;/a&gt; are working on a caching-related project on EC2. In this scenario high performance is very important.&lt;/p&gt;

&lt;p&gt;We set up a &lt;a href="http://varnish.projects.linpro.no/"&gt;Varnish&lt;/a&gt; cluster on EC2 to evaluate whether it can replace an existing caching infrastructure in terms of costs and requests per second. Our benchmarks yielded some interesting results. It seemed that for our caching scenario the limiting factor is bandwidth. Varnish is very humble with CPU/RAM consumption. We could easily deliver 500 to 600 requests per second with a small instance and have the box around 95% idle (uncompressed content).&lt;/p&gt;

&lt;p&gt;It turns out we are limited by bandwidth and not by CPU.&lt;/p&gt;

&lt;p&gt;In our benchmarks we were only able to push 35 MB/s on small instances. So the actual requests per second depended on the size of the objects we were pushing. The limit was always ~35 MB/s. Our typical HTML pages were around 50 to 70 KB, so we couldn’t reach the desired requests per second as our instance was at its bandwidth limit.&lt;/p&gt;
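
&lt;p&gt;As a back-of-the-envelope check, the ceiling follows from dividing the bandwidth by the object size. The sketch below is in Python, with a made-up function name. It assumes 1 MB = 1000 KB and ignores protocol overhead, so treat the numbers as rough upper bounds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def max_requests_per_second(bandwidth_mb_per_s, object_size_kb):
    """Upper bound on requests per second when bandwidth is the only limit.

    Assumes 1 MB = 1000 KB and ignores protocol overhead.
    """
    return bandwidth_mb_per_s * 1000 / object_size_kb


for size_kb in (50, 70):
    print(size_kb, "KB pages:", max_requests_per_second(35, size_kb), "req/s")
# 50 KB pages: 700.0 req/s
# 70 KB pages: 500.0 req/s
&lt;/code&gt;&lt;/pre&gt;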

&lt;p&gt;Usually when one instance hits its resource limits you load balance multiple ones. HAProxy is a fine example of a very robust TCP/HTTP load balancer. The problem, though, is that it will not increase your bandwidth, as all your traffic has to go through this one HAProxy instance. So even when you load balance multiple instances, each capable of pushing ~35 MB/s (~350 MB/s in total with 10 small instances), the bottleneck will still be at ~35 MB/s (aka the load balancer).&lt;/p&gt;
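
&lt;p&gt;Put as a tiny model (a Python sketch with a hypothetical function name, assuming every byte passes through the balancer), the slower of the two sides sets the limit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def proxied_throughput(backend_count, per_instance_mb_per_s, balancer_mb_per_s):
    """Aggregate throughput when all traffic passes through one balancer.

    Simplified model: the slower of the balancer and the combined
    backends sets the limit.
    """
    return min(balancer_mb_per_s, backend_count * per_instance_mb_per_s)


print(proxied_throughput(1, 35, 35))   # 35
print(proxied_throughput(10, 35, 35))  # 35
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Adding backends changes nothing as long as the balancer itself tops out at ~35 MB/s.&lt;/p&gt;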

&lt;p&gt;So any load balancing that is driven by an EC2 instance will limit the bandwidth. The maximum you can get is the load balancer’s bandwidth. If you want/need more than that there is only network level load balancing left. Some more advanced load balancing solutions (ARP/router level) offer features like this.&lt;/p&gt;

&lt;p&gt;The question was, can Amazon’s &lt;a href="http://aws.amazon.com/elasticloadbalancing/"&gt;Elastic Load Balancer&lt;/a&gt; do this?&lt;/p&gt;

&lt;p&gt;After setting up an Elastic Load Balancer and configuring multiple instances as its backends, the answer was: &lt;strong&gt;No, it can’t&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It seems like the Elastic Load Balancer is also limited to a single Ethernet connection of at most 1 Gbit/s (maybe it is also just a small EC2 instance?) and thus cannot increase the bandwidth beyond 35 MB/s. This is even more critical if you use larger instances, as it actually decreases your bandwidth. More on that in a minute.&lt;/p&gt;

&lt;p&gt;So with Elastic Load Balancer out of the question, the only available solution on EC2 is &lt;a href="http://en.wikipedia.org/wiki/Round_robin_DNS"&gt;DNS Round Robin&lt;/a&gt;. There, you’ve heard it. Yes, the old and ugly DNS Round Robin.&lt;/p&gt;

&lt;p&gt;DNS Round Robin will allow you to increase your bandwidth with every entry/instance you add. The only problem is that it is a bit inflexible and you can’t route the traffic yourself as DNS clients are picking each target/instance as they like. For a small number of instances (2-4 maybe) it is tolerable and solves our bandwidth problem.&lt;/p&gt;
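
&lt;p&gt;Here’s a rough Python sketch of why that adds up. It uses a deliberately simplified model: the DNS server rotates the record order on every lookup and each client connects to the first address it gets. The function name and the addresses are placeholders; caching resolvers and client behavior make real traffic less even than this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter
from itertools import cycle


def round_robin_share(addresses, lookups):
    """Count how often each address is handed out first.

    Simplified model: the record order rotates on every lookup and
    each client connects to the first address it receives.
    """
    rotation = cycle(addresses)
    return Counter(next(rotation) for _ in range(lookups))


# Placeholder addresses from the documentation range 192.0.2.0/24.
instances = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
print(round_robin_share(instances, 9000))

# With an even spread, the ceiling is the sum of the instances' bandwidth.
print(len(instances) * 35, "MB/s")  # 105 MB/s
&lt;/code&gt;&lt;/pre&gt;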

&lt;p&gt;Further, it seems the bigger the instance, the more bandwidth you have. Amazon does not guarantee any bandwidth, but on the XL instances we guess that you have a physical server to yourself, so you get this box’s bandwidth to yourself. On the smaller instances the bandwidth is shared and thus can also be worse than in our benchmarks.&lt;/p&gt;

&lt;p&gt;So our solution is to use DNS Round Robin for two to three HighCPU medium instances. This proved to be very cost-effective and the HighCPU medium instances push out more bytes per second than the small instances.&lt;/p&gt;

&lt;p&gt;A follow-up post will show the exact number for each instance.&lt;/p&gt;

&lt;p&gt;—
&lt;a href="http://twitter.com/jweiss"&gt;@jweiss&lt;/a&gt;&lt;/p&gt;</description><link>http://dailyawswtf.com/post/157140960</link><guid>http://dailyawswtf.com/post/157140960</guid><pubDate>Thu, 06 Aug 2009 14:55:00 +0200</pubDate></item><item><title>Running RabbitMQ on EC2</title><description>&lt;p&gt;So running RabbitMQ on EC2 is pretty straight-forward, no surprises there. Get the correct packages from &lt;a href="http://www.rabbitmq.com/server.html"&gt;their website&lt;/a&gt;, and install them. We did just that, and bundled our own images which are fueled by some user data to configure the initial vhosts and users with their permissions.&lt;/p&gt;

&lt;p&gt;From time to time that would fail, and the running RabbitMQ was busted. Killing it and restarting brought the cure, but hey, we’re in the cloud, things should work automagically.&lt;/p&gt;

&lt;p&gt;When I checked the running instance, it showed that its node name was “rabbit@(node).” Everyone I asked who knew Erlang agreed that it shouldn’t say “(node)” but instead have a real host name in there.&lt;/p&gt;

&lt;p&gt;So my guess was that RabbitMQ is just booting too early. Its start level number, at least on Ubuntu, is 20, which is quite early for a service like it. We’re using the &lt;a href="http://alestic.com/"&gt;Alestic&lt;/a&gt; images as a base, and I had a look at the custom start script it’s using to fetch user data from Amazon’s web service. Turns out these scripts use the following line to wait until the network is up:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;perl -MIO::Socket::INET -e '
until(new IO::Socket::INET("169.254.169.254:80")) {
  print"Waiting for network...\n";sleep 1
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So I put this in the RabbitMQ init script, just before the commands that start the server, and I also bumped up the start level, and rebundled the image. Lo and behold, no problems since then.&lt;/p&gt;
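
&lt;p&gt;If your boot scripts are in Python rather than Perl, roughly the same wait loop could look like the sketch below. The function name and parameters are made up for this example. Unlike the one-liner above, it gives up after a fixed number of attempts instead of retrying forever.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import socket
import time


def wait_for_network(host="169.254.169.254", port=80, attempts=60, delay=1.0):
    """Try to open a TCP connection until it succeeds or attempts run out.

    Returns True once the connection works, False otherwise.
    """
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            print("Waiting for network...")
            time.sleep(delay)
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The default address is the same metadata endpoint the Perl snippet polls.&lt;/p&gt;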

&lt;p&gt;Another issue: Mnesia (used by RabbitMQ to store routing information and some more data) apparently saves the current IP address or host name. So if you want to bundle your own RabbitMQ image, be sure to delete /var/lib/rabbitmq/mnesia before doing so. When starting up, RabbitMQ will recreate that directory.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/156314426</link><guid>http://dailyawswtf.com/post/156314426</guid><pubDate>Wed, 05 Aug 2009 12:03:31 +0200</pubDate></item><item><title>Permissions for Private EC2 AMI Images on S3</title><description>&lt;p&gt;We had a weird issue with EC2 not being able to read our AMI images stored on S3. It reported an error 403 with the suggestion to check the ACL permissions. That’s not very helpful all by itself, if the permissions seem fine with regards to its owner.&lt;/p&gt;

&lt;p&gt;Turns out EC2 uses a special user id for fetching the images, and that user needs read access to the files in your bucket. Behold the user’s name, it’s za-team. Obviously you won’t be able to just go ahead and use S3Fox to set permissions for that user with just the name, you need the hashed id in its full glory. Let me save you the trouble, here it is: 6aa5a366c34c1cbe25dc49211496e913e0351eb0e8c37aa3477e40942ec6b97c.&lt;/p&gt;

&lt;p&gt;Neat, huh?&lt;/p&gt;

&lt;p&gt;Also, beware of using S3Fox to recursively set permissions on “folders.” It might just remove the user za-team without you noticing it.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/155758689</link><guid>http://dailyawswtf.com/post/155758689</guid><pubDate>Tue, 04 Aug 2009 19:34:30 +0200</pubDate></item><item><title>Disassociating an Elastic IP</title><description>&lt;p&gt;When you do that, it takes several minutes for EC2 to pick it up and automatically assign it a new IP from the pool. Unless of course you assign it a new IP from your pool of Elastic IPs.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/155752574</link><guid>http://dailyawswtf.com/post/155752574</guid><pubDate>Tue, 04 Aug 2009 19:23:12 +0200</pubDate></item><item><title>Wtf?</title><description>&lt;p&gt;Working with the Amazon Web Services every day, you’re bound to come across oddities and bits that aren’t really documented anywhere. That’s what this is about. Bloopers, little bits, code samples, it’ll all go in here.&lt;/p&gt;</description><link>http://dailyawswtf.com/post/155773830</link><guid>http://dailyawswtf.com/post/155773830</guid><pubDate>Tue, 04 Aug 2009 19:03:00 +0200</pubDate></item></channel></rss>
