Daily AWS Wtf: An endless stream of thoughts, wtf's and fixes for the latter on Amazon Web Services.
January 14, 2010 at 8:55pm

A Word on EC2 Instance Proximity

This morning I had an incident that wasn’t a first for me, so I investigated a bit further. I got the feeling that two instances I launched ended up being on the same physical machine.

Disclaimer: This is in no way a fully forensic investigation; I’m merely putting some data together. The only true way to find out where the instances are located would be to walk into Amazon’s data center and check, but I’m not too sure that’s gonna be possible any time soon. Feel free to add additional information in the comments. If I’m totally off with my guess, please let me know.

I tested in the EU regions, so your mileage may vary in others. I’m not posting this to make EC2 look bad. These are musings and findings, and I find it quite interesting to look into stuff like that, if only to find out why the instance’s booting was so gosh darn slow.

I fired up two small EC2 instances in the same availability zone while doing some testing for Scalarium. Our workflow involves initial provisioning right after boot which usually takes quite a while on small instances.

This time it was different, because the provisioning part took a lot longer than usual, on both instances, at least twice as long. This has happened before and it reminded me of this paper on “Exploring Information Leakage in Third-Party Compute Clouds”. Recommended read if you’re spending quality time in the cloud and want to know if and how others are able to guess the physical location of your instance.

Guessing just from the load is a long shot, but it’s an indication. The instances were launched almost simultaneously, and according to the paper EC2 launches instances across machines in order. So two instances of the same type, launched in the same availability zone have a good chance of being on the same physical machine.

The paper mentions some other indicators, so I investigated:

  • Close IP proximity. The last digit in the first hop’s IP address was 2 on the first, and 3 on the second instance. Same subnet, IP increased by one. Supposedly that hop is the Dom0 address, which goes back to the hypervisor.

  • Traceroute. Doing a traceroute from one instance to the other had only one hop. I tracerouted another one I launched some time later: Four hops.

  • Short packet round trip. The round trip was short, usually less than 0.5ms, but the same was true for the third instance. That’s throwing off the whole initial theory a bit, because maybe that means the two initial instances didn’t run on the same machine. You could argue that the pings on the first two instances were a bit faster, but not all the time, we’re talking about a scale of less than 0.05ms. I’m not sure if that counts.
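
If you want to run the first two checks on your own instances, here is a minimal Ruby sketch. The method names, the /24 prefix length and the sample addresses are all made up for illustration, and a positive result is only a hint, not proof that two instances share a host.

```ruby
require "ipaddr"

# True when both addresses share a subnet and differ by exactly one.
# The /24 prefix length is an assumption, not something EC2 documents.
def adjacent_in_subnet?(ip_a, ip_b, prefix_length = 24)
  a = IPAddr.new(ip_a)
  b = IPAddr.new(ip_b)
  a.mask(prefix_length) == b.mask(prefix_length) && (a.to_i - b.to_i).abs == 1
end

# Counts the numbered hop lines in the text output of `traceroute`.
def hop_count(traceroute_output)
  traceroute_output.lines.count { |line| line.match?(/\A\s*\d+\s/) }
end

# Invented sample data, shaped like the observations described above.
SAMPLE_TRACEROUTE = <<~OUTPUT
  traceroute to 10.224.0.77 (10.224.0.77), 30 hops max, 60 byte packets
   1  10.224.0.77 (10.224.0.77)  0.412 ms  0.398 ms  0.401 ms
OUTPUT

puts adjacent_in_subnet?("10.224.0.2", "10.224.0.3")
puts hop_count(SAMPLE_TRACEROUTE)
```

Feed it the first-hop addresses seen from each instance and the traceroute output between them. Adjacent first hops plus a single hop between the instances is the combination that made me suspicious.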

The paper mentioned a warm-up phase that happened before the first ping or traceroute hop. I could reproduce that. Also, every 10 to 15 pings one would take a little bit longer than the others. The phase happened on all machines on all pings and traceroutes.

There’s other probes you can do, but they’re a bit out of my forensic league. The indicators above made me suspicious, even just the first one alone. Maybe they’re not on the same physical host, chances look pretty good to me though. If you have more suggestions on things I could try, the instances are still running. Let me know.

Do yourself a favor and read the paper. It’s hard to swallow at times, but interesting stuff. Some people say that Amazon has changed some internal details in EC2 already, making some of the issues obsolete, but it sure doesn’t seem to be the case here.

November 12, 2009 at 4:56pm

SimpleDB Gotcha - A Follow-Up

Two weeks ago we posted something about the terms and conditions of SimpleDB and how they simply stated that data could be deleted after six months.

Thankfully Amazon didn’t leave that rather ambiguous claim as it is and recently updated their customer agreement appropriately:

5.8.2. […] If during the previous six (6) months you have incurred no fees for SimpleDB and have registered no usage of Your Amazon SimpleDB Content, we may delete, without liability of any kind, Your Amazon SimpleDB Content upon thirty (30) days prior notice to you.

Still not great to know your data might get deleted, but now you know a lot better under what conditions Amazon reserves the right to clean up after you, and they’ll notify you early enough before they would actually do so.

Thanks Amazon, I’m glad you cleared that up!

@roidrage

October 28, 2009 at 9:58am

SimpleDB gotcha

While carefully reading through the AWS Customer Agreement we found this interesting paragraph:

5.8.2 […] We may delete, without liability of any kind, any of your Amazon SimpleDB Content that has not been accessed in the previous 6 months.

Ouch!

While SimpleDB keeps surprising us, for our EC2 cluster management platform Scalarium we switched to CouchDB and Redis some time ago. Turns out SimpleDB is sometimes too simple.

@jweiss

October 27, 2009 at 11:40am

Amazon Announces New EC2 And AWS Features

Today started with a bang, there’s no doubt about that. With three simple announcements Amazon introduced two new features for their Elastic Compute Cloud: Relational Database Service and new high memory instance types.

Until now the most memory you could get on EC2 was 15.5GB. While that’s a lot, it’s not enough for some applications where databases have outgrown that amount of memory. If that was your excuse so far to not give EC2 a spin, you’re finally out of excuses. Today Amazon introduced two new instance types, one with 34GB and one with 68GB of memory. Coming in at $1.20 and $2.40 per hour respectively, they aren’t cheap, but they’re now officially an option.

The bigger deal though is the Relational Database Service, inarguably targeted as a competitor to Microsoft’s Azure. While I first thought of it as something similar to SimpleDB, it’s in fact a managed MySQL instance, with automated backup, maintenance and everything. You tell Amazon the time window for backups and maintenance, and you’re good to go. The prices are slightly higher than normal EC2 instances, but consider for a moment what you get in return. Managed MySQL is no picnic price-wise, and I’d consider this a really good value. Amazon uses the maintenance window to patch the database, upgrade your storage, and whatnot.

Of course you have to factor in the potential downtime in your application. There’s no way of knowing if Amazon will use the full maintenance window of four hours, or just a small part of it, or not at all for any particular week.

What’s interesting is the list of features they’re planning on adding in the future, in particular the automated replication across different availability zones. The mind boggles how they’re going to implement that, because MySQL’s replication sucks balls. Given the fact that you have no way of looking deeper into MySQL’s log files without proper access to the machines I’m curious how they’re going to reliably solve that problem.

Which brings me to the downsides of the new RDS instances. There’s apparently no way to log into them from the outside. At least no obvious way, because the API doesn’t allow you to specify SSH keys like the EC2 API does. Also, user data doesn’t seem to be supported either. It sort of makes sense because they’re fully managed, but it’s still a bit of a bummer in my opinion.

While managed MySQL in itself is awesome, a deeper look at the API reveals some interesting details. You can modify a running RDS instance. You can increase the available storage at runtime, and you can change the instance type at runtime. Doing so usually results in an outage, but you can also make use of the maintenance window when you upgrade storage, which is the default.

These are the kinds of features I find particularly interesting, and I’m wondering if and when they’ll make it to EC2. Sure, you can automate all that stuff, but it’d still be a nice feature.

Questions I’m left with are: Is the backup fully reliable? If it is, do they lock the database for it? Same for snapshots I can do via the API. As always I’d wish for the API documentation to reflect potential side effects. And as usual, why only for the US? While the new instance types are available in both the US and in the EU, us EU customers are once again left waiting for the Relational Database Service. In terms of supported engines, what does the future have in store for us? Are we talking different database systems, e.g. Postgres, or different MySQL engines, like XtraDB (which would be sweet)?

Oh yes, and EC2 pricing got cheaper, quite notably for larger instances, but that doesn’t excite me as much as the other two announcements of the day.

@roidrage

October 16, 2009 at 1:28pm

How About Reserved S3 Storage?

During lunch with the technical leads from SoundCloud they floated a simple yet genius idea, especially for customers with very large storage requirements on S3 like they are: Reserved Storage. Reserve so many terabytes per year for a certain amount of time, pay an upfront amount for that, and get greatly reduced monthly storage costs. Sounds like an excellent idea to me, and would be a very logical addition to Reserved Instances. They asked their local AWS evangelist about it already, so who knows? Maybe it’s somewhere on the horizon.

October 7, 2009 at 11:24am

Does Greatly Increased Network Traffic on EC2 Instances Decrease EBS Performance?

I’m sort of throwing this out there. BitBucket, a source code hosting service based on Mercurial and running on EC2 and off EBS volumes, had a long downtime recently, and it was apparently caused by a DDoS attack flooding their servers with spoofed UDP traffic.

Putting aside the initial oddness with Amazon’s support and that having such an insanely long downtime sucks big time, one thing is quite interesting to consider.

Access to their EBS volumes was horribly slow while the network traffic peaked. The increase in network traffic seemed to correlate directly with the decrease in EBS volume performance, and those volumes are in turn read from and written to over the network. The facts at hand lead me to assume that EBS volume access happens over the same network interface, be it virtual or physical, as the normal network access to the EC2 instance.

Sure, we’re talking about an extraordinary peak in traffic, but I’m quite curious how the balance works out when you have say daily peaks during the evening hours, which involve heavy I/O on your EBS volumes, for whatever reasons.

You could and you should argue that this shouldn’t happen a lot; BitBucket is quite a special case. You should keep as much data as possible on external storage like S3, but services like BitBucket can’t rely on that. They need the data on disk, and the same is true for databases.

It’s hard to think of a simple and universal solution for this, as is always the case for DDoS attacks. The traffic needs to be capped above the level of the instances, which is exactly what Amazon did in this case.

In the end I hope I’m wrong with the assumption ventured in the subject, or at least that it will be fixed in the near future.

@roidrage

September 25, 2009 at 11:55am

Amazon Web Services and the EU

Just recently Amazon gave us European EC2 customers a well-deserved feature update. We got CloudWatch, Elastic Load Balancing and Auto Scaling. As an added bonus we also finally got SimpleDB access, which means less lag and no traffic costs when using SimpleDB from a European EC2 instance. Let me take that hook to talk about Amazon Web Services and the EU.

That’s right, today’s Wtf is not technical, at least not on the surface. It’s about the simple fact that we Europeans are stuck waiting for new services and new features until Amazon feels like they’re ready for us to use after US customers gave them a thorough beating.

Wouldn’t it be nice to be able to use EU-based EC2 instances together with SimpleDB without having to pay traffic on both ends, because SimpleDB is only reachable on the cheap from within the US network of EC2? It sure would, and it finally is now, more than two years after the launch of SimpleDB.

Wouldn’t it be awesome if we could utilize Elastic Load Balancing or the awesome CloudWatch? Oh yes, it would. But we’re sort of left in an infinite loop, without estimates on when new features will launch on the old continent. Can we do that now? Check!

But in general all we can do is wait and hope for the best, and maybe a couple of months later, we’re served as well. Every time Amazon announces a new feature for one of their services, we’re sure to be left out. Maybe next time, instead of beta-testing new features in the US, why not let us Europeans have a go at it?

It’s been bugging me for a while now, and while I understand that there’s technical things to consider when new features are introduced, it’s still quite frustrating not being able to use them, just because we prefer using Amazon’s Web Services over here, on the old continent.

If you, Amazon, could do us one favor, it would be to not wait months or even years to introduce new features on our end of the planet. We’d very much appreciate that. Also, maybe you could try to give us a ballpark on when you’re planning on releasing new features in the EU. I’m not talking about an enterprisey roadmap, just a rough figure will do.

Sincerely, some of your (still happy) European customers.

August 21, 2009 at 2:31pm

Beware Of EventMachine Periodic Timers

Sounds dramatic, right? I’m not saying you should stay away from them. There’s just one thing you need to know about them when you’re using EventMachine to spread out work across multiple threads.

EventMachine uses the concept of a reactor thread to handle distributing the work across a pool of worker threads. You’d normally use something like defer to tell EventMachine to take a thread from the pool and have it run that block.

When you declare a periodic timer, you also hand it over a block:

EM.add_periodic_timer(1) do
  # some longer running task
end

But here’s the kicker: when that block is executed, EventMachine will call it in the reactor thread, even when you set up your periodic timer in a block that was called from defer, and even though you’d somehow expect EM to just do it.

Now, say you have multiple timers running, and one of them executes a longer running task. That task will block all the other timers from timing. Unless of course you’re using something like NeverBlock. If you rely on your timers being fired, that’s not great.

The solution is thankfully rather simple, you just have to defer again inside your periodic timer’s block. It’s not great, but it works. That way, the block will again be distributed to one of the threads from the pool.

EM.add_periodic_timer(1) do
  EM.defer do
    # some longer running task
  end
end

For me that really doesn’t make any sense, and it’s not documented anywhere. Many props to Lourens for helping me find the real problem. I know it’s not exactly related to AWS, but it suddenly becomes an issue when those long-running tasks are interactions with Amazon’s Web Services, especially when you have some calls that might take minutes, which happens occasionally.
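
To see why one slow block hurts every other timer, here is a toy model in plain Ruby. It is not EventMachine and doesn’t use its API: simulate_reactor, the timer names and the cost values are made up, and the model assumes a timer is rescheduled only after its block has returned. It plays one reactor thread against a simulated clock and counts how often each timer fires.

```ruby
# Toy model: one reactor thread, simulated clock, no real EventMachine involved.
# Each timer has an interval and a cost, the number of seconds its block
# keeps the reactor thread busy. A deferred block costs the reactor nothing.
def simulate_reactor(timers, run_for)
  now = 0
  next_due = timers.transform_values { |timer| timer[:interval] }
  fired = Hash.new(0)

  loop do
    name, due = next_due.min_by { |_, at| at }
    start = [now, due].max # a late timer fires as soon as the reactor is free
    break if start > run_for

    fired[name] += 1
    now = start + timers[name][:cost] # the reactor is stuck until the block returns
    next_due[name] = now + timers[name][:interval]
  end

  fired
end

# The slow timer runs its five-second task directly in the reactor thread.
blocking = simulate_reactor(
  { heartbeat: { interval: 1, cost: 0 }, slow: { interval: 1, cost: 5 } }, 12
)

# The slow timer hands its task to the thread pool, so the reactor stays free.
deferred = simulate_reactor(
  { heartbeat: { interval: 1, cost: 0 }, slow: { interval: 1, cost: 0 } }, 12
)

puts "heartbeat fired #{blocking[:heartbeat]} times with the blocking task"
puts "heartbeat fired #{deferred[:heartbeat]} times with the deferred task"
```

In the blocking run the heartbeat misses most of its twelve slots, because it can only fire in the gaps between slow tasks. In the deferred run it fires every second.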

@roidrage

August 17, 2009 at 1:00pm

How To Work Around SimpleDB Limitations

SimpleDB’s feature set is a testament to its name. It’s simple, and comes with rather strict limitations. The biggest limit that is likely to affect you when you’re storing arbitrary texts is the 1024 bytes you can store in one attribute. There’s a couple of ways to work around that, each coming with its own problems, or imposing new limitations on your implementation.

  • Chunk your data horizontally

    Obviously your first choice is to split up text attributes into several SimpleDB attributes, each containing a maximum of 1024 bytes. You could do that by adding a numbered suffix to the attributes. That way you can reassemble them when you refetch the data from SimpleDB. The data you get back from it is usually unordered, so you can’t rely on the chunks being returned in the order you put them in.

    There’s an obvious downside to this. The limit of attributes per record in SimpleDB is 256. Multiplied by 1024 that adds up to 262144 bytes you can store in one record. Each domain can carry up to a maximum of 10 gigabytes of data. Should you max out every record to this maximum you can only have 40960 records in one domain. Not great.

  • Chunk your data vertically

    I’m talking about vertically in the sense of SimpleDB’s records, to chunk data across several records. But there’s two kickers here. First, you’ll have an even smaller number of records at your disposal. Second, each put on a record to SimpleDB can store exactly one record. So you need to store them independently. Since SimpleDB is eventually consistent there’s no guarantee you’ll be able to fetch both records on a subsequent select.

    Don’t get me started on chunking across several domains, that’s just nasty.

  • Use S3

    I think you’ll agree that the above options, well, they suck. You’ll hit other limitations either way, and pretty fast depending on the frequency you’re collecting your data in.

    So the most obvious thing to do is put the bigger data on S3. Store them with a nice key that’s derived from your record’s identity and type. You don’t need to partition anything, you’ll just put the file, and you’re done.

    There’s a similar problem to the solution above: You’ll have two separate operations that are eventually consistent on their own. When you’ve created the SimpleDB record there’s no guarantee that the data on S3 will already be available. I guess in general that’s something you can live with though, and it seems to be the most viable option.

    Thanks to the nature of eventual consistency S3 is only a viable option if you have non-streaming data to store. If your data arrives in chunks, you’ll run into synchronization problems pretty quickly. If the data comes in asynchronously there’s a good chance that one process will overwrite changes from another.
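
For the horizontal option, the chunking and reassembly could look roughly like the following Ruby sketch. It doesn’t talk to SimpleDB at all, it only prepares and restores the attribute hash. The method names and the zero-padded suffix format are made up for illustration; the limits are the ones quoted above.

```ruby
MAX_ATTRIBUTE_BYTES = 1024
MAX_ATTRIBUTES_PER_ITEM = 256
MAX_DOMAIN_BYTES = 10 * 1024**3

# Splits text into numbered attributes of at most `limit` bytes each.
# Splitting on characters keeps multi-byte UTF-8 characters intact.
def chunk_attribute(name, text, limit = MAX_ATTRIBUTE_BYTES)
  chunks = ["".dup]
  text.each_char do |char|
    chunks << "".dup if chunks.last.bytesize + char.bytesize > limit
    chunks.last << char
  end
  raise ArgumentError, "too many chunks for one record" if chunks.size > MAX_ATTRIBUTES_PER_ITEM

  chunks.each_with_index.map { |chunk, index| [format("%s_%03d", name, index), chunk] }.to_h
end

# Restores the text, sorting by suffix because the attributes
# may come back in any order.
def reassemble_attribute(name, attributes)
  attributes
    .select { |key, _| key.match?(/\A#{Regexp.escape(name)}_\d+\z/) }
    .sort_by { |key, _| key[/\d+\z/].to_i }
    .map { |_, value| value }
    .join
end

# The capacity arithmetic from the paragraph above.
bytes_per_record = MAX_ATTRIBUTES_PER_ITEM * MAX_ATTRIBUTE_BYTES # 262144
records_per_domain = MAX_DOMAIN_BYTES / bytes_per_record         # 40960

attributes = chunk_attribute("body", "x" * 2500)
shuffled = attributes.to_a.shuffle.to_h
puts reassemble_attribute("body", shuffled) == "x" * 2500
puts "#{bytes_per_record} bytes per record, #{records_per_domain} maxed-out records per domain"
```

Sorting by the numeric suffix is what makes the unordered results harmless. The sketch does nothing about the consistency problems of the vertical and S3 options.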

In general I’m not entirely happy with all the limitations SimpleDB imposes, I just scratched the surface here on the ones we’re currently struggling with. Sure, it’s close to being always on, available everywhere and pretty fast, but still, it leaves a bit of a foul taste with me.

Also, if you’re using SimpleDB from an EC2 EU instance, you should be aware that you’re not getting the traffic for free. You pay for both the traffic from and to the EC2 instance, and the SimpleDB traffic. It’s not great.

@roidrage

August 13, 2009 at 9:33pm

Cloud Links

Lots of great material out there dealing with Amazon Web Services, let’s spread some link love.

And, if the cloud feels like something ungraspable to you, have a look at some Polaroids of clouds, maybe that helps.
