How To Work Around SimpleDB Limitations
SimpleDB’s feature set is a testament to its name. It’s simple, and it comes with rather strict limitations. The limit most likely to affect you when you’re storing arbitrary text is the 1024 bytes you can store in one attribute. There are a couple of ways to work around that, each coming with its own problems or imposing new limitations on your implementation.
-
Chunk your data horizontally
Obviously your first choice is to split up text attributes into several SimpleDB attributes, each containing a maximum of 1024 bytes. You could do that by adding a numbered suffix to the attribute names. That way you can reassemble them when you fetch the data from SimpleDB again. The data you get back is usually unordered, so you can’t rely on the chunks being returned in the order you put them in.
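Here’s a minimal sketch of that in Python. It doesn’t talk to SimpleDB at all, it only shows the splitting and reassembling. The function names and the `name_000` suffix scheme are my own assumptions for illustration; pick whatever naming fits your data. It splits on character boundaries so that each chunk stays within 1024 bytes when encoded as UTF-8.

```python
MAX_ATTR_BYTES = 1024  # per-attribute limit discussed above


def chunk_text(name, text, limit=MAX_ATTR_BYTES):
    """Split text into attributes name_000, name_001, ... of at most `limit` UTF-8 bytes."""
    chunks, current, size = [], [], 0
    for ch in text:
        ch_len = len(ch.encode("utf-8"))
        # Start a new chunk when the next character would exceed the limit.
        if current and size + ch_len > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_len
    if current:
        chunks.append("".join(current))
    return {f"{name}_{i:03d}": chunk for i, chunk in enumerate(chunks)}


def reassemble(name, attributes):
    """Rebuild the text from unordered attributes by sorting on the numeric suffix."""
    prefix = name + "_"
    parts = []
    for key, value in attributes.items():
        suffix = key[len(prefix):]
        # Ignore unrelated attributes such as "body_type".
        if key.startswith(prefix) and suffix.isdigit():
            parts.append((int(suffix), value))
    return "".join(value for _, value in sorted(parts))
```

Because `reassemble` sorts on the suffix, it doesn’t matter in which order the attributes come back.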
There’s an obvious downside to this. The limit of attributes per record in SimpleDB is 256. Multiplied by 1024 bytes, that adds up to 262144 bytes you can store in one record. Each domain can hold a maximum of 10 gigabytes of data. Should you fill every record to that limit, you can only have 40960 records in one domain. Not great.
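If you want to check the numbers yourself, the arithmetic is quick to reproduce. The only assumption here is that the 10 gigabytes are binary gigabytes (1024³ bytes each), which is what makes the figures above come out exactly.

```python
MAX_ATTR_BYTES = 1024
MAX_ATTRS_PER_RECORD = 256
DOMAIN_BYTES = 10 * 1024 ** 3  # 10 gigabytes, assumed to be binary gigabytes

bytes_per_record = MAX_ATTR_BYTES * MAX_ATTRS_PER_RECORD
max_full_records = DOMAIN_BYTES // bytes_per_record

print(bytes_per_record)  # 262144
print(max_full_records)  # 40960
```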
-
Chunk your data vertically
By vertically I mean in the sense of SimpleDB’s records: chunking data across several records. But there are two kickers here. First, you’ll have an even smaller number of logical records at your disposal. Second, each put to SimpleDB stores exactly one record, so you need to store the chunks independently. Since SimpleDB is eventually consistent, there’s no guarantee you’ll be able to fetch all of them on a subsequent select.
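One way to at least notice that a select came back incomplete is to have every chunk record carry its index and the total number of parts. The sketch below is plain Python with dictionaries standing in for records; the attribute names (`parent`, `part`, `parts`, `data`) and the `id-n` record naming are assumptions I made up for illustration. To keep it short it uses a single data attribute per record and counts characters rather than bytes.

```python
def split_into_records(item_id, text, chunk_chars=1000):
    """Spread text over several records; each notes its index and the total count."""
    pieces = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)] or [""]
    total = len(pieces)
    return {
        f"{item_id}-{n}": {
            "parent": item_id,
            "part": str(n),       # attribute values are strings
            "parts": str(total),
            "data": piece,
        }
        for n, piece in enumerate(pieces)
    }


def join_records(records):
    """Return the text, or None if not every part has shown up yet."""
    if not records:
        return None
    found = {int(r["part"]): r["data"] for r in records.values()}
    total = int(next(iter(records.values()))["parts"])
    if set(found) != set(range(total)):
        return None  # at least one part is missing; try again later
    return "".join(found[n] for n in range(total))
```

This doesn’t make the writes atomic, it only lets the reader tell a complete result from a partial one.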
Don’t get me started on chunking across several domains, that’s just nasty.
-
Use S3
I think you’ll agree that the above options, well, they suck. You’ll hit other limitations either way, and pretty fast, depending on how frequently you’re collecting your data.
So the most obvious thing to do is to put the bigger data on S3. Store it under a nice key that’s derived from your record’s identity and type. You don’t need to partition anything, you just put the file, and you’re done.
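What a “nice key” looks like is up to you. As a rough sketch, assuming a `type/id/attribute` layout (my own convention, nothing S3 or SimpleDB prescribes), you could derive it like this. Each part is percent-encoded so a slash inside an identifier can’t be mistaken for a separator.

```python
from urllib.parse import quote


def s3_key_for(record_type, record_id, attribute):
    """Build an S3 key from the record's type, its identity and the attribute name."""
    parts = (record_type, record_id, attribute)
    # safe="" also encodes "/", so the only slashes in the key are our separators.
    return "/".join(quote(str(part), safe="") for part in parts)


print(s3_key_for("article", 42, "body"))        # article/42/body
print(s3_key_for("log entry", "a/b", "body"))   # log%20entry/a%2Fb/body
```

The same function gives you the key again when you read the SimpleDB record later, so you don’t have to store the key itself.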
There’s a problem similar to the one with the solution above: you’ll have two separate operations, each eventually consistent on its own. Once you’ve created the SimpleDB record, there’s no guarantee that the data on S3 is already available. I guess in general that’s something you can live with though, and it seems to be the most viable option.
Thanks to the nature of eventual consistency, S3 is only a viable option if you have non-streaming data to store. If your data arrives in chunks, you’ll run into synchronization problems pretty quickly. If the data comes in asynchronously, there’s a good chance that one process will overwrite changes from another.
In general I’m not entirely happy with all the limitations SimpleDB imposes, and I’ve only scratched the surface here with the ones we’re currently struggling with. Sure, it’s close to being always on, available everywhere and pretty fast, but still, it leaves a bit of a foul taste with me.
Also, if you’re using SimpleDB from an EC2 EU instance, you should be aware that you’re not getting the traffic for free. You pay both for the traffic to and from the EC2 instance and for the SimpleDB traffic. It’s not great.
