My Gripes with SimpleDB

2009 January 26
by Matt

I’ll start off by saying that I am admiration of what Amazon have done in providing a well thought out series of web services. Their dedication to the cause is second to none, and it does not go unnoticed. I currently use them for nearly all internal projects, and highly recommend our clients to do the same.

The ability to instantly deploy additional resources on Amazons EC2, and fire and forget distributed storage through S3 on such magnitude has left me in awe. And then theres SimpleDB.

SimpleDB in itself could be a magnificent platform - the potential of a supra-linear horizontally scalable database is something that is likely to excite anyone familiar with the headaches of terabyte datasets. Unfortunately, this is where SimpleDB is not quite the silver bullet we were all excited about.

The issues with a traditional RDBMS, as mentioned in my previous article on map/reduce are that you eventually need to start sharding out your data. As I discussed, you can then perform parallel queries against the sharded datasets in order to retrieve the required data.

Behind the scenes, this is how SimpleDB is expected to work - which is great - so wheres the problem?

The problem is that with larger datasets on SimpleDB - and this is the killer - you need to partition (read shard) it amongst so-called domains/buckets/databases (call them what you will).  This requires you to, as Amazon puts it, “aggregate result sets in the application layer”. Each “domain” is limited to a 10GB partition, with a maximum of 100 domains per account (I dare say this is not a hard limit). The number of records itself is at a meagre 250 million (which is rather humorous when you consider that you cant actually store that many EMPTY records within the size allowance).

So, as a real world example of scalability, would SimpleDB really make a good replacement for twittersdata-store? Lets bear in mind, of course, that any datastore aside from a persistant message queue is only going to be used for actual offline storage of the data.

Lets put it this way, given a lightweight storage for each tweet, including overhead for storage (the dynamic indexing costs are surprisingly expensive) in simpleDB we would be looking at around 750B/tweet. At 1.8 million tweets per day, thats 1.28GB/day.

Requiring partitioning every week, they could go for 2 years provided no increase in volume, and no accounting for other storage of non-tweet data. They would also need to increase their retrieval costs weekly, as they are increasing the number of “domains” they need to search in parallel, and they also need to beef up their appservers every week to account for aggregating an ever increasing amount of results in the application layer. The aggregation costs alone, provided no increase in usage would be 50 times higher within 1 year.

Fair enough in certain respects - twice the data, twice the storage and retrieval costs. But aggregating in the application layer means unnecessary overhead on the app servers - a non-linear and unwarranted expense that should be overcome at the data-store.

Amazon - pull your finger out, and stop partitioning simpleDB datasets!

note: all twitter figures provided here are for example and from my own external calculations based on references given and from a very lightweight datastructure perspective. I have not covered cost issues of twitter relationships, which in themselves would cause substantial increase over the figures shown here.

One Response leave one →
  1. 2010 February 13

    I am completely impressed with the article I have just read. I wish the writer of wordflows.com can continue to provide so much practical information and unforgettable experience to wordflows.com readers. There is not much to say except the following universal truth: The motto for all people in life should be kill the slow, castrate the stupid I will be back.

Leave a Reply

Note: You can use basic XHTML in your comments. Your email address will never be published.

Subscribe to this comment feed via RSS