Why I don’t use AppEngine

2009 January 27
by Matt

I was recently asked why I don’t use Google AppEngine for some of the larger projects I am working on.

Quite simply, I would love to. But currently there are a few issues:

1) There is no SLA whatsoever. I could wake up one morning, and all of my content could be gone. I cannot even efficiently make backups of the data.

2) There is no guaranteed availability of expanding quotas. The projects that I would like to deploy on AppEngine would use the storage capacity alone in less than 2 days. I’m happy to pay for that, but the service is not yet available.

3) The inability to support wildcard and top level domain names. This is a major issue for the kind of platforms we build. We’re looking at overcoming this by simply treating AppEngine as a datastore.

4) Lack of coherency between the Google Apps accounts and Google accounts has meant that my mobile number is already in use on a deleted account, and I cannot register a new AppEngine profile. There is NO support whatsoever to address this, NO way to contact Google about it, and their SMS feedback form gives nothing in the way of “feedback” (aside from the cryptic message “comment” if you fill in too many characters for the comments section).

5) No support, No support, No support.

These are my own thoughts about the AppEngine platform. There are others that people have mentioned, such as the inability to upload files over 1MB in size - but I would say these are issues easily overcome when you treat AppEngine as an application layer in front of (what appears to be) a kickass datastore, instead of a complete web production environment. Those familiar with CouchDB will be aware of the ease of integration that these platforms can provide (although I dare say BigTable is without the scalability issues that plague the current releases of CouchDB).

That is exactly how we intend to leverage it, having been let down by the limitations of SimpleDB, and with our currently unstable internal datastore not yet holding up to the demand of our soon-to-be-released platform.

Here’s hoping that my (multiple) “SMS feedback” submissions actually went somewhere - no “thanks for your feedback” or anything…

If you’re listening Google, please let me activate my account! I promise I’m not a spam-bot :D

My Gripes with SimpleDB

2009 January 26
by Matt

I’ll start off by saying that I am in admiration of what Amazon have done in providing a well thought out series of web services. Their dedication to the cause is second to none, and it does not go unnoticed. I currently use them for nearly all internal projects, and highly recommend that our clients do the same.

The ability to instantly deploy additional resources on Amazon’s EC2, and fire-and-forget distributed storage through S3 on such a magnitude, has left me in awe. And then there’s SimpleDB.

SimpleDB in itself could be a magnificent platform - the potential of a supra-linear horizontally scalable database is something that is likely to excite anyone familiar with the headaches of terabyte datasets. Unfortunately, this is where SimpleDB is not quite the silver bullet we were all excited about.

The issue with a traditional RDBMS, as mentioned in my previous article on map/reduce, is that you eventually need to start sharding out your data. As I discussed, you can then perform parallel queries against the sharded datasets in order to retrieve the required data.

Behind the scenes, this is how SimpleDB is expected to work - which is great - so where’s the problem?

The problem is that with larger datasets on SimpleDB - and this is the killer - you need to partition (read: shard) the data amongst so-called domains/buckets/databases (call them what you will). This requires you to, as Amazon puts it, “aggregate result sets in the application layer”. Each “domain” is limited to a 10GB partition, with a maximum of 100 domains per account (I dare say this is not a hard limit). The number of records itself is limited to a meagre 250 million (which is rather humorous when you consider that you can’t actually store that many EMPTY records within the size allowance).
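To make “aggregate result sets in the application layer” a little more concrete, here is a minimal Python sketch. It does not use the real SimpleDB API: the domains are plain in-memory lists, and the domain names and the query_domain function are hypothetical stand-ins for a per-domain query. It only shows where the combining work ends up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three SimpleDB domains (not the real API):
# each "domain" holds its own partition of the records.
DOMAINS = {
    "tweets_001": [{"user": "alice"}, {"user": "bob"}],
    "tweets_002": [{"user": "alice"}, {"user": "carol"}],
    "tweets_003": [{"user": "alice"}, {"user": "bob"}],
}


def query_domain(name):
    """Pretend per-domain query: count records per user in one partition."""
    return Counter(record["user"] for record in DOMAINS[name])


def count_by_user(domain_names):
    """Query every domain in parallel, then combine the partial results.

    The combining loop is the work that lands on the application server,
    and it grows with the number of domains.
    """
    with ThreadPoolExecutor(max_workers=len(domain_names)) as pool:
        partials = list(pool.map(query_domain, domain_names))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


totals = count_by_user(sorted(DOMAINS))
print(dict(totals))
```

Every extra domain adds one more query and one more partial result to fold in, which is exactly the overhead that would be better handled inside the datastore.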

So, as a real world example of scalability, would SimpleDB really make a good replacement for Twitter’s data-store? Let’s bear in mind, of course, that any datastore aside from a persistent message queue is only going to be used for actual offline storage of the data.

Let’s put it this way: given lightweight storage for each tweet, including overhead for storage (the dynamic indexing costs are surprisingly expensive), in SimpleDB we would be looking at around 750B/tweet. At 1.8 million tweets per day, that’s roughly 1.3GB/day.

Requiring partitioning every week, they could go for 2 years provided there is no increase in volume, and not accounting for other storage of non-tweet data. They would also need to increase their retrieval costs weekly, as they are increasing the number of “domains” they need to search in parallel, and they also need to beef up their appservers every week to account for aggregating an ever increasing amount of results in the application layer. The aggregation costs alone, provided no increase in usage, would be 50 times higher within 1 year.
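As a rough check on that arithmetic, here is a small Python sketch. The per-tweet size, tweet volume and SimpleDB limits are the figures quoted in this post; treating a gigabyte as 10^9 bytes is my own assumption, and under it the daily volume works out to about 1.35GB.

```python
import math

BYTES_PER_TWEET = 750          # figure from the post, including index overhead
TWEETS_PER_DAY = 1_800_000     # figure from the post
DOMAIN_BYTES = 10 * 10**9      # 10GB per domain, assuming decimal gigabytes
MAX_DOMAINS = 100              # per-account limit quoted in the post

bytes_per_day = BYTES_PER_TWEET * TWEETS_PER_DAY
days_per_domain = DOMAIN_BYTES / bytes_per_day
years_until_full = MAX_DOMAINS * days_per_domain / 365
domains_after_one_year = math.ceil(365 * bytes_per_day / DOMAIN_BYTES)

print(f"{bytes_per_day / 10**9:.2f} GB/day")
print(f"{days_per_domain:.1f} days to fill one domain")
print(f"{years_until_full:.1f} years to fill {MAX_DOMAINS} domains")
print(f"{domains_after_one_year} domains in use after one year")
```

Under these assumptions a domain fills in roughly a week, the account limit is reached in roughly two years, and about 50 domains are being queried in parallel after the first year.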

Fair enough in certain respects - twice the data, twice the storage and retrieval costs. But aggregating in the application layer means unnecessary overhead on the app servers - a non-linear and unwarranted expense that should be overcome at the data-store.

Amazon - pull your finger out, and stop partitioning SimpleDB datasets!

note: all twitter figures provided here are for example and from my own external calculations based on references given and from a very lightweight datastructure perspective. I have not covered cost issues of twitter relationships, which in themselves would cause substantial increase over the figures shown here.

Twitter OAuth in Beta - And what they SHOULD do.

2009 January 25
by Matt

Twitter have finally made a move against the issues with third party services using users’ primary account details for integration with people’s private feeds. Given the recent phishing attacks that plagued the twittersphere, people have realised the issues with trust over their private credentials, and the fear of extending a limited application with third party Twitter apps.

I say it’s not a moment too soon - this is something that Twitter themselves should have implemented as soon as it became apparent that a lot of interest was being generated in services that interface with people’s private feeds. Whether this be a lifestreaming platform like friendfeed, or a simple desktop client like twhirl - any service that requires you to enter your username and password has complete access to your account, and can masquerade as you, thereby exposing those who follow you to the same issues.

Twitter have decided to tackle this by using OAuth, which in itself is probably the best way of implementing this level of security. Let’s just hope they don’t make the mistake of not implementing an area in your control panel where you can revoke issued credentials - a key failure I’ve seen in a lot of half-arsed attempts at implementing the OAuth protocol (you read it here first ;)).

Twitter will need to revoke the current API’s authentication, otherwise some of the more “conservative” developers will keep using the old style authentication. People obviously don’t think twice when giving out their usernames and passwords, and that will continue. They need to implement blanket coverage, or the main issues with viral phishing have still not been addressed. In implementing this blanket change, all existing twitter apps will need to be redeveloped to account for the changes, implementing the somewhat more complex OAuth protocol in order to integrate with the twitter API.

Now, while I’m all for implementing OAuth as the primary method of integrating with the Twitter API, I would also look at providing a vastly simplified version, without the need for a lot of chatter between Twitter and your client.

A simple way of doing this would be to have an area in your Twitter account where you could generate subsidiary usernames and passwords for your account. For example, you could create the username “twhirl@twitteruser” with its own unique password that you simply use in place of your existing username and password when using the Twitter API. You could assign permissions for that service, allowing it to perform only a certain subset of the full API. Best of all - it would be fully compatible with the existing API, allowing immediate rollout and implementation on all existing Twitter services. Friendfeed offer a similar implementation they call “Remote Keys”.
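To illustrate the idea, here is a minimal Python sketch of subsidiary credentials. Everything in it is hypothetical: the Account and SubCredential classes, the permission names and the hashing parameters are my own illustration, not anything Twitter or Friendfeed actually provide.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCredential:
    """A hypothetical per-application login tied to one master account."""
    username: str            # e.g. "twhirl@twitteruser"
    salt: bytes
    password_hash: bytes
    permissions: frozenset   # subset of API actions this login may perform


def _hash(password, salt):
    # Store only a salted hash of the generated password.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


class Account:
    def __init__(self, name):
        self.name = name
        self._subs = {}

    def create_sub_credential(self, app, permissions):
        """Generate a login for one app; the password is returned only once."""
        password = secrets.token_urlsafe(16)
        salt = secrets.token_bytes(16)
        username = f"{app}@{self.name}"
        self._subs[username] = SubCredential(
            username, salt, _hash(password, salt), frozenset(permissions))
        return username, password

    def revoke(self, username):
        """Remove one app's login without touching the master password."""
        self._subs.pop(username, None)

    def is_allowed(self, username, password, action):
        """Check the password, then check the action against the permissions."""
        cred = self._subs.get(username)
        if cred is None:
            return False
        if not hmac.compare_digest(cred.password_hash, _hash(password, cred.salt)):
            return False
        return action in cred.permissions


account = Account("twitteruser")
user, pw = account.create_sub_credential("twhirl", {"read_timeline", "post_update"})
print(user, account.is_allowed(user, pw, "post_update"))
```

A client would send these credentials exactly as it sends a username and password today; the only change is on the server side, where the permission check and the revoke step live.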

In all - I’m glad that Twitter are making the change, and OAuth is certainly the best way for them to go. But why overcomplicate integration with the existing APIs? Keep the old one in place, remove the ability to use your master username/password and allow subsidiary credentials to be created on each account so we can simply replace our Twitter details in the systems we already use. Oh - and how come it’s taken so long to do the obvious?

Google GDrive on the way?

2009 January 25
by Matt

People have long used existing Google apps in order to leverage file-storage. The ability to use Gmail as a filesystem, and Google’s existing “Google Docs” solutions, go part way to implementing this as a service.

The latest buzz appears to be about a google filesystem solution being touted as GDrive.

As an avid user of online backup solutions, specifically our own bespoke solution working on top of Amazon’s S3, this certainly seems like a viable and interesting proposition - but it leaves me wondering why it has taken so long?

Mashable mentioned this way back in 2007, along with the ability to increase your storage allowance on Google, but a dedicated service has not managed to materialize yet. With alternate services, such as third parties leveraging Amazon’s S3 service, maybe Google have opened their eyes to the restrictions their current Google Docs imposes, and are looking to expand upon this.

Google already provides indexing solutions for your documents, through both Google Desktop and the Google Search Appliance - it seems like a natural shift to be able to provide storage within their cloud to utilize their existing search platform.

The existing Gmail service is currently offering 7288MB of email storage - much of which will never be touched by the vast majority of mail users. If this limit is truly available to each Gmail user (as opposed to simply hyping up their true capacity), it would make sense that they currently have considerably more storage resources than they require. Maybe this could be a viable usage of Google’s current overstocking of storage?

Google, as ever, are currently keeping silent about their plans - and the GDrive is currently nothing more than a rumour. But given the current size of mailbox allowances being far beyond the worldly requirements of most, it certainly seems like the direction they should be looking towards.

UPDATE: As mentioned by HandyBiteSize, hidden mentions of the possible GDrive appear in Google Apps CSS files.

The future of the Twitter API

2009 January 21
by Matt

As mentioned at Mashable, Twitter have imposed a rate limit for their “whitelisted” users’ API calls. These are generally used by the kind of services that integrate with the public timeline, and/or provide unauthenticated calls upon the standard Twitter API.

A lot of smaller external services use these to communicate with Twitter to receive updates from your feed to display within their own environment. Larger, or rather, more established services with potential tie-ins to the big-G have seemingly already garnered priority feeds through Twitter’s firehose API.

As I previously tweeted, it certainly feels like Twitter are going to be generating revenue through the use of their “firehose” API, a currently closed service that is rumoured to be in use by some of the bigger players in the social media “extranet”, such as friendfeed. Could this spell the end of other minor Twitter services - leaving the market for implementing people’s own private data in the realms of those with the cash to spare?

Whilst your data is knowingly made public by yourself as a user, what are your thoughts on Twitter becoming a walled garden, with gates for those with the cash to spare?

If it is simply a case of expense, my proposal would be to implement a different form of public API - or better yet, generate some revenue through methods that don’t involve separating the technical userbase into the haves and have-nots.

I for one have no problem with Twitter implementing advertisements - if they are targeted effectively and/or relevant to what I’m tweeting about. Anything to keep an otherwise great service operating the way it has, and not heading in the direction it seems to be going in.

Twitter have already stated that they reserve the right to sell your information - maybe this is a step in that direction…

Pay-for videos at YouTube?

2009 January 20
by Matt

It is well known that YouTube are losing money hand over fist - with a large portion of their losses due to royalty payments to the music labels.

It certainly seems that with the recent changes on YouTube - namely Muting of Videos and the advent of Official Downloads - they have something up their sleeves.

Maybe Google, unsatisfied with the losses they are still making after implementing video adwords, are looking towards implementing pay-per-view or video downloads for otherwise copyright restricted music videos?

With the only major direct competition in this market coming from Apple’s iTunes store, it could be an interesting motion. It also sounds like the ideal avenue for independent producers being able to sell their wares - something that excites me a lot, and is definitely worthy of watching out for.

Understanding Map/Reduce

2009 January 18
by Matt

I have been meaning to write about this for quite a while. Now that I actually have a blog, it’s the ideal opportunity to simplify the process for those getting to grips with it, clear up any misconceptions, and explain how and why the process is far from new.

Background

What is now known as Map/Reduce has recently generated a lot of interest in deployments requiring the ability to perform queries on large-scale datasets.

Using a “traditional” RDBMS to perform these queries has resulted in issues as the number of records increases - primarily due to the requirements of a monolithic series of linked tables known as a Relational Database.

These problems have been tackled by a process known as sharding, whereby the records in the tables are split among a series of database nodes. Middleware is often introduced to abstract the queries to these nodes, but (at least on commodity RDBMS) the key ability to perform complex joins (as required data may be on another node) has already been lost.
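As a minimal sketch of what sharding means in practice, the Python below routes each record to a node using a stable hash of its key. The node names and record layout are hypothetical, and real middleware would do considerably more (rebalancing, replication and so on).

```python
import zlib

NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]  # hypothetical names


def node_for(record_key):
    """Pick a node from a stable hash of the key, so lookups find it again."""
    return NODES[zlib.crc32(record_key.encode("utf-8")) % len(NODES)]


def shard(records):
    """Split records (dicts with an "id" field) into one list per node."""
    placement = {node: [] for node in NODES}
    for record in records:
        placement[node_for(record["id"])].append(record)
    return placement


records = [{"id": f"receipt-{n}", "amount": n * 10} for n in range(8)]
placement = shard(records)

for node, rows in placement.items():
    print(node, [row["id"] for row in rows])
```

Two records can only be joined locally if their keys happen to land on the same node, which is why the complex joins are the first thing to go.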

At its most basic, Map/Reduce can be compared to a sharded RDBMS. In fact, an implementation of Map/Reduce may actually USE a sharded RDBMS by behaving as the middleware.

Map/Reduce simplified

A simple way of explaining a map/reduce function is with a real world example. I’ll try to illustrate this as best I can.

Map/reduce is usually demonstrated as a distributed group and count, but it could be any function performed in a distributed manner. Here we will do a find and sort, as it’s a nice easy introduction.

You have a black bin liner stuffed full of receipts which you need to group and order for your tax return. You need to look through the receipts and discard all those you are not likely to be able to file, and then order them according to date.

This bin liner contains 1000 records, and it will take you 30 seconds to decide whether or not each receipt needs to be discarded.

When you are sorting the receipts into date order, it will gradually become more difficult, as you will need to thumb through the list of already sorted receipts to find the place to insert it. This will take an increasing amount of time to do, as you cannot hold the position of each date in memory - an average of 0.1 second for each receipt in the already sorted pile.

This would take you:

30,000s = 1000 * 30s for discarding
49,950s = 999(1000)/2 * 0.1s for sorting

That’s just over 22 hours.

Now, if you could rope in 4 friends to help you out, and assuming they all work at the same speed, you could act as an overseer - simply sharing the work among them, and collating the receipts when they are finished.

It will take each of your friends:

7,500s = 250 * 30s for discarding
3,112.5s = 249(250)/2 * 0.1s for sorting

Just under 3 hours each.

When it comes to combining all of these sorted piles, it is quite straightforward. As you only need to look through 4 receipts at any one time, it will take you 0.4s per receipt as you simply grab the next value from the top of the relevant pile.

So you can merge the piles in:

400s = 1000 * 4 * 0.1s  for sorting from the pre-sorted piles.

That’s only 7 minutes of sorting you need to do after they have finished the job - and what would take you 22 hours alone is finished in about 3 hours!

In terms of total time spent - if you were to value your own time at £50 p/h, you could still pay your friends that rate and save over £500 :)
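The sums above can be reproduced with a few lines of Python. The timings and the hourly rate are the ones used in this example; the function names are just my own labels for each step.

```python
DISCARD_SECONDS = 30.0   # time to decide on one receipt
SEEK_SECONDS = 0.1       # time per receipt already in the sorted pile
HOURLY_RATE = 50.0       # pounds per hour, as in the example


def pile_seconds(receipts):
    """Time for one person to filter and insertion-sort a pile."""
    discarding = receipts * DISCARD_SECONDS
    sorting = receipts * (receipts - 1) / 2 * SEEK_SECONDS
    return discarding + sorting


def merge_seconds(receipts, piles):
    """Time to merge pre-sorted piles: check the top of every pile per receipt."""
    return receipts * piles * SEEK_SECONDS


def cost(seconds):
    """Convert seconds of work into pounds at the hourly rate."""
    return seconds / 3600 * HOURLY_RATE


alone = pile_seconds(1000)
per_friend = pile_seconds(250)
merge = merge_seconds(1000, 4)

elapsed_with_friends = per_friend + merge       # friends work in parallel
paid_with_friends = 4 * per_friend + merge      # but every hour is paid for
saving = cost(alone) - cost(paid_with_friends)

print(f"alone: {alone / 3600:.1f} h")
print(f"with 4 friends: {elapsed_with_friends / 3600:.1f} h elapsed")
print(f"saving: GBP {saving:.2f}")
```

The saving comes from the sorting term: it grows with the square of the pile size, so four piles of 250 cost far less to sort than one pile of 1000, even after paying for the merge.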

An important point to note is that simply increasing the number of nodes does not necessarily mean an increase in overall performance on small datasets. The larger the dataset, the greater the relative speedup will be. To save us the most cash, we would want to employ 19 friends to do the task for us:

 

[Figure: receipt sorting performance]

 

Technical summary

As we have demonstrated in our real world example, Map/Reduce is nothing new. It’s done by businesses on a daily basis in order to save them time and money in simple staffing costs.

Obviously, certain points I have made may not seem valid in a systems environment. “Seek” times will be considerably less (the figures here are for illustration!), and in-memory pointers will alleviate the sorting issue - or will they?

When we are dealing with small amounts of data, the sorting of records could indeed be done in memory. The problem comes when you are dealing with gigabytes or even terabytes of the stuff. The beauty of a secondary (or higher) reduce/merge is that it can be streamed. In our example, it would be the supervisor only needing to have 4 receipts in memory - the top of the sorted stacks. Additionally, if you are accessing ordered data on the mapping nodes, you can stream from there as well!
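A streamed merge is easy to sketch in Python with the standard library’s heapq.merge, which pulls from each sorted input lazily. The dates below are made up, and read_pile is a hypothetical stand-in for reading a sorted pile from disk or from a mapping node.

```python
import heapq


def read_pile(dates):
    """Yield one receipt date at a time, as if streaming from disk or a node."""
    for date in dates:
        yield date


# Four piles, each already sorted by a friend (ISO dates sort as strings).
piles = [
    ["2008-01-04", "2008-03-10", "2008-09-01"],
    ["2008-01-15", "2008-06-30"],
    ["2008-02-02", "2008-02-20", "2008-12-24"],
    ["2008-05-05"],
]

# heapq.merge is lazy: it keeps one pending item per pile, never the whole set.
merged = heapq.merge(*(read_pile(pile) for pile in piles))

first_three = [next(merged) for _ in range(3)]
rest = list(merged)

print(first_three)
print(rest)
```

The supervisor here never holds more than the top of each pile, however large the piles themselves become.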

A map is simply a function that will be run on each node and spit out matching or relevant data (return valid receipts). A reduce is an operation on the mapped data (sort receipts). Secondary and higher reduces are generally called merges (sort of sorted receipt piles).

As you can see, in a sharded RDBMS environment, the map could simply be a “select all from receipts where isvalid=true” with a built-in reduce of “order by date desc” run on each of the shards, with a merge occurring on the proxy/middleware. In operations like this, both can be as efficient as each other.
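Here is a minimal sketch of that arrangement in Python, using two in-memory SQLite databases as stand-ins for the shards. The table layout and sample rows are assumptions for illustration; the query is the map and reduce described above, and the merge is done by the “middleware” with heapq.merge.

```python
import heapq
import sqlite3

# Map (WHERE isvalid) and reduce (ORDER BY) run inside each shard.
MAP_REDUCE_SQL = (
    "SELECT date, amount FROM receipts WHERE isvalid = 1 ORDER BY date DESC"
)


def make_shard(rows):
    """Create one in-memory SQLite database standing in for a shard."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE receipts (date TEXT, amount REAL, isvalid INTEGER)")
    db.executemany("INSERT INTO receipts VALUES (?, ?, ?)", rows)
    return db


shards = [
    make_shard([("2008-03-01", 12.5, 1), ("2008-01-10", 3.0, 0), ("2008-07-19", 40.0, 1)]),
    make_shard([("2008-05-05", 8.0, 1), ("2008-02-14", 20.0, 1)]),
]

# Merge: the middleware interleaves the already-sorted cursors,
# newest first, without re-sorting anything.
cursors = [db.execute(MAP_REDUCE_SQL) for db in shards]
merged = list(heapq.merge(*cursors, key=lambda row: row[0], reverse=True))

for date, amount in merged:
    print(date, amount)
```

Each shard only ever sees its own rows, and the merge step only compares the head of each cursor.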

Where the RDBMS would fall down is with a more complex map, requiring access to data not within the shard. As a standard map is just a function run on a distributed node, it could include logic to perform additional lookups on other nodes. This could be facilitated with a directory that tells it where the required data is located. I’m not aware of how this could be done effectively with an RDBMS such as MySQL, without implementing actual programmatic maps.
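A toy version of such a map, sketched in Python, might look like the following. The nodes, the supplier records and the directory are all hypothetical in-memory structures; in a real system the lookup would be a network call to whichever node the directory names.

```python
# Hypothetical cluster: each node holds some receipts and some supplier records.
NODES = {
    "node-a": {
        "receipts": [{"id": 1, "supplier": "s2"}, {"id": 2, "supplier": "s1"}],
        "suppliers": {"s1": {"vat_registered": False}},
    },
    "node-b": {
        "receipts": [{"id": 3, "supplier": "s2"}],
        "suppliers": {"s2": {"vat_registered": True}},
    },
}

# The directory tells a map where a given supplier record lives.
DIRECTORY = {"s1": "node-a", "s2": "node-b"}


def lookup_supplier(supplier_id):
    """Fetch a supplier from whichever node the directory points at."""
    return NODES[DIRECTORY[supplier_id]]["suppliers"][supplier_id]


def map_receipts(node_name):
    """Map run on one node: keep receipts whose supplier is VAT registered."""
    for receipt in NODES[node_name]["receipts"]:
        if lookup_supplier(receipt["supplier"])["vat_registered"]:
            yield receipt["id"]


results = {name: list(map_receipts(name)) for name in NODES}
print(results)
```

Receipt 1 lives on node-a but its supplier lives on node-b, so the map on node-a has to reach across to another node to decide whether to keep it.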

Misconceptions

1. Schemaless

Your data can be as structured as you want it to be. All you are doing is running a function against records. That’s exactly what an RDBMS does anyway. You can run a map on an RDBMS if it makes you feel happier.

2. No indexes

You can locally index your data in whatever way you want, or, if you are dynamically mapping out the data, you can send the indexes over, or simply have an index server. It’s no different to sharding in that respect.

Final Word

I will make amendments to this post over the space of the next day. I plan to insert some diagrams, as well as show how more nodes do not necessarily mean more performance.

Apologies for my poor writing style. It’s the first time I have ever written a blog, and it will take some time getting used to!

A few references you may find interesting:
 
Google Research Publication
Write your first Map/Reduce function in 20 mins
Misconceptions about MapReduce

A New Theme

2009 January 18
by Matt

As I am currently working on building my own publishing platform, I’m not overly fussed about spending much time manipulating the WordPress themes.

Thankfully, I’ve found the rather excellent Vigilance Theme which should assist in making the site look somewhat respectable for the time being!

Thanks to Mashable for the heads up!

Finally I have a blog!

2009 January 18
by Matt

Well, it’s taken a while getting there, but I finally decided to make the time to put together a blog.

Welcome to my personal musings on all that I find interesting in the media, industry and just random rants about life in general!

I’ll be learning the ropes as I go along, so please excuse any awful grammar and sheer abuse of the blogging platform. :D