The difficult marriage of cloud and data-intensive apps
This coincided (not by chance) with yesterday's announcement from Platform's historically biggest competitor.
The Sun Grid Engine 6.2u3 (SGE) now features an adapter for Amazon Elastic Compute Cloud (EC2), so SGE can manage execution hosts that are Amazon virtual machines. I said "historically" because, once both product lines enter the already crowded Cloud arena, the balance of power might look quite different.
In certain sectors that were the early adopters of Grids, migration to the Cloud is bound to happen soon. Pharmaceuticals is a good example. As Bob Cohen told me today, and as he pointed out in his recent presentation:
- Eli Lilly has already tried using the Amazon EC2 external Cloud
- GlaxoSmithKline is looking at using internal Clouds
As I remember, years ago Glaxo was among the early Grid users. Like many other pharmas, they used software from UnivaUD to distribute protein docking simulations to a large number of machines. Now UnivaUD also sells Cloud services.
This supposedly common movement from Grid to Cloud is thought-provoking. What does this "evolutionary step" really mean? Something conceptually quite simple: the Grid makes it possible to manage processes on many physical machines. The Cloud offers an even greater potential: to manage processes on many virtual machines, or even to manage those virtual machines as if they were processes. This is what VMware vSphere offers, and what internally powers Amazon EC2. So:
Cloud = set of virtual machines managed by a scheduler (Grid).
All those named above are great products, if what you want is an internal Cloud. The thing is: in solving large data challenges, the Cloud is no less limited than its predecessor, the Grid. Chris Dagdigian of gridengine.info said at the recent BioIT that he "solved real problems" on the Cloud. That is not surprising (BioTeam once teamed with Univa to demonstrate Grid Engine on AWS), but such a statement needs an explicit remark: problems solvable on the Cloud are still a small subset of the world's important data processing challenges.
Virtualized Cloud environments are perfectly isolated from each other. If you have one, pray that the only tasks you need to compute can be domain-decomposed into millions of perfectly independent pieces. Protein docking, mentioned earlier, is like that: thousands of simulations, one per set of chemical compounds, that do not need to communicate. But most apps in most industry sectors (not just bioinformatics) do not share these characteristics: they require intensive database querying and/or data sharing. Genomics is like that. The Cloud will not help here, and may even make things more difficult. I also agree with another of Chris's statements: Cloud data ingest is a pain.
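To make the distinction concrete, here is a minimal Python sketch, not taken from any of the products named above. The function names (`dock_score`, `run_independent`, `count_overlaps`) and the toy scoring rule are hypothetical stand-ins, and threads stand in for isolated virtual machines. The first workload splits into pieces that never communicate; the second needs every piece to see the whole data set.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_score(compound: str) -> int:
    """Hypothetical stand-in for one docking simulation.

    The result depends only on this task's own input, so the task
    could run on any isolated machine without talking to the others.
    """
    return sum(ord(ch) for ch in compound) % 100


def run_independent(compounds, workers=4):
    """Run one task per compound; no task reads another task's data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whichever task finishes first.
        return list(pool.map(dock_score, compounds))


def count_overlaps(records):
    """Contrast case: an all-against-all comparison.

    For each record, count the other records that share at least one
    character with it. Every task needs access to the full data set,
    so isolated workers would each need a copy or a shared store.
    """
    return {
        rec: sum(1 for other in records if other != rec and set(rec) & set(other))
        for rec in records
    }


if __name__ == "__main__":
    print(run_independent(["aspirin", "caffeine", "ibuprofen"]))
    print(count_overlaps(["ab", "bc", "xy"]))
```

The first function scales by simply adding workers. The second scales only if the data can reach every worker efficiently, which is where data provisioning and storage come in.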
The answer to large data challenges is a puzzle of three pieces:
- efficient distributed processing
- efficient data provisioning
- efficient storage
Workload management engines (Grids, Clouds, you name it) provide only the first piece. Today's data-intensive apps need the full stack: an efficient, fully scalable integration of (1), (2) and (3). Where's the challenge then?
Technology Track, 2:45 p.m.: "Your large data: query, process, share - in no time"
If you are there, please join. If not, I will soon comment more, here at bigdatamatters.



Regarding Sun Grid Engine and its EC2 adapter, it should be possible to use it in a private cloud (aka internal cloud). Though various clouds have different APIs, some of them support Amazon's EC2 API as well, e.g. Eucalyptus http://open.eucalyptus.com/ provides support for Amazon's cloud APIs (i.e. EC2, S3, and EBS). This means that, at least theoretically, it is possible to deploy SGE into a private Eucalyptus-powered cloud in order to control cloud resources.
Posted by: Chris Wilk | June 25, 2009 at 01:39 PM