June 25, 2009

The difficult marriage of cloud and data-intensive apps

Two days ago, Songnian Zhou, Platform's CEO, sent out an email blast with "one of the biggest announcements in the company's history". Platform, known for its LSF workload management family, has just announced its new product, iSF. This middleware makes it possible to build private Clouds and run applications on them.

This coincided (not by chance) with yesterday's announcement from Platform's historically biggest competitor.

Sun Grid Engine (SGE) 6.2u3 now features an adapter for the Amazon Elastic Compute Cloud (EC2), so SGE can manage execution hosts that are Amazon virtual machines. I said "historically" because, once both product lines enter the already crowded Cloud arena, the balance of power might look quite different.

In certain sectors that were early adopters of Grids, migration to the Cloud is bound to happen soon. Pharmaceuticals is a good example. As Bob Cohen told me today, and pointed out in his recent presentation:

  • Eli Lilly has already tried using the external Amazon EC2 Cloud.
  • GlaxoSmithKline is looking at using internal Clouds.

As I remember, years ago Glaxo was among the early Grid users. Like many other pharmas, they used software from UnivaUD to distribute protein docking simulations to a large number of machines. Now UnivaUD also sells Cloud services.

This supposedly common movement from Grid to Cloud is thought-provoking. What does this "evolutionary step" really mean? Something conceptually quite simple: the Grid makes it possible to manage processes on many physical machines. The Cloud offers an even greater potential: to manage processes on many virtual machines, or even to manage those virtual machines as if they were processes. This is what VMware vSphere offers, and what internally powers Amazon EC2. So:

Cloud = set of virtual machines managed by a scheduler (Grid).
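To make that equation a little more concrete, here is a toy sketch in Python. It is not how vSphere, EC2, LSF or SGE work internally; the names `VirtualMachine` and `ToyScheduler` are hypothetical, invented for illustration only. It shows a scheduler that places jobs on virtual machines and also starts and stops the machines themselves, much as a Grid scheduler starts and stops processes.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Hypothetical stand-in for one virtual execution host."""
    name: str
    jobs: list = field(default_factory=list)


class ToyScheduler:
    """Illustrative only: manages jobs on VMs, and the VMs themselves."""

    def __init__(self):
        self.vms = []

    def start_vm(self, name):
        # The scheduler creates a VM on demand, like spawning a process.
        vm = VirtualMachine(name)
        self.vms.append(vm)
        return vm

    def submit(self, job):
        # Place the job on the least-loaded VM (ties go to the oldest VM).
        if not self.vms:
            raise RuntimeError("no virtual machines are running")
        vm = min(self.vms, key=lambda v: len(v.jobs))
        vm.jobs.append(job)
        return vm.name

    def stop_vm(self, name):
        # Stopping a VM is like killing a process: its jobs are resubmitted.
        vm = next(v for v in self.vms if v.name == name)
        self.vms.remove(vm)
        for job in vm.jobs:
            self.submit(job)


scheduler = ToyScheduler()
scheduler.start_vm("vm-a")
scheduler.start_vm("vm-b")
placements = [scheduler.submit(j) for j in ("job-1", "job-2", "job-3")]
scheduler.stop_vm("vm-a")
```

After `vm-a` is stopped, its jobs end up on `vm-b`. The point is only that the unit being scheduled can be a machine as well as a process.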

All those named above are great products, if an internal Cloud is what you want. The thing is: in solving large data challenges, the Cloud is no less limited than its predecessor, the Grid. Chris Dagdigian of gridengine.info said at the recent BioIT that he "solved real problems" on the Cloud. That is not surprising (BioTeam once teamed with Univa to demonstrate Grid Engine on AWS), but such a statement needs an explicit remark: problems solvable on the Cloud are still a small subset of the world's important data processing challenges.

Virtualized Cloud environments are perfectly isolated from each other. If you have one, pray that your computations can be domain-decomposed into millions of perfectly independent pieces. Protein docking, mentioned earlier, is like that: thousands of simulations, one per set of chemical compounds, that do not need to communicate. But most apps in most industry sectors (not just bioinformatics) do not share these characteristics: they require intensive database querying and/or data sharing. Genomics is like that. The Cloud will not help here; it may even make things more difficult. I also agree here with another of Chris's statements: Cloud data ingest is a pain.
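The difference between the two kinds of workload can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not real docking or genomics code: `dock` is a hypothetical stand-in for one simulation, and `count_shared_records` is a hypothetical stand-in for a query that needs data from more than one place. Threads play the role of isolated virtual machines.

```python
from concurrent.futures import ThreadPoolExecutor


def dock(compound: str) -> int:
    """Hypothetical stand-in for one docking run: uses only its own input."""
    return sum(ord(ch) for ch in compound) % 100


def run_independent(compounds, workers=4):
    # No task needs anything from any other task, so each isolated worker
    # (a thread here, a virtual machine in a Cloud) can take its own share.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dock, compounds))


def count_shared_records(partition_a, partition_b):
    # A data-sharing task: the answer depends on records held in BOTH
    # partitions, so one side's data has to be moved to the other first.
    return len(set(partition_a) & set(partition_b))


compounds = ["aspirin", "ibuprofen", "caffeine", "taxol"]
scores = run_independent(compounds)
overlap = count_shared_records(["gene1", "gene2"], ["gene2", "gene3"])
```

The first function scales by simply adding workers. The second cannot be answered by either partition alone, which is where isolated environments start to hurt.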

The answer to large data challenges is a puzzle of three pieces:

  1. efficient distributed processing
  2. efficient data provisioning
  3. efficient storage

Workload management engines (Grids, Clouds, you name it) provide the first piece. Today's data-intensive apps need the full stack: an efficient integration of (1), (2) and (3), that is, a fully scalable data integration. Where's the challenge then?

Efficient processing is easy. With many great scheduling vendors, this is not rocket science any more. Efficient storage is becoming commonplace too, with interesting examples of federated storage distributions. The trick is in the middle layer: an efficient connection between these two. And that is really difficult. There is certainly no universal solution, but we have recently had some successes here. My colleague Tomasz Mikolajczyk will be talking about it next Monday, June 29, 2009, at the ISMB conference in Stockholm, Sweden:

Technology Track, 2:45 p.m.: "Your large data: query, process, share - in no time"

If you are there, please join. If not, I will soon comment more, here at bigdatamatters.

Comments

Regarding Sun Grid Engine and its EC2 adapter, it should be possible to use it in a private cloud (aka internal cloud). Although various clouds have different APIs, some of them support Amazon's EC2 API as well; e.g. Eucalyptus (http://open.eucalyptus.com/) provides support for Amazon's cloud APIs (i.e. EC2, S3, and EBS). This means that, at least theoretically, it is possible to deploy SGE into a private Eucalyptus-powered cloud in order to control cloud resources.
