
November 23, 2009

Big Data on Grids or on Clouds?

Wolfgang Gentzsch, Senior Strategist at GridwiseTech, Open Grid Forum, and EU Project DEISA

Now that we have a new computing paradigm, Cloud Computing, how can Clouds help with our data? Will they replace our internal data vaults, as we once hoped Grids would? Are Grids dead now that we have Clouds? Despite all the promising developments in the Grid and Cloud computing space, and the avalanche of publications and talks on the subject, many people still seem confused about internal data and compute resources versus Grids versus Clouds, and they are hesitant to take the next step. I think a number of issues drive this uncertainty.

Grids didn't keep all their promises

Grids did not evolve (as some of us originally thought) into the next fundamental IT infrastructure for everything and everybody. Because of the diversity of computing and data environments, we had to develop different middleware (for departments, enterprises, global Grids, compute, data, sensors, scientific instruments, etc.), and had to face different usage models with different benefits. Enterprise Grids were (and are) providing better resource utilization and business flexibility, while global Grids are best suited to complex R&D collaboration with resource sharing. For enterprise use, setting up and operating Grids was often complicated and did not remove all the (data) bottlenecks. Researchers saw this as a necessary evil: implementing complex applications on supercomputers has never been easy. So what.

Grid: the way station to the Cloud

After 40 years of dealing with data processing, Grid computing was indeed the next big thing for the grand challenge R&D expert, while for the enterprise CIO, the Grid was a way station on the way to the Cloud model. For the enterprise today, Clouds provide the missing pieces: ease of use, economies of scale, business elasticity up and down, and pay-as-you-go (thus eliminating some capital expenditure). And where security matters, there is the private Cloud, within the enterprise’s firewall. In more complex enterprise environments, with applications running under different policies, private Clouds can easily connect via the Internet to (external) public Clouds -- and vice versa -- forming a hybrid Cloud infrastructure that balances security with efficiency.

Different policies, what does that mean?

No two data processing jobs are alike. Jobs differ by priority, strategic importance, deadline, budget, IP, and licenses. In addition, the nature of the code often calls for a specific computer architecture, operating system, memory, storage, and other resources. These differentiating factors strongly influence where and when a data processing job is run. Each job's specific requirements determine a set of policies that have to be defined and programmed, so that the job runs exactly according to those policies. Ideally, this is guaranteed by a dynamic resource broker that controls submission to Grid or Cloud resources, be they local or global, private or public.
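To make this concrete, here is a minimal sketch of what such a policy-driven broker could look like. The job attributes and routing rules are purely illustrative assumptions, not the policy language of any real Grid or Cloud product:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int          # 1 (low) .. 10 (high)
    sensitive_data: bool   # IP, patient records, etc. must stay in-house
    tightly_coupled: bool  # needs a low-latency HPC interconnect
    deadline_hours: float

def broker(job: Job) -> str:
    """Pick a target resource class for a job according to its policies."""
    if job.sensitive_data:
        return "private-cloud"   # never leaves the firewall
    if job.tightly_coupled:
        return "grid"            # needs the HPC interconnect
    if job.deadline_hours < 1 and job.priority >= 8:
        return "public-cloud"    # burst out for elasticity
    return "private-cloud"       # default: keep it internal
```

A real broker would of course evaluate many more policies (budget, licenses, current load), but the principle is the same: requirements in, placement decision out.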

Grids or Clouds?

One important question is still open: how do I find out, and then tell the resource broker, whether my data processing should run on the Grid or in the Cloud? The answer depends, among other things, on the algorithmic structure of the program, which might be intolerant of high latency and low bandwidth. These performance limitations show up mainly in parallel applications with tightly-coupled, data-intensive inter-process communication, running in parallel on hundreds or even thousands of processors or cores.

The good news, however, is that many applications do not require high bandwidth and low latency. Consider the parameter studies often seen in science, engineering, and business intelligence, where a single self-contained application executes with many different parameters, resulting in many independent jobs. The list of examples is extensive: analyzing the data from a particle physics collider, identifying the best solution parameters in optimization, running ensembles to quantify climate model uncertainties, screening a database of ligand structures to identify potential drug targets, studying the sensitivity of economic models to their parameters, and analyzing different materials and their resistance in crash tests, to name just a few.
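The shape of such a parameter study can be sketched in a few lines: one self-contained function, many parameter values, no communication between the runs. The toy "simulation" below is a stand-in for a real application (the stiffness/deformation model is entirely made up for illustration):

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(stiffness: float) -> tuple[float, float]:
    """Stand-in for one self-contained application run, e.g. one
    crash-test configuration; only the input parameter differs
    between jobs, and no job talks to any other."""
    deformation = 1000.0 / stiffness   # toy model, not real physics
    return stiffness, deformation

if __name__ == "__main__":
    params = [s * 10.0 for s in range(1, 101)]   # 100 independent jobs
    # Each call to simulate() is an independent job; a Grid or Cloud
    # broker could scatter these across any mix of resources.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, params))
    best = min(results, key=lambda r: r[1])
    print(f"smallest deformation at stiffness={best[0]}")
```

Because the jobs never communicate, latency and bandwidth between them hardly matter -- which is exactly why this class of workload maps so well onto Clouds.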

Big Data needs Grids or Clouds, and often both

Obviously, there is no “Grids or Clouds” for the enterprise. There is just “Grids and Clouds”; it really depends on the individual scenario. In general, CIOs have to evaluate three scenarios:

  • (1) the Private Cloud: optimizing and virtualizing the company’s internal enterprise IT infrastructure, including the data layer (here is where Momentum can help);
  • (2) the Hybrid Cloud: do (1) and connect to external clouds;
  • (3) the Public Cloud: do (2) and successively move data (and its processing) to the external cloud provider.

The choice of the best-suited scenario depends on many aspects: sensitive or competitive data and applications (e.g. medical patient records), individual return on investment, security policies, interoperability between private and public clouds, loss of control when moving data outside the corporation, cloud-enabling data and applications, the current software licensing model, protection of intellectual property, legal issues, and more.

The good news is that CIOs can always start with a hybrid infrastructure in mind: combining private and public cloud resources, balanced according to specific requirements. This provides the best of both worlds, avoiding the worst of each individual world.






Hi Wolfgang, it's been a few years since we worked together on Grid and I'm now working on cloud based deployments.

Your background in HPC shows in that you are talking about scheduling jobs. While that was the focus for Grid, for cloud there is a strong additional focus on interactive, transactional web services.

The other key difference is that grid was about locating resources that match what you need to run a job, while cloud uses virtualization to create the resources you want. It's more of a model-driven architecture than a resource matching problem.

Best wishes, Adrian


I have read your post on how to improve in a cloud the network path. You suggested a tool to determine "how many network hops are between two EC2 instances chosen at random, and how congested are those hops."

"Given this kind of insight, a re-allocation algorithm could detect poorly located EC2 instances that are failing an SLA and try to replace them with ones that behave better."

I thought this was a great idea for future cloud products.

However, in the classical grid scheduling of jobs, there is a beneficial transition toward SLAs assigned to applications delivered as services. HPC Grid software assumes there are never enough resources for everyone. This is why we have priorities, policies, and all sorts of complicated rules to create a rationale for how the resources should be allocated.

Grid's challenge is the lack of a guaranteed, constant SLA by definition. If I am alone on a 100-node grid, whether I need them or not, I get all the nodes to myself (the goal is to utilize all resources 100%). As a second user gets in, I might get 50 nodes, and if a high-priority user needs all 100 nodes, I am kicked out.

The difference between a grid and a cloud in HPC will be to have an elastic cloud, i.e. one can have spare pools of resources and an EC2 connector for public clouds, to bring in more resources (nodes) and so deliver the service level agreed upon. This implies the user has a pay-per-use model, while to him the cloud appears to have seemingly "infinite resources".

See here a definition for what is the cloud:

We have added to Sun Grid Engine both a Service Domain Manager (which manages the resources needed to fulfill an SLA) and a cloud connector with an AMI for Amazon EC2 clouds.

Surely, you point out that managing a cloud is more than scheduling jobs. You refer to persistent processes, not HPC loads where the service terminates once the job is executed.

There will be a handshake some day, but no one will notice. It will take place INSIDE the cloud. Users with either HPC or database-type loads will be mixed together, and they will see no difference. All the complexities will be hidden inside the cloud.

The word Grid will be replaced by the word Cloud. We do not have to decide between Grids and Clouds. All Grids will become clouds, with very few exceptions. I mean private clouds that create internally the same services one can get from Amazon, with their own hardware and software, and access outside resources only at peak demand.

Wolfgang, you write thought provoking ideas, as always.



Hi Adrian, Miha, nicely stated, thanks. Indeed, I am looking at clouds through my HPC eyes, trying to identify how my HPC jobs can best use (and benefit from) clouds, without introducing new problems such as performance losses or privacy and IP violations. As you know well (and as Miha's post also states), HPC consists mostly of compute-intensive jobs causing a non-persistent load, unlike the persistent load of transactional web services.

But beyond this essential difference, both loads (also including big data here) have many challenges in common, especially the mental (trust, control) and legal (HIPAA, FDA) issues which I discussed in my blog. To get over those issues, the hybrid cloud approach (private/public clouds) might help, especially in the beginning, including the use of the policy instruments Miha mentions.

For me, for the time being, supercomputers and Grids are here to stay. Not all workloads (today) can easily run on Clouds. Complex science is one such candidate for supercomputers and Grids, especially when it needs plumbing and tuning to optimally match a complex scientific workload with the underlying resources, and when collaboration and resource sharing in teams of widely distributed scientists is key. And when I say 'resources', I really mean servers, storage, scientific instruments and experiments, sensor networks, applications, and data.

Thanks for your valuable comments, Wolfgang

Dear Miha,

I have the feeling that you contradict yourself somewhat (no offense!).

You state: "Grid's challenge is the lack of guaranteed constant SLA by definition."

I can sympathize with that statement: Grids are about the management and scheduling of *scarce* resources, and SLAs cannot, by definition, be kept for all users.

But you go on to say: "The word Grid will be replaced by the word Cloud. We do not have to decide between Grids and Clouds." etc. But clouds manage not scarce but (virtually) *abundant* resources, and are thus able to implement strong SLAs.

You are thus implying, IMHO, that, when Grids become Clouds, or dissolve into Clouds, they will include abundant HPC resources. I can't see that happening, at least not in the near future, and not for the hard core HPC users. Can you?

Otherwise, if those HPC-Clouds had *scarce* HPC components, one would need the same complex scheduling and policy layer again to fairly manage those scarce resources, and would thus destroy the simple cloud abstraction (which, as you phrase it, simply creates or instantiates the required resources).

Best, Andre.

Andre, you mention yet another strong differentiator between Grids and Clouds when you say that Grids manage and schedule scarce resources, while Clouds manage abundant resources, for which SLAs make sense; I like that.
Whenever there is a real benefit for an HPC application to run in the Cloud, it will, as mentioned in my blog, especially now with the upcoming HPC virtualization technologies from ScaleMP, 3Leaf, RNA Networks, NextIO, and NumaScale. Still, many other applications will have to run on Grids or on high-end supercomputers, but not in the Cloud, because they require high bandwidth and low latency, which a Cloud does not provide (at least not today).

Hallo Wolfgang,

to be fair: the scarce/abundant resources distinction was (to my knowledge) formulated by Greg Pfister on the Cloud-Computing Google list. I like it, too, and it has a number of interesting implications.

For example, for our (grid) applications, it allows us to consider a cloud, such as EC2, as an additional Grid queue with zero waiting time and a fixed SLA, offering small disconnected resources. That way, we can easily expand trivially parallel application components into EC2, using the same Grid paradigms (scheduling, resource mapping, etc.) as before.
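This "cloud as just another queue" view fits naturally into an ordinary queue-selection heuristic. The sketch below is a hypothetical illustration (the queue names, wait times, and slowdown factors are invented), not Andre's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    wait_hours: float   # expected time spent waiting in the queue
    slowdown: float     # runtime multiplier vs. a reference machine
    max_cores: int      # size of the (possibly disconnected) resource

def pick_queue(queues: list[Queue], cores_needed: int,
               runtime_hours: float) -> Queue:
    """Choose the feasible queue with the earliest expected completion."""
    feasible = [q for q in queues if q.max_cores >= cores_needed]
    return min(feasible,
               key=lambda q: q.wait_hours + runtime_hours * q.slowdown)

queues = [
    Queue("grid-batch", wait_hours=6.0, slowdown=1.0, max_cores=4096),
    # EC2 modeled as a queue with zero wait, slower nodes, limited size:
    Queue("ec2", wait_hours=0.0, slowdown=1.5, max_cores=64),
]
```

Small trivially parallel jobs then flow to the zero-wait "ec2" queue, while large tightly-coupled jobs still land on the grid queue, all through the same scheduling interface.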

Anyway, I just wanted to mention the origin of the idea really...

Best, Andre.


"You are thus implying, IMHO, that, when Grids become Clouds, or dissolve into Clouds, they will include abundant HPC resources. I can't see that happening, at least not in the near future, and not for the hard core HPC users. Can you?"

The definition of clouds I use is a business model, not an architecture or a technology. HPC applications, being leading edge, are very cumbersome to deliver as a service today. The advanced user must know what is going on inside the Grid.

The problem is not whether HPC grids, once transformed into clouds, will include abundant resources. The problem is: will you have enough users to buy these HPC services?

If the answer is yes, then there is no reason why a Grid should not become a cloud, by adding the necessary, expensive HPC resources and tools like elasticity and billing. There are simply too many benefits to ignore. The answer comes down to sheer economics. Operating as Grids (no billing, no elasticity), we will perpetuate the image of Grids as black holes that suck in money ad infinitum, with no payback to investors. "HPC" is like a scarecrow to private investors (SiCortex just burned through $68M of venture capital before going under, after SGI, Cray, Thinking Machines, and so on). Aren't we tired of seeing only gargantuan players (IBM) or very small companies surviving in HPC?

You don't have to take my word that HPC will soon be delivered in Clouds. This is its only hope of being widely adopted. Project Magellan got $32M:

" a project that will deploy a large cloud computing test bed with thousands of Intel Nehalem CPU cores and explore commercial offerings from Amazon, Microsoft and Google.... Ultimately ... Magellan will look at cloud computing as a cost-effective and energy-efficient way for scientists to accelerate discoveries in a variety of disciplines, including analysis of scientific data sets in biology, climate change and physics, the DOE stated."


Isn't this exactly what we call HPC applications today? Sure, supercomputers consuming more power than a small city and running non-standard software will continue to exist for a while, but they will become rarer and rarer, and more powerful than ever, for niche research applications. But it will be more and more challenging to finance those dinosaur supercomputers, unless they can be integrated into a cloud, completely invisible to outside users prepared to pay good money for exotic, high-end, yet useful calculations.

Visiting one of the many dinosaur exhibitions with my son (this one was in Salt Lake City, Utah), I asked:

"Do I look like a dinosaur?"
"No dad. You are not extinct"

We should be grateful we are not extinct and do our best to adopt superior forms of delivering HPC.

2 cents,


Miha, when you say that "HPC will soon be delivered in Clouds", that's similar to saying that, for example, soon all travellers will travel by train. However, the spectrum of HPC is as wide as that of the means of transportation. No single mode of transportation serves us all in every circumstance, and it is not only economics that decides the choice. The same holds for HPC. No Cloud can replace Tier-0 (currently Pflop/s) or Tier-1 (Tflop/s) HPC systems for the foreseeable future. What I can see is that in HPC, Clouds might compete for some applications (e.g. parameter studies) with the regional or industry Tier-2 systems (say, up to #100 of the Top500 systems), because the Cloud's economy of scale might then become an additional decision factor, besides strategic, algorithmic, and mental requirements.

Hi Wolfgang, it has been a long time. In my opinion, there are two major areas that many people take for granted when putting data into a cloud:
1. The hard error rate of the underlying media and its impact on rebuild.
2. The undetectable error rate and its impact on rebuild.

Some of the people I am working with have 20PB-50PB of data. If you take a Hadoop cloud model with an OC-48 in between, you cannot keep up with the replication on disk, given the hard error rates of the media along with the AFR. I think the whole area needs to be rethought for archival data. Here is my current thinking.
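A back-of-envelope calculation shows why the OC-48 is the bottleneck. The 80% usable-payload figure below is an assumption for protocol overhead, not a measured number:

```python
# How long would it take to re-replicate 20 PB across an OC-48 link?
PB = 10**15                    # bytes in a petabyte (decimal)
OC48_BPS = 2.488e9             # OC-48 line rate, bits per second
PAYLOAD_BPS = OC48_BPS * 0.8   # assume ~80% usable after protocol overhead

def transfer_days(total_bytes: float, bits_per_second: float) -> float:
    """Days needed to push total_bytes through a link of the given rate."""
    return total_bytes * 8 / bits_per_second / 86400

print(f"20 PB over OC-48: ~{transfer_days(20 * PB, PAYLOAD_BPS):.0f} days")
```

On the order of two and a half years for a single full re-replication of the 20 PB case, before any disk failures during the transfer are even considered, which is the point: replication over a WAN link of that size cannot keep up with media error rates at this scale.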



Thanks, Henry, this is indeed one of the biggest challenges we face with very large files. Good to see your article on the Enterprise Storage Forum. I am nervously awaiting your answers to the many interesting (?) comments there. Thanks, Wolfgang
