Big Data on Grids or on Clouds?
Now that we have a new computing paradigm, Cloud Computing, how can Clouds help with our data? Will they replace our internal data vaults as we hoped Grids would? Are Grids dead now that we have Clouds? Despite all the promising developments in the Grid and Cloud computing space, and the avalanche of publications and talks on this subject, many people still seem to be confused about internal data and compute resources versus Grids versus Clouds, and they are hesitant to take the next step. I think there are a number of issues driving this uncertainty.
Grids didn't keep all their promises
Grids did not evolve (as some of us originally thought) into the next fundamental IT infrastructure for everything and everybody. Because of the diversity of computing and data environments, we had to develop different middleware (department, enterprise, global, compute, data, sensors, scientific instruments, etc.), and we had to face different usage models with different benefits. Enterprise Grids were (and are) providing better resource utilization and business flexibility, while global Grids are best suited to complex R&D collaboration with resource sharing. For enterprise usage, setting up and operating Grids was often complicated and did not remove all the (data) bottlenecks. For researchers, this was seen as a necessary evil: implementing complex applications on supercomputers has never been easy. So what.
Grid: the way station to the Cloud
After 40 years of dealing with data processing, Grid computing was indeed the next big thing for the grand challenge R&D expert, while for the enterprise CIO, the Grid was a way station on the way to the Cloud model. For the enterprise today, Clouds are providing all the missing pieces: ease of use, economies of scale, business elasticity up and down, and pay-as-you-go pricing (thus getting rid of some capital expenditure). And in cases where security matters, there is the private Cloud, within the enterprise’s firewall. In more complex enterprise environments, with applications running under different policies, private Clouds can easily connect via the Internet to (external) public Clouds -- and vice versa -- forming a hybrid Cloud infrastructure that balances security with efficiency.
Different policies, what does that mean?
No data processing job is alike. Jobs differ by priority, strategic importance, deadline, budget, IP, and licenses. In addition, the nature of the code often necessitates a specific computer architecture, operating system, memory, storage, and other resources. These important differentiating factors strongly influence where and when a data processing job is run. For each job, its specific requirements determine a set of policies that have to be defined and programmed, so that the job runs exactly according to those policies. Ideally, this is guaranteed by a dynamic resource broker that controls submission to Grid or Cloud resources, be they local or global, private or public.
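As a rough sketch of what such a policy-driven broker could look like: the job fields and threshold values below are purely illustrative assumptions, not taken from any specific Grid or Cloud middleware.

```python
from dataclasses import dataclass

@dataclass
class Job:
    """Hypothetical job description; fields mirror the policy factors above."""
    name: str
    priority: int          # 1 (low) .. 10 (high)
    sensitive_data: bool   # IP, licensing, or confidentiality restrictions
    deadline_hours: float
    cores_needed: int

def select_resource(job: Job) -> str:
    """Toy resource broker: map a job's policies to a target resource pool.

    The rules are illustrative only; a real broker would also consult
    live resource availability, cost, and site-specific policies.
    """
    if job.sensitive_data:
        return "private-cloud"       # keep sensitive data behind the firewall
    if job.cores_needed > 512:
        return "grid"                # large-scale, tightly coupled work
    if job.deadline_hours < 1 and job.priority >= 8:
        return "public-cloud-burst"  # pay-as-you-go elasticity for urgent jobs
    return "private-cloud"

jobs = [
    Job("drug-screening", 5, False, 24, 2000),
    Job("quarterly-report", 9, True, 2, 8),
    Job("ad-hoc-analytics", 8, False, 0.5, 16),
]
for j in jobs:
    print(j.name, "->", select_resource(j))
```

The point is not the particular rules but the separation: policies are declared per job, and a single broker applies them uniformly across private, public, Grid, and Cloud resources.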
Grids or Clouds?
One important question is still open: how do I find out, and then tell the resource broker, whether my application and its data should run on the Grid or in the Cloud? The answer, among others, depends on the algorithmic structure of the program, which might be intolerant of high latency and low bandwidth. These performance limitations are exhibited mainly by parallel applications with tightly coupled, data-intensive inter-process communication, running in parallel on hundreds or even thousands of processors or cores.
The good news, however, is that many applications do not require high bandwidth and low latency. Consider the parameter studies often seen in science, engineering, and business intelligence, where a single self-contained application executes with many different parameters, resulting in many independent jobs. The list of examples is extensive: analyzing the data from a particle physics collider, identifying the solution parameter in optimization, ensemble runs to quantify climate model uncertainties, identifying potential drug targets via screening a database of ligand structures, studying economic model sensitivity to parameters, and so on.
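A minimal sketch of such a parameter study, assuming a toy model function as a stand-in for the self-contained application: each run is independent, so the jobs could be farmed out to any mix of Grid and Cloud nodes; a local process pool stands in for that here.

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(param: float) -> float:
    """Stand-in for one self-contained application run; here, a toy model."""
    return param * param - 3 * param + 2

if __name__ == "__main__":
    params = [0.5 * i for i in range(20)]  # the parameter sweep
    # No inter-process communication between runs, so latency and
    # bandwidth between workers do not matter: ideal for Clouds.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, params))
    value, best_param = min(zip(results, params))
    print(f"best parameter: {best_param} (value {value})")
```

Because the runs never talk to each other, the same pattern scales from a laptop to thousands of loosely coupled Cloud instances without changing the algorithm.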
Big Data needs Grids or Clouds, and often both
There is no “Grids or Clouds” for the enterprise. There is just “Grids and Clouds”; it really depends on the individual scenario.
In general, CIOs have to evaluate three different scenarios:
- (1) the Private Cloud: optimizing and virtualizing the company’s internal enterprise IT infrastructure, including the data layer (here is where Momentum can help);
- (2) the Hybrid Cloud: do (1) and connect to (external) public Clouds;
- (3) the Public Cloud: do (2) and successively move data (processing) to the external cloud provider.
The choice of the best-suited scenario depends on many aspects: sensitive or competitive data and applications (e.g. medical patient records), individual return on investment, security policies, interoperability between private and public clouds, loss of control when moving data outside the corporation, cloud-enabling data and applications, the current software licensing model, protection of intellectual property, legal issues, and more.
The good news is that CIOs can always start with a hybrid infrastructure in mind: combining private and public cloud resources, balanced according to specific requirements. This provides the best of both worlds, avoiding the worst of each individual world.