May 22, 2009

He with the most data wins (?)

When we consulted for the United States Geological Survey four years ago, their tape robot operated a 2-petabyte data repository. Back then I thought we were dealing with one of the world’s largest data collections. I hear that CERN currently has 15 petabytes and expects to add a few more every year of LHC operation.

Relational databases typically don’t operate at such sizes. To my knowledge, the world’s largest production SQL database is owned by Yahoo!. It reportedly held 1 petabyte in May 2008, and if the bold estimates of their VP were true, it would have grown tenfold by today. That’s PostgreSQL. Oracle installations are usually more modest in size. In this 2008 presentation they list only four customers with Oracle data warehouses above 50 TB: AT&T, NYSE Euronext, Sprint/Nextel and Yahoo! (again) at 250 TB. The largest Oracle installation I know of is the CERA database of the World Data Center for Climate (WDCC), handling 400 TB in a federation of Oracle 9i.

I am not going to elaborate in the Yahoo! direction, and I will put aside the Internet industry in this discussion, with its obvious winners in terms of storage size: Google and the Web 2.0 crowd. Curt Monash recently gave a nice summary of this: eBay’s Greenplum data warehouse has 6.5 petabytes, Facebook has 2.5. The Internet industry is a case of its own, because besides storing data it also generates data. Most data on the Internet is also about the Internet, replicated and secondary in nature. Google’s indexed storage is nothing more than a replica of the Internet. Your email provider stores countless repetitive messages in your mailbox, generated by people hitting the Reply All button.

I haven’t seen any recent attempt to estimate how much data the Internet holds. However, calculating an upper bound on the data mankind has today is not that difficult. I think humanity consumes on the order of 50 exabytes of new storage every year. This is an educated guess based on my rough knowledge of the production capacity of the storage industry today (argue with me!). Obviously not all of those hard disk drives are immediately filled with data. A similar number, 50 EB, is probably also equal to all the digital data that mankind has ever generated, with every copy of a stolen MP3 counted separately. Never mind.

Even though it would be cool to estimate how much data the Internet has, I am not sure what clever conclusion one might reach, besides the obvious one that mankind wouldn’t lose much if 99% of it disappeared.

Let’s put aside the issue of self-generating data. It is more interesting to ask how far we can get in generating ‘primary’ data, by which I mean the data that describes the world, mankind, business and industry. In which domain should we expect the most dramatic data explosion? As I have already said, physics experiments are expected to generate a few petabytes every year. Public repositories of satellite geological data reach numbers of a similar order of magnitude. WDCC (mentioned earlier in relation to Oracle) holds 6 TB of climate data on tapes. Industry still sits behind these scientific examples: Walmart, once said to run US industry’s largest data warehouse, had 2.5 petabytes at last count. This is less than CERN, but at least in the same order of magnitude. The gap is not as dramatic as it was a decade ago.

It might be that bioinformatics will soon become the number one storage-intensive discipline. In the Virolab project, we contribute to the clinical efforts to fight AIDS by handling and processing large numbers of viral genome sequences, sequenced from thousands of HIV-positive patients. These we store in a complex federated database system spanning multiple data banks across Europe. But even this amount of data, collected through the decades-long effort of a dozen hospitals, is trivial when compared to the needs of certain disciplines of genomics.

Sequencing a single human genome generates 750 MB of data, enough to fill a CD. The genome of a microbe might be 10 times smaller. A relatively new branch of genomics is metagenomics, which deals with genetic material (typically belonging to various microbes) recovered from the environment. I hear that J. Craig Venter, the central figure in decoding the human genome, is off on his second Global Ocean Sampling yachting expedition. His boat Sorcerer II (now probably somewhere in the Caribbean) takes a sample of sea water every 200 miles on its cross-Atlantic journey. Every sample, potentially containing millions of microorganisms, will be subject to shotgun sequencing, so that scientists can reason about the distribution of genes in the environment. The data is made available to science by the CAMERA project. If all the genes from all the samples were sequenced and stored in the repository, CAMERA would soon exceed in size any of the data warehouses I quoted earlier, by two or three orders of magnitude. This of course isn’t happening any time soon, for practical reasons. The cost of sequencing a single genome is still a few thousand USD, but it is going down, and sequencing may soon become available to the masses.
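To get a feel for these numbers, here is a rough back-of-envelope sketch in Python. It takes the 750 MB and "10 times less" figures from the paragraph above. The one million organisms per sample is only an assumption (the low end of "millions"), and the function names are made up for illustration. The estimate also ignores compression, overlapping reads and duplicate species, and it pretends every organism is sequenced in full, so treat the result as an upper-end guess rather than a measurement.

```python
import math

# Figures taken from the text.
HUMAN_GENOME_MB = 750
MICROBE_GENOME_MB = HUMAN_GENOME_MB / 10   # "10 times less"

# ASSUMPTION: one million organisms per sample (low end of "millions").
ORGANISMS_PER_SAMPLE = 1_000_000

MB_PER_TB = 1_000_000   # decimal units
TB_PER_PB = 1_000


def sample_size_tb(organisms=ORGANISMS_PER_SAMPLE, genome_mb=MICROBE_GENOME_MB):
    """Storage for one fully sequenced water sample, in terabytes."""
    return organisms * genome_mb / MB_PER_TB


def samples_to_reach(target_pb, per_sample_tb):
    """Number of whole samples needed to reach a repository of target_pb petabytes."""
    return math.ceil(target_pb * TB_PER_PB / per_sample_tb)


per_sample = sample_size_tb()
print(per_sample)                         # terabytes per sample
print(samples_to_reach(2.5, per_sample))  # samples needed to match a 2.5 PB warehouse
```

Under these assumptions a single sample comes to 75 TB, and a few dozen samples would already match the 2.5-petabyte warehouse mentioned earlier.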

Now this idea of extracting genetic information from the water around the globe raises a disturbing question. In order to understand the universe, are we going to first put the entire universe into the digital domain? Of course not, because then the storage would become larger than the Earth. Still it seems we are going in this weird direction. The point at which the act of digitizing information becomes environmentally visible should raise social concern. Distant future? No. We have already hit that point, with the largest data centers in banking and IT generating enormous amounts of heat and needing dedicated power plants. Add to this a yet more threatening quote: maintaining an avatar in the Second Life virtual reality game requires 1,752 kilowatt hours of electricity per year. That is almost as much as is used by an average Brazilian (N. Carr).
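The 1,752 kWh figure is easier to picture as a continuous power draw. A quick conversion, assuming a 365-day year (the function name is just for illustration):

```python
KWH_PER_YEAR = 1752          # figure quoted in the text
HOURS_PER_YEAR = 24 * 365    # 8760 hours, ignoring leap years


def average_watts(kwh_per_year):
    """Convert yearly energy use in kWh into an average continuous draw in watts."""
    return kwh_per_year * 1000 / HOURS_PER_YEAR


print(average_watts(KWH_PER_YEAR))  # 200.0
```

In other words, the quote amounts to an avatar drawing 200 watts around the clock.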

Ultimately, the idea of creating a digital copy of the world is pointless for yet another reason. In the case of sequencing the genome we can’t even properly say that the data is being digitized, because DNA, with its four-letter alphabet, already *is* perfectly digital information.
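The four-letter alphabet maps directly onto two bits per base. A minimal sketch of such a packing follows. The bit assignment and the function names are arbitrary choices made for illustration, not any standard format, and real sequence data also carries ambiguity codes and quality scores that this ignores. The figure of roughly 3 billion bases for a human genome is my approximation and is not stated in the post.

```python
# ASSUMPTION: an arbitrary 2-bit code for the four bases.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"  # index = 2-bit code


def packed_size_bytes(n_bases):
    """Bytes needed to store n_bases at 2 bits each (4 bases per byte, rounded up)."""
    return (n_bases + 3) // 4


def pack(seq):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    out = bytearray(packed_size_bytes(len(seq)))
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the original string; length is needed because the last byte may be padded."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


seq = "GATTACA"
packed = pack(seq)
print(len(packed))                       # 2 bytes for 7 bases
print(unpack(packed, len(seq)) == seq)   # True: the round trip loses nothing
print(packed_size_bytes(3_000_000_000))  # 750000000 bytes for ~3 billion bases
```

At two bits per base, about 3 billion bases come to 750 MB, which matches the per-genome figure quoted above.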

Paraphrasing an old Sun slogan, I like to think that the Universe is the computer, so we don’t need another one.
