He with the most data wins (?)
Relational databases typically don’t operate at such a size. To my knowledge, the world’s largest production SQL database is owned by Yahoo!. It reportedly held 1 petabyte in May 2008, and if the bold estimates of their VP were right, it would have grown tenfold by now. That’s PostgreSQL. Oracle installations are usually more modest in size. In this 2008 presentation they list only four customers with Oracle data warehouses above 50 TB: AT&T, NYSE Euronext, Sprint/Nextel and Yahoo! – again – at 250 TB. The largest I know of is the CERA database of the World Data Center for Climate (WDCC), handling 400 TB in a federation of Oracle 9i databases.
I am not going to pursue the Yahoo! direction any further, and I will put aside the Internet industry in this discussion, with its obvious winners in terms of storage size: Google and the Web 2.0 crowd. Curt Monash recently gave a nice summary of this: eBay’s Greenplum data warehouse has 6.5 petabytes, Facebook has 2.5. The Internet industry is a case of its own, because besides storing data it also generates data. Most data on the Internet is also about the Internet, replicated and secondary in nature. Google’s indexed storage is nothing more than a replica of the Internet. Your email provider stores endless repetitive information in your mailbox, generated by people hitting the Reply All button.
I haven’t seen any recent attempt to estimate how much data the Internet holds. However, calculating an upper bound on the data mankind has today is not that difficult. I think humanity consumes on the order of 50 exabytes of new storage every year. This is an educated guess based on my rough knowledge of the production capacity of the storage industry today (argue with me!). Obviously, not all of those hard disk drives are immediately filled with data. A similar number, 50 EB, is probably also equal to all the digital data that mankind has ever generated, with every copy of a stolen MP3 counted separately. Never mind.
Even though it would be cool to estimate how much data the Internet holds, I am not sure what clever conclusion one might reach, besides the obvious one that mankind wouldn’t lose much if 99% of it disappeared.
Let’s put aside the issue of self-generating data. It is more interesting how far we can get in generating the ‘primary’ data, by which I mean the data that describes the world, mankind, business and industry. In which domain should we expect the most dramatic data explosion? As I have already said, physics experiments are expected to generate a few petabytes every year. Public repositories of satellite geological data reach numbers of a similar order of magnitude. WDCC (mentioned earlier in relation to Oracle) holds 6 PB of climate data on tapes. Industry still sits behind these scientific examples: Walmart, once said to run US industry’s largest data warehouse, had 2.5 petabytes at last count. This is less than CERN, but at least in the same order of magnitude. The gap is not as dramatic as it was a decade ago.
It might be that bioinformatics will soon become the number one storage-intensive discipline. In the Virolab project, we contribute to the clinical efforts to fight AIDS by handling and processing a large number of viral genome sequences, sequenced from thousands of HIV-positive patients. These we store in a complex federated database system spanning multiple data banks across Europe. But even this amount of data, collected through the decades-long effort of a dozen hospitals, is trivial when compared to the needs of certain disciplines of genomics.
Sequencing a single human genome generates 750 MB of data, enough to fill a CD. The genome of a microbe might be 10 times smaller. A relatively new branch of genomics is metagenomics, dealing with genetic material – typically belonging to various microbes – recovered from the environment. I hear that J. Craig Venter, the central figure in decoding the human genome, is off on his second Global Ocean Sampling yachting expedition. His boat Sorcerer II (now probably somewhere in the Caribbean) takes a sample of sea water every 200 miles on its journey across the Atlantic. Every sample – potentially containing millions of microorganisms – will be subject to shotgun sequencing, so that scientists can reason about the distribution of genes in the environment. The data is made available to science by the CAMERA project. If all the genes from all the samples were sequenced and stored in the repository, CAMERA would soon surpass in size any of the data warehouses I quoted earlier – by two or three orders of magnitude. This of course isn’t happening any time soon, for practical reasons. The cost of sequencing a single genome is still a few thousand USD, but it is falling, and sequencing may soon become available to the masses.
Now this idea of extracting genetic information from the water around the globe raises a disturbing question. In order to understand the universe, are we going to first put the entire universe into the digital domain? Of course not, because then the storage would become larger than the Earth. Still, it seems we are going in this weird direction. The point at which the act of digitizing information becomes environmentally visible should raise social concern. Distant future? No. We have already hit that point, with the largest data centers in banking and IT generating enormous amounts of heat and needing dedicated power plants. Add to this an even more alarming quote: maintaining an avatar in the Second Life virtual reality game requires 1,752 kilowatt hours of electricity per year. That is almost as much as an average Brazilian uses (N. Carr).
Ultimately, the idea of creating a digital copy of the world is pointless for yet another reason. In the case of sequencing the genome we can’t even properly say that the data is being digitized, because DNA – with its four-letter alphabet – already *is* perfectly digital information.
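To make the four-letter-alphabet point concrete, here is a minimal sketch of how a DNA sequence maps onto bits. Four letters need exactly 2 bits each, so four bases fit in one byte. The code table (A=00, C=01, G=10, T=11), the function names and the round figure of 3 billion bases for a human genome are my own assumptions for illustration, not any standard format, and real sequencing data also carries ambiguous bases and quality scores that this ignores. Under those assumptions the arithmetic lands on the 750 MB figure quoted earlier.

```python
# Illustrative sketch only: the code table and function names are hypothetical.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Assumption: a human genome is roughly 3 billion bases long.
HUMAN_GENOME_BASES = 3_000_000_000


def pack(sequence: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk; the unused low bits stay zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed for n_bases at 2 bits per base, rounded up."""
    return (n_bases * 2 + 7) // 8


if __name__ == "__main__":
    packed = pack("GATTACA")
    print(len(packed), unpack(packed, 7))
    print(packed_size_bytes(HUMAN_GENOME_BASES) / 1e6, "MB")
```

Nothing is lost in the round trip: `unpack(pack(s), len(s))` gives back `s` for any string over A, C, G and T, which is all that "already digital" really means here.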
Paraphrasing an old Sun slogan, I like to think that the Universe is the computer, so we don’t need another one.


