
July 17, 2009

NoSQL – the new wave against RDBMS

Over the past month, the blogosphere has devoted a great deal of coverage to the NoSQL movement. I first came across it in this article on the Computerworld web portal and have been following the heavy traffic on the subject since.

NoSQL held their inaugural get-together in San Francisco last month to discuss a future where traditional RDBMS's from the likes of Oracle, Microsoft and IBM are consigned to history in favor of open source data stores. Their ethos is that traditional RDBMS's are not scalable and force data to be twisted to fit into the relational world. What is the likelihood of a world where legacy systems are driven by the new breed of data stores?

NoSQL proponents began by developing data stores in-house, emulating those built by Google and Amazon. These are now handling hundreds of terabytes or even petabytes of data for thriving Web 2.0 companies. Notable open-source projects include Hadoop, Voldemort, Cassandra, CouchDB and Dynomite, among others. Proprietary cloud-based data stores include Google App Engine's Datastore, Amazon SimpleDB, Windows Azure Storage Services and Force.com Database Services.

The main principles behind these projects are well summarized by Martin Kleppmann and Tony Bain. In essence, data stores are distributed key/value stores that provide unlimited scalability and store data in a form closely modeled on application objects, removing the need for ORM plumbing code. The downsides are the loss of data integrity, which has to be managed by application code, and the difficulty of performing business intelligence on the data. A work-in-progress comparison of alternative data stores can be found here.
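To make the trade-off concrete, here is a minimal sketch of the idea, not the design of any of the projects named above. The class `ShardedKeyValueStore`, the function `save_order` and the key naming scheme are all hypothetical, and each "node" is just an in-process dictionary standing in for a server. It shows whole objects being stored under a key without any ORM mapping, and an integrity check that a relational foreign key would normally enforce being done by application code instead.

```python
import hashlib
import json


class ShardedKeyValueStore:
    """Toy key/value store; each dict stands in for one server in a cluster."""

    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Route the key to a node by hashing it (simple modulo placement).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        # The whole object is stored as one JSON document: no tables, no ORM.
        self._node_for(key)[key] = json.dumps(value)

    def get(self, key):
        raw = self._node_for(key).get(key)
        return None if raw is None else json.loads(raw)


def save_order(store, order):
    # The store knows nothing about relationships between keys, so the
    # application has to check that the referenced customer exists.
    if store.get("customer:" + order["customer_id"]) is None:
        raise ValueError("unknown customer " + order["customer_id"])
    store.put("order:" + order["id"], order)


store = ShardedKeyValueStore()
store.put("customer:42", {"name": "Ada", "emails": ["ada@example.com"]})
save_order(store, {"id": "1001", "customer_id": "42", "items": ["book", "pen"]})
```

Note that nothing stops another code path from calling `store.put` directly and skipping the check, which is exactly the integrity risk described above.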

The key thing that strikes me about data stores is that they cannot be viewed as databases. Nati Shalom confirms this with his excellent explanation using Amazon's SimpleDB as the example. In his post he highlights the limitations of data stores: their lack of referential integrity, transaction support and data consistency (ACID). He also highlights the areas where their use can bring benefits, notably Web 2.0 applications where the requirement is mostly for read-only data with loosely defined schemas.

Many companies are seeing the performance of their existing RDBMS's drop off as the requirement to process ever larger sets of data increases. Data stores offer an attractive prospect with their unlimited scalability and simplicity of data management. However, experience shows that their use is only realistic when developing applications from scratch. Simply removing an existing RDBMS and plugging in a data store is not an option for legacy systems, because responsibility for the ACID properties would have to move from the RDBMS to the application layer.

Additional points of consideration include the lack of vendor support for data stores and their relative immaturity. I will watch their evolution with interest, but it is clear that data stores are currently a no-go area for companies who want to solve their existing RDBMS's scalability and performance limitations.

An alternative approach is to combine RDBMS's with one of the available in-memory data grids, such as the open source Memcached and JBoss Cache or the proprietary GemFire, Oracle Coherence and GigaSpaces XAP. In-memory data grids provide scalability and performance increases with no changes to the underlying RDBMS's and minimal changes to the application layer. Companies benefit from the in-memory data grid fulfilling the performance and scalability requirements, leaving the RDBMS to do what it does best, i.e. maintain the ACID properties.
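As a rough sketch of that division of labour, the example below uses Python's built-in `sqlite3` module as a stand-in for the RDBMS and a plain dictionary as a stand-in for the data grid. None of the products named above are used, and the `account` table and the functions `read_balance` and `withdraw` are invented for illustration. Writes go through a database transaction, so the constraint is enforced by the database, and the cached copy is discarded only after a successful commit.

```python
import sqlite3

# The RDBMS stays the system of record and enforces the rules.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
db.execute("INSERT INTO account (id, balance) VALUES (1, 100)")
db.commit()

cache = {}  # stand-in for an in-memory data grid


def read_balance(account_id):
    # Serve reads from memory when possible; fall back to the database.
    if account_id in cache:
        return cache[account_id]
    row = db.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if row is None:
        return None
    cache[account_id] = row[0]
    return row[0]


def withdraw(account_id, amount):
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the overdraft
    cache.pop(account_id, None)  # drop the stale cached value
    return True
```

The application never has to re-implement the overdraft rule; it only has to remember to invalidate the cache after a committed write.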

To summarize:

  • Data stores are commonly used by Web 2.0 start-ups who develop the application layer coupled with the data layer

  • Data stores do not offer the ACID functionality and vendor support that RDBMS's provide

  • To address the inherent performance and scalability limitations of RDBMS's, in-memory data grids can be used effectively


Comments


Just as an aside:

look at the description of memcache in the link you added and its further links.

In it we are told that database - note the lack of any qualification! - writers do block readers.

In other words: whoever wrote the memcache description has NEVER used Oracle, or they would know that its writers NEVER block readers!

But you can bet these guys have added Oracle into their list of dbs that "don't scale".

And folks wonder why I call these people total idiotic ignorants who are incapable of the most basic logic and informed argumentation?

Hi Noon,

Thanks very much for taking the time to point out the ambiguities in the memcached documentation.

Regarding the scalability of Oracle - this is a matter still open for debate. In our opinion the technical limitations and low ROI mean that a better choice is to leverage the existing RDBMS with an alternative technology, such as in-memory data grids.

I am keen to hear about your personal experience of scaling Oracle and how you set about it.

Hello,

The NoSQL approach is not, per se, contradictory to RDBMS principles. Recently, I read about HadoopDB:

http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-shorter.html

It seems to be a hybrid of PostgreSQL (using MySQL instead is also possible) and Apache Hadoop.

Those who are more interested should take a look at the 12-page paper:
http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

I've just taken a look at MemCached. The rationale behind MemCached is basically that the MySQL (4.x) caching sucks. I guess they are probably right. Anyway, if it is the case, it is the inevitable result of Sturgeon's Law.

But then again, why not try to fix MySQL caching instead of coming up with a solution that requires you to extensively fix your existing programs?

In my case, for example, I don't see how I could ever implement MemCached in the context of a Joomla site. I would have to rewrite core Joomla code all over the place.

Furthermore, an in-memory datagrid is simply an associative array (hash), with exactly the same fundamental problems as an on-disk associative array.

The relational data storage model needs improving, but the proposals currently on the table do not fix the problem. They only seem to make it worse.

Hi Erik,

You raise some interesting points.

The problems that caching technologies address are universal to RDBMS's, not only MySQL, i.e. performance degradation as a result of scalability limitations. Fixing the database's own cache is not enough in most HPC cases. Even the best database caching will fall short of the scalability requirements.

I don't know the specifics of the Joomla framework's software architecture, but if you identify the read queries that hurt the application most, you can simply add hooks in those places to query the cache first. Doing this step by step means that you can achieve dramatic performance increases without having to extensively refactor the code base.
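A minimal sketch of such a hook, in Python rather than Joomla's PHP and with every name invented for illustration (`cached_query`, `load_article` and the dictionary standing in for the cache are not part of any real framework):

```python
import functools


def cached_query(cache):
    """Wrap an existing read query so that the cache is consulted first."""

    def decorator(query_fn):
        @functools.wraps(query_fn)
        def wrapper(*args):
            key = (query_fn.__name__,) + args
            if key in cache:
                return cache[key]  # cache hit: the database is not touched
            result = query_fn(*args)  # cache miss: run the original query
            cache[key] = result
            return result

        return wrapper

    return decorator


cache = {}  # stand-in for a Memcached client or data grid
db_calls = []  # records each trip to the "database" for demonstration


@cached_query(cache)
def load_article(article_id):
    # Hypothetical expensive read query; only this one function is hooked.
    db_calls.append(article_id)
    return {"id": article_id, "title": "Article %d" % article_id}


first = load_article(7)
second = load_article(7)  # served from the cache, no second database call
```

The rest of the code base keeps calling `load_article` exactly as before, which is why the hooks can be added one query at a time. Keeping cached entries fresh when the underlying rows change is deliberately left out of this sketch.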

Regarding data grids, what you write is correct from the developer's point of view. In most cases, developers have to extract data from hash tables in both mechanisms. However, this is not always true. Some data grids use a hash structure underneath but provide a SQL-like language for accessing objects.

The other thing is that, from a scalability point of view, the "fundamental problem" can be something totally different. For me, the fundamental problem of on-disk storage is the hard-drive sector access time required for each operation. Compare this to in-memory data solutions that operate in RAM.

You are right that databases should still be improved, but other solutions, e.g. non-SQL-based ones, are also worth consideration. There are many situations where you simply don't need all the database functionality: ACID, two-phase commit and so on.

If telecommunication guys thought like database guys, then the number of telephones in the world would be limited by the size of the largest switch used to connect them :)

Useful information shared. I am very happy to read this post regarding SQL RDBMS. Thanks for giving us nice info. Fantastic walk-through. I appreciate this post.

Nice information, can use this for my training needs. Please keep on writing!

Very much like the comments of Erik, "If telecommunication guys thought like database guys, then the number of telephones in the world would be limited by the size of the largest switch used to connect them" nice thinking. Love the blog by the way.

Will use this for my tutorials and training :-)
