To shard or not to shard
[Contributed by Mateusz Kyc]
The IT industry is one big word generator. Better yet, it is a meaning generator. There are words or acronyms to describe every single piece of functionality, but at the same time one word can describe a bunch of different functionalities, or one functionality can be described by two or more different words. Recently I found an old article about the differences between sharding and partitioning in general. That article tries to describe sharding as partitioning. It is not my aim to judge that here, but I would like to write a couple of sentences about the use of sharding.
How does it work?
Sharding is simply the division of a big set of data into many small packs. Databases are the classic specimen: you divide your old big database into many small databases, each located on a separate machine. You then rely on an application-level layer to decide which "shard" a query should be sent to (this is a small difference from partitioning, where the decision is made at the database level).
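To make the "application decides" part concrete, here is a minimal sketch of application-level shard routing. The shard names and the key format are hypothetical; any stable hash of the key works, as long as every part of the application uses the same function.

```python
# Minimal sketch of application-level shard routing: the application,
# not the database, picks which shard a given key belongs to.
import hashlib

# Hypothetical shard identifiers - in practice these would be
# connection strings for separate database servers.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Map a key (e.g. a user id) to one shard deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for the same key goes to the same shard.
print(shard_for("user:42"))
```

The modulo scheme is the simplest possible choice; real deployments often use consistent hashing or a lookup table so that adding a shard does not remap every key.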
When to use it?
Always! That was a joke - do not do it. Sharding can be useful in many cases, but sometimes it is better not to use it. Let's focus on two scenarios today.

In the first scenario, consider a social website with an image upload option, where a user can upload a big set of photos at once. A classic master-slave database architecture will fail sooner or later, because it is designed to handle a huge number of select queries. All insert statements go through the master server, so it will become the performance bottleneck. We can use a sharding architecture here to send each insert to a separate database, so users will not suffer from uploads that take ages. The application simply has to decide which database each insert should go to, and that's it.

The second scenario describes a search problem. Take a company with an enormous set of historical data, stored in a single database, and an analyst who wants to analyze it. The analyst has created a quite advanced algorithm to find interesting data. He has even translated this algorithm into some freakishly complicated PL/SQL procedure, started it, and ... now he can wait until next year. Such situations happen quite often. Even with the best partitioning and a rocket-science, advanced, SINGLE database, you just cannot do this. You could invest millions of dollars in a brand new shiny high-end database cluster, but why do that when you can use your department's PCs as a grid infrastructure? Put a small set of data on each PC and run the search there, again and again. This scenario shows the power of a grid-based sharding infrastructure for searching: at the application level, a query is sent to all nodes or just to some subgroup; each node searches its own small database and returns its results. In a jiffy you have your answer, and you have saved your money - the power plant engineer is the only one who notices that you have been computing something.
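The search scenario above is a scatter-gather pattern, which can be sketched as follows. The in-memory lists stand in for the per-node databases, and the predicate stands in for the analyst's query; a real grid would replace `search_shard` with a remote call to each node.

```python
# Sketch of grid-based sharded search (scatter-gather): fan the query
# out to every shard in parallel, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard_data, predicate):
    """Stand-in for one node searching its own small database."""
    return [row for row in shard_data if predicate(row)]

def scatter_gather(shards, predicate):
    """Send the query to all shards at once and combine their answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, predicate), shards)
    results = []
    for part in partials:
        results.extend(part)
    return results

# Three hypothetical shards, each holding a slice of the data.
shards = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(scatter_gather(shards, lambda x: x > 8))  # → [9, 10, 11]
```

The total work is the same as on a single machine, but the wall-clock time shrinks roughly with the number of nodes, which is exactly why the analyst's year-long procedure becomes feasible.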
Is it used?
Yes. Regarding the first example, Flickr does exactly what I am writing about - this is the way they do it, and there is an article on the web with a detailed description. Regarding the second scenario, Google can serve as an example. They are not sharding databases, but the way they divide data is very similar to sharding from an architectural point of view. A map-reduce operation on a cloud/grid/you-name-it architecture is one of the best patterns for solving the "big search" scenario.
The industry faces different problems, but sooner or later every company will need to work on big data sets or will need more compute power. It is a smart move to introduce scalable infrastructure elements in your company, and database sharding on a grid infrastructure can be a first step in the right direction.