Sharding and Replication have become important tool in today’s application database design. We have couple of ongoing projects where we are working with MongoDB NoSQL database. Hence my blog is going to talk about how these concepts are implemented in MongoDB NoSQL database.
What is Replication?
As the name suggests, replication is keeping multiple copies of data on multiple servers to recover quickly from hardware failure. With multiple copies of data, you can increase database availability in case any single server fails. Depending on the application’s requirement, we can configure to increase read capacity, send read and write operations to different servers.
Replication in MongoDB
Replication can be compared to traditional master/slave setup. In MongoDB, a replica set is a group of instances (primary and secondary) that host the same data set. Primary (master) instance accepts all write operations from clients. Similar write operation is then carried out on secondary (slaves) instances using same data set.
By default client reads from primary instance. But we can configure read preference to send read request to secondary instances.
Replication has also provision to ensure that if its primary instance fails; one of the secondary instances will be elected as a primary instance.
What is Sharding?
Sharding can be compared to partitioning. When data size increases, a single server is not sufficient to store data. You may also face performance issues by not providing acceptable read and write throughput.
Traditional approach of handling scaling is to add more CPUs and RAMs. But it is expensive and has practical limitations.
Sharding, on the other hand, is nothing but horizontal scaling. It is a process of distributing data across multiple servers, or shards.
Sharding in MongoDB
We have to first understand the concept called Sharded cluster. When you walk into a library, you hand over your request to the librarian. Librarian checks his system and he exactly knows where to find the book you are looking for. Sharded Cluster works exactly like this.
Sharded cluster has shards, query routers and config servers. Shards store datasets, query routers interface between client application call and its job is to find the required data and pass it back to the client application. Config server stores the metadata that tells query router which dataset belong to which shard. Query router uses this metadata to get data from specific shard and pass it back to client application. Most sharded cluster has many query routers.
How data is distributed across shards?
In MongoDB, collection is like a table. Individual rows in a table are called document. In traditional database, we partition data based on certain unique key. Similarly MongoDB distributes data, or shards, at the collection (table) level and data is partitioned by the shard key. Shard Key is based on an indexed key that exists in every document in the collection. MongoDB uses either range based partitioning and hash based partitioning to divide sharded keys. You can read more about it in MongoDB technical documentation.
When is the right time for Sharding/ Replication?
All production deployments should use replication to provide high availability and disaster recovery.
When your data size is small then begin with an unsharded deployment. Configuring sharding is easy hence configuring sharding while your data set is small is an additional overhead without any additional performance benefits.