NoSQL Series - Distribution Models

The primary driver of interest in NoSQL has been its ability to run databases on a large cluster. As
data volumes increase, it becomes more difficult and expensive to scale up—buy a bigger server to
run the database on. A more appealing option is to scale out—run the database on a cluster of servers.
Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for
distribution.

Depending on your distribution model, you can get a data store that will give you the ability to
handle larger quantities of data, the ability to process greater read or write traffic, or more
availability in the face of network slowdowns or breakages. These are often important benefits, but
they come at a cost. Running over a cluster introduces complexity—so it’s not something to do unless
the benefits are compelling.

Broadly, there are two paths to data distribution: replication and sharding. Replication takes the
same data and copies it over multiple nodes. Sharding puts different data on different nodes.
Replication and sharding are orthogonal techniques: You can use either or both of them. Replication
comes in two forms: master-slave and peer-to-peer. We will now discuss these techniques, starting
at the simplest and working up to the more complex: first single-server, then master-slave replication,
then sharding, and finally peer-to-peer replication.

Single Server

The first and the simplest distribution option is the one we would most often recommend—no
distribution at all. Run the database on a single machine that handles all the reads and writes to the
data store. We prefer this option because it eliminates all the complexities that the other options
introduce; it’s easy for operations people to manage and easy for application developers to reason
about.

Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can
make sense to use NoSQL with a single-server distribution model if the data model of the NoSQL
store is more suited to the application. Graph databases are the obvious category here—these work
best in a single-server configuration. If your data usage is mostly about processing aggregates, then a
single-server document or key-value store may well be worthwhile because it’s easier on application
developers.

For the rest of this chapter we’ll be wading through the advantages and complications of more
sophisticated distribution schemes. Don’t let the volume of words fool you into thinking that we
would prefer these options. If we can get away without distributing our data, we will always choose
a single-server approach.

Sharding

Often, a busy data store is busy because different people are accessing different parts of the dataset.
In these circumstances we can support horizontal scalability by putting different parts of the data onto
different servers—a technique that's called sharding.

In the ideal case, we have different users all talking to different server nodes. Each user only has to
talk to one server, so gets rapid responses from that server. The load is balanced out nicely between
servers—for example, if we have ten servers, each one only has to handle 10% of the load.
Of course the ideal case is a pretty rare beast. In order to get close to it we have to ensure that data
that’s accessed together is clumped together on the same node and that these clumps are arranged on
the nodes to provide the best data access.

The first part of this question is how to clump the data up so that one user mostly gets her data from
a single server. This is where aggregate orientation comes in really handy. The whole point of
aggregates is that we design them to combine data that’s commonly accessed together—so aggregates
leap out as an obvious unit of distribution.
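As a concrete illustration, the idea of using the aggregate as the unit of distribution can be sketched with a simple hash-based placement function. This is a minimal sketch, not any particular database's implementation; the node names and aggregate key are hypothetical.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members

def shard_for(aggregate_key: str) -> str:
    """Map an aggregate's key to a node, so that everything inside
    one aggregate (e.g., one customer's order) lives together."""
    digest = hashlib.sha256(aggregate_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every access for the same aggregate is routed to the same node,
# so a user's common queries can be served by a single server.
assert shard_for("customer-1234") == shard_for("customer-1234")
```

Because the whole aggregate hashes to one node, reads that stay within an aggregate never need to touch more than one server.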

When it comes to arranging the data on the nodes, there are several factors that can help improve
performance. If you know that most accesses of certain aggregates are based on a physical location,
you can place the data close to where it’s being accessed. If you have orders for someone who lives
in Boston, you can place that data in your eastern US data center.
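A location-aware placement rule might look like the following sketch. The region names and data center identifiers are made up for illustration; the point is only that the routing decision can be driven by where the aggregate is usually accessed.

```python
# Hypothetical mapping from a customer's region to a data center,
# so aggregates are stored near where they are usually accessed.
DATA_CENTERS = {
    "US-East": "dc-boston",
    "US-West": "dc-oregon",
}

def placement(customer_region: str) -> str:
    """Choose a data center for a customer's aggregates, falling
    back to a default when the region is unknown."""
    return DATA_CENTERS.get(customer_region, "dc-boston")
```

A Boston customer's orders would then land in the eastern data center, keeping access latency low for the common case.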

Another factor is trying to keep the load even. This means that you should try to arrange aggregates
so they are evenly distributed across the nodes, with each node getting an equal share of the load.
This may vary over time—for example, some data tends to be accessed on certain days of the week—so
there may be domain-specific rules you'd like to use.

In some cases, it’s useful to put aggregates together if you think they may be read in sequence. The
Bigtable paper [Chang et al.] described keeping its rows in lexicographic order and sorting web
addresses based on reversed domain names (e.g., com.martinfowler). This way data for multiple
pages could be accessed together to improve processing efficiency.
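The reversed-domain trick from the Bigtable paper is easy to demonstrate. The helper below is an illustrative sketch (the function name is ours, not Bigtable's): reversing the host portion makes pages from the same site sort next to each other lexicographically.

```python
def row_key(url_host: str, path: str = "") -> str:
    """Build a Bigtable-style row key: reverse the domain components
    so rows for the same site cluster together when sorted."""
    reversed_host = ".".join(reversed(url_host.split(".")))
    return reversed_host + path

# "com.martinfowler" sorts immediately before "com.martinfowler.blog",
# so all pages of a site can be scanned as one contiguous range.
```

With unreversed hosts, `blog.martinfowler.com` would sort far away from `martinfowler.com`; reversing the components keeps a site's rows in one contiguous range that a sequential scan can pick up efficiently.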

Historically most people have done sharding as part of application logic. You might put all
customers with surnames from A to D on one shard and E to G on another. This complicates
the programming model, as application code needs to ensure that queries are distributed across the
various shards. Furthermore, rebalancing the sharding means changing the application code and
migrating the data. Many NoSQL databases offer auto-sharding, where the database takes on the
responsibility of allocating data to shards and ensuring that data access goes to the right shard. This
can make it much easier to use sharding in an application.
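The surname-range scheme described above might be coded in the application like this (the shard names and range boundaries are illustrative). Note how the mapping lives in application code, which is exactly why rebalancing such a scheme means a code change plus a data migration.

```python
import bisect

# Hypothetical application-level sharding: upper bounds (inclusive)
# of each surname range, in sorted order, with one shard per range.
RANGE_BOUNDS = ["D", "G", "Z"]
SHARDS = ["shard-1", "shard-2", "shard-3"]

def shard_for_surname(surname: str) -> str:
    """Route a customer record by the first letter of the surname:
    A-D -> shard-1, E-G -> shard-2, H-Z -> shard-3."""
    first_letter = surname[0].upper()
    # bisect_left finds the first range whose bound covers the letter.
    return SHARDS[bisect.bisect_left(RANGE_BOUNDS, first_letter)]
```

Auto-sharding databases take exactly this kind of routing decision out of the application, handling both the allocation of data to shards and the redirection of requests.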

Sharding is particularly valuable for performance because it can improve both read and write
performance. Using replication, particularly with caching, can greatly improve read performance but
does little for applications that have a lot of writes. Sharding provides a way to horizontally scale
writes.

Sharding does little to improve resilience when used alone. Although the data is on different nodes,
a node failure makes that shard’s data unavailable just as surely as it does for a single-server
solution. The resilience benefit it does provide is that only the users of the data on that shard will
suffer; however, it’s not good to have a database with part of its data missing. With a single server
it’s easier to pay the effort and cost to keep that server up and running; clusters usually try to use less
reliable machines, and you’re more likely to get a node failure. So in practice, sharding alone is
likely to decrease resilience.

Despite the fact that sharding is made much easier with aggregates, it’s still not a step to be taken
lightly. Some databases are intended from the beginning to use sharding, in which case it’s wise to
run them on a cluster from the very beginning of development, and certainly in production. Other
databases use sharding as a deliberate step up from a single-server configuration, in which case it’s
best to start single-server and only use sharding once your load projections clearly indicate that you
are running out of headroom.

In any case the step from a single node to sharding is going to be tricky. We have heard tales of
teams getting into trouble because they left sharding to very late, so when they turned it on in
production their database became essentially unavailable because the sharding support consumed all
the database resources for moving the data onto new shards. The lesson here is to use sharding well
before you need to—when you have enough headroom to carry out the sharding.

Reference:
Source: http://bigdata-ir.com/wp-content/uploads/2017/04/NoSQL-Distilled.pdf
