How does Read-Repair Work in Cassandra?


Read-repair is a lazy mechanism in Cassandra that ensures the data you request from the database is accurate and consistent. In this post, we’ll discuss how it works and how to tune it for production systems.

Some Background

Cassandra is a tunably consistent system that prioritizes availability over consistency. This means that servers can go down and datacenters can go offline, but the database will continue to function. While a server or datacenter is down, the remaining nodes in the cluster store inserts, updates and deletes in a journal of “hints” (a mechanism known as hinted handoff). When the node or datacenter comes back online, those mutations stream over and everything is hunky dory. But what if something happens and a mutation gets dropped?

When performing a read, the client asks any node in the cluster to fetch a row. We call this node the “coordinator”. The coordinator looks at the primary key for the row the client requested and figures out which nodes in the cluster are supposed to have the data. At consistency level ONE the coordinator will choose the replica with the lowest system load and send it a request for the data. This works great if all replicas have the same data, but not so much if you happen to choose the node that missed an update.
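For example, a read at consistency level ONE from cqlsh might look like this (the keyspace, table and column names here are placeholders, not from any real schema):

    CONSISTENCY ONE;
    SELECT * FROM shop.recommendations WHERE user_id = 42;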

Hello Read-Repair

A database that serves old data isn’t very useful, so to make sure that doesn’t happen Cassandra has three mechanisms that kick in to detect and correct inconsistencies. For a randomly selected portion of reads, Cassandra will return the data as usual, but it will also kick off a background request that checks the consistency of that data across all the replicas in the cluster.

Instead of sending over and comparing the raw data, each replica computes a hash of its copy and returns it to the coordinator. If the hashes are the same then the record is consistent and everything is fine. If they disagree, the nodes exchange data and the value with the latest timestamp wins. This process is called read-repair.

How Much and How Often?

Cassandra has two configuration settings that control how often read-repair takes place. The first is dclocal_read_repair_chance; this setting controls how often read-repair happens for requests in the local datacenter. You can set it on a per-table basis, and it takes a value between 0.0 (never) and 1.0 (always). Read-repair isn’t free because it requires reading the values on each replica to compute the hashes, so you typically don’t want to set it above the default 0.1 (10%).
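As a rough sketch, you can tune it per table from cqlsh like this (the keyspace and table names are placeholders, and this table option only exists in Cassandra versions before 4.0, where it was removed):

    ALTER TABLE shop.recommendations
        WITH dclocal_read_repair_chance = 0.1;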

What About Cross-Datacenter Repair?

Cassandra also has another configuration variable called read_repair_chance that will kick off the read-repair process for all nodes in the cluster across all datacenters. This was historically also set to 0.1 but now defaults to 0.0.
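It is set the same way as the datacenter-local version; again, a placeholder table name and a pre-4.0 cluster are assumed:

    ALTER TABLE shop.recommendations
        WITH read_repair_chance = 0.0;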

A couple of years ago, I was helping a major retailer with their real-time recommendation system, which was having latency issues. They were under-provisioned and had some other issues, but while diagnosing the problem, we noticed that a fair bit of CPU time was spent computing hashes for their large partitions. We changed their read_repair_chance to 0.0 and were able to eliminate this load. We decided to make it the default because most people who read at consistency level ONE or LOCAL_ONE are latency sensitive, and at higher consistency levels another consistency mechanism kicks in.

What if I’m Not Reading at ONE?

If you’re reading at a consistency level higher than ONE, the coordinator is already fetching data from multiple replicas and comparing it to see if the values agree. If they don’t agree, the replicas exchange data so that everyone ends up with the most recent value before the result is returned to the client.
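The same read shown earlier only differs by the consistency level you set in cqlsh; the schema is the same placeholder as above:

    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM shop.recommendations WHERE user_id = 42;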

If I Read at QUORUM or LOCAL_QUORUM Am I Safe?

Yup! If you read at QUORUM or LOCAL_QUORUM you’re already checking for consistency on every read, so you don’t have to worry about read-repair.

What if I Really Want to Read at ONE?

By using ONE you’re already acknowledging that you’re prioritizing latency over consistency; you can’t always have both. If that’s acceptable for your application then the default of 0.1 for dclocal_read_repair_chance is probably correct for most people. Cross-datacenter read-repair is a little harder to give a general recommendation for, but if you want to use the feature I would also recommend 0.1 or lower. Of the hundreds of clusters I’ve worked on, I’ve only ever heard of one deployment needing a higher value.

How Do I Know it’s Working?

Cassandra tracks the number of times it performs certain activities and keeps a running tally since the process started. These stats are available over JMX, but you can also view them using the included nodetool command; the count shows up under the ReadRepairStage stage.
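For example, one quick way to check is via nodetool’s thread-pool stats (the exact column layout varies between Cassandra versions):

    nodetool tpstats | grep ReadRepairStage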

Wrapping Up

Read-repair is one of the three consistency mechanisms that Cassandra uses to make sure that the data in your cluster is consistent and accurate. When it’s used with hints, tunable consistency and active repair it’s possible to build a reliable and highly available system that can survive hardware failure or major datacenter outages.

Learn More

This post is part of an ongoing series; if you are interested in learning more, sign up for our newsletter. Or if you are building a large-scale distributed system don’t be afraid to contact us, we’d be happy to help.