At Vorstella we’re building an AI expert that helps teams run distributed systems at scale. A while back our team was passing around some research papers about autonomous databases and optimizing system configurations, and they grabbed our attention. This post walks through how we used Bayesian Optimization to build an auto-tuning system for Cassandra and achieved a 43% latency reduction and an 80% throughput increase over documented best practices.
Performance tuning in Cassandra
Cassandra comes with dozens of tuning knobs and is used in hundreds of different use cases with high variance in data models and access patterns. This makes performance tuning difficult. In an ideal world, we’d be able to figure out how Cassandra performs under each setting one time, map a performance function surface, and pick the highest point representing the set of configuration option values that yield the best performance (see Figure 1).
But the performance function for Cassandra is highly sensitive to variations in workload, hardware, and data model. Small differences in these inputs can drastically change the behavior of the function and turn what was an optimal configuration into a performance headache. We always knew that simple calculations were too simplistic to be accurate, but never had time to explore all the configs for each workload to find the best one. The latest research suggested Bayesian Optimization was a good candidate to help solve this uncertainty and find an optimal configuration with just a few observations.
Birds-eye view of Bayesian Optimization
We’ll try to break down the relevant parts of what’s attractive about Bayesian Optimization and leave the rest for further reading. Let’s take a 2-dimensional case, where we are trying to predict the value of a continuous variable Y (e.g. expected P99 latencies) given the value of an input variable X (e.g. the setting value for concurrent_reads). One way to do this is to guess the function f(X) that directly explains the relationship between X and Y. Or put in configuration terms, a function that lets us plug in a number for concurrent_reads and see what P99 latency to expect.
Bayesian Optimization assumes that we don’t know what the objective function f(X) looks like at first, but that we know enough about the problem space to construct a “prior distribution” over plausible representations of f(X). It then starts choosing new values of X to try, deciding when to explore (try new values of X to learn more about f(X)) and when to exploit (revisit promising values of X to confirm its understanding of f(X)). Every data point it sees narrows the set of functions it believes to be the likeliest representations of f(X), until it knows enough about f(X) to be confident in its choice of a global optimum value for X. Bayesian Optimization is good at efficiently finding global optima, and it performs well with a limited number of observations.
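To make the explore/exploit loop concrete, here is a minimal, self-contained sketch of Bayesian Optimization in one dimension: a Gaussian-process surrogate plus an expected-improvement acquisition function, minimizing a toy quadratic that stands in for “P99 latency as a function of one config knob.” The objective and all constants here are illustrative, not our production setup.

```python
import numpy as np
from math import erf

def objective(x):
    # Toy stand-in for the expensive objective (e.g. P99 latency vs. one
    # normalized knob). The optimizer never sees this formula, only samples.
    return (x - 1.3) ** 2

def rbf(a, b, length=0.3):
    # Squared-exponential kernel: nearby x values have correlated outputs.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-6):
    # Gaussian-process posterior mean and std dev over a grid of candidates.
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_inv = np.linalg.inv(K)
    Ks = rbf(x_obs, x_grid)
    mu = Ks.T @ K_inv @ y_obs
    # Prior variance is 1.0 (kernel diagonal); subtract what the data explains.
    var = 1.0 - np.einsum("ij,ik,kj->j", Ks, K_inv, Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    # EI (for minimization) trades off a low predicted mean (exploit)
    # against high uncertainty (explore).
    z = (best_y - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1 + np.array([erf(v / np.sqrt(2)) for v in z]))
    ei = (best_y - mu) * cdf + sigma * pdf
    return np.where(sigma > 1e-9, ei, 0.0)

x_grid = np.linspace(0.0, 2.0, 201)
x_obs = np.array([0.1, 1.0, 1.9])       # a few seed observations
y_obs = objective(x_obs)
for _ in range(10):
    mu, sigma = gp_posterior(x_obs, y_obs, x_grid)
    # Evaluate the candidate with the highest expected improvement.
    x_next = x_grid[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmin(y_obs)]
print(f"best x ≈ {best_x:.2f}, f(best x) ≈ {y_obs.min():.4f}")
```

Note how few evaluations this needs: three seed points plus ten guided trials are enough to land near the true minimum at x = 1.3, which is the whole appeal when each “evaluation” is a multi-hour load test.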
The experiment

We deployed our experiment clusters as StatefulSets in Kubernetes (we used GKE). The setup was as follows:
- Cluster: 1-node cluster, Apache Cassandra™ version 2.2.12
- Hardware: 4000M CPU, 32GB memory, and 1 500GB SSD storage volume
- Data model: Simple K/V data model with 8MM unique, 1KB partitions
- Load: 80% reads / 20% writes, cassandra-stress workload with 8 threads
- Fixed settings (set by us): JVM max_heap_size = 16GB
- Variable settings (changed by the algorithm): Memtable cleanup threshold, Concurrent writes, Concurrent reads, Memtable flush writers, Memtable allocation type.
- Observation period length: 920 minutes
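The five variable knobs all live in cassandra.yaml. For reference, a fragment showing them together (the values below are illustrative placeholders for the ranges the algorithm searched, not recommendations; 0.11 is the memtable_cleanup_threshold the optimizer converged on in our runs, discussed below):

```yaml
# cassandra.yaml fragment — the five knobs the optimizer varied.
# Values shown are examples, not tuning advice.
memtable_cleanup_threshold: 0.11    # ratio of memtable space that triggers a flush
concurrent_writes: 32
concurrent_reads: 32
memtable_flush_writers: 2
memtable_allocation_type: heap_buffers   # heap_buffers | offheap_buffers | offheap_objects
```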
We ran a series of 22 load tests, each time grabbing a config from the algorithm, running it, and reporting back the observed P99 latencies. We stopped when the model kept generating very similar recommendations, indicating convergence.
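In sketch form, the loop looked something like the following. Everything here is a hypothetical stand-in: in the real experiment, suggest_config called the Bayesian Optimization model, run_load_test drove cassandra-stress against the cluster, and the convergence check compared recent recommendations rather than latencies.

```python
import random

random.seed(42)

def suggest_config(history):
    # Stand-in for the optimizer: the real system proposed configs based on
    # all prior (config, latency) observations, not at random.
    return {"concurrent_reads": random.choice([16, 32, 64, 128]),
            "memtable_cleanup_threshold": round(random.uniform(0.05, 0.5), 2)}

def run_load_test(config):
    # Stand-in for a cassandra-stress run; returns a fake "P99 latency" in ms.
    return 50 + 100 * config["memtable_cleanup_threshold"]

def converged(history, window=5, tolerance=1.0):
    # Stop once the last few observations are nearly identical.
    if len(history) < window:
        return False
    recent = [p99 for _, p99 in history[-window:]]
    return max(recent) - min(recent) < tolerance

history = []
for trial in range(22):                  # we capped the experiment at 22 runs
    config = suggest_config(history)     # next candidate from the optimizer
    p99 = run_load_test(config)          # run the load test, observe P99
    history.append((config, p99))
    if converged(history):
        break

best_config, best_p99 = min(history, key=lambda pair: pair[1])
```

The structure is the important part: the optimizer sits outside the cluster, and each iteration is just “propose, measure, report back.”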
The following image depicts the results of our experiment:
We calculated “best practice” config values from the open-source documentation (see the config option links above for more info on the calculations) to use as a baseline to compare against.
Looking at metric values over the run helps retroactively explain the results.
1. P99 read latencies (plotted in milliseconds by configuration over each run), our performance metric, were 43% lower under the optimal config.
2. Total throughput (plotted in GB by configuration over each run) was 1.8X higher under the optimal config.
3. Less data (plotted in GB by configuration over each run) accumulated on disk and major compactions happened on a tighter cycle under the optimal config.
4. Fewer SSTables (plotted as a count by configuration over each run) accumulated under the optimal config.
In retrospect this makes sense. Our workload overwrites the same 8MM keys and reads them back out. If we use a low memtable_cleanup_threshold value, then as long as we have enough i/o for compaction to keep up with the increased rate of data files flushing to disk, we can compact out duplicate data frequently and keep total data on disk low. In this case, when memtable_cleanup_threshold is 0.11, we keep data volume so low that most of our data fits in memory, limiting i/o.
And, as we can see, increasing memtable_cleanup_threshold to 0.2 had pretty dramatic effects. Suddenly we flush less frequently, and compact less relative to the amount of data on disk. This lets more duplicates accumulate, and increases the data volume required to support the same use case. Once data spills to disk and i/o introduces a new bottleneck, everything slows down. As data volume increases with the “best practice” config, throughput slows down and P99 read latencies begin to rise, until the next major compaction, where data fits into memory again and the system speeds back up.
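A quick back-of-envelope calculation shows how much earlier the low threshold flushes. This assumes the cassandra.yaml default memtable heap space of 1/4 of the heap (4GB with our 16GB max_heap_size); actual flush behavior also depends on which memtable is largest when the threshold trips.

```python
# Assumption: default memtable space of 1/4 heap (cassandra.yaml default).
heap_mb = 16 * 1024
memtable_space_mb = heap_mb // 4             # 4096 MB with a 16GB heap

for threshold in (0.11, 0.2):
    flush_at_mb = memtable_space_mb * threshold
    print(f"memtable_cleanup_threshold={threshold}: flush at ~{flush_at_mb:.0f} MB")
```

So the 0.11 config starts flushing (and therefore compacting away overwrites) at roughly 450MB of memtable data, versus roughly 820MB at 0.2 — nearly twice as much duplicated data allowed to pile up before each flush.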
Reflections and future work
The above example was admittedly simple: running on one node, changing a few knobs, with very low workload complexity. Changing the workload, the hardware, or the data model would change the behavior of the performance surface, even under the exact same configuration settings. But this is precisely the point. At the outset, the model knew nothing about the system. It just used observations to fill in the gaps in its knowledge, and find a “good enough” solution.
We’re continuing to explore autotuning with more impactful settings like heap size and system resources. Given how sensitive Cassandra performance can be, we figure there’s a fair bit of inefficiency out there resulting from configuration uncertainty, or from teams simply lacking the bandwidth to devote to performance tuning.
This is just part of the work we’re doing to help teams accomplish more with less effort. Our AI expert uses anomaly detection, classification, neural networks, and more to help you to transform the way you manage your data systems. If you’re interested in hearing more, reach out to [email protected], or visit www.vorstella.com and register to receive email updates on our progress.