How a Cassandra performance issue looks to a classification model

Avinash Mandava


Problems in distributed systems like Cassandra usually present themselves as:

  1. Higher latencies
  2. Request failures
  3. Downed nodes

It’s up to us to dig into our alerting and dashboards, parse out the signals, and figure out what’s wrong. In previous posts we’ve talked about using ML to help with certain ops tasks like debugging. Here we’ll show how we approach explainability of model output so operators can go beyond alerting and understand why a problem is happening.

What we want from a root cause analysis model

The goal of our model is to take an externally observable problem (like a latency spike, a dip in throughput, or any anomaly in an indicator) and turn it into

  • A list of probable root causes of that problem
  • For each probable root cause, an explanation of why it is being presented to you

What we supply as operators, domain experts and model trainers are:

  • What externally observable events should be considered “problems” that trigger the model? Some events, like SLA violations, can be defined by the operator, while others are learned by the system over time.
  • What potential root causes could the model map to? We need to define these as labels ourselves to make supervised learning work. This is the act of training the model. Ideally we only need to show the model a new label a few times before it becomes confident in its classifications.

To be useful our model has to do the following well:

  • Coverage: be able to classify most possible root causes
  • Accuracy: classify to the right root cause most of the time
  • Explainability: show which features the model believes were impactful in its classification decision

Coverage and Accuracy are mostly about having enough training data. The more situations we are able to observe in customers’ clusters (or simulate on our own), the more root cause labels we can save (Coverage), and the more certain the model becomes in its decision-making (Accuracy). The focus here will be on how we handle model explainability, with a focus on classification models.

Explaining what’s behind a root cause

We want to take the inputs (features) and outputs (labeled classes) of a classification model, and explain which inputs influenced a specific classification decision. In our example we have one feature per reported metric, and that feature has a value representing how anomalous that metric’s behavior was over the period of analysis.
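As a concrete illustration of turning a raw metric into one such feature, here is a minimal sketch. The robust z-score approach and the function names are assumptions for illustration, not our production pipeline:

```python
import numpy as np

def anomaly_score(series, baseline_len=100):
    """Reduce a metric time series to a single feature: the worst robust
    z-score in the analysis window, measured against a baseline window."""
    baseline = series[:baseline_len]
    med = np.median(baseline)
    mad = np.median(np.abs(baseline - med))
    scale = 1.4826 * mad if mad > 0 else 1e-9  # MAD -> stddev-like scale
    z = np.abs(series[baseline_len:] - med) / scale
    return float(z.max())

# Synthetic example: steady write latency (ms) with an injected spike.
rng = np.random.default_rng(1)
write_latency = rng.normal(10.0, 1.0, 150)
spiked = write_latency.copy()
spiked[120:125] += 30.0  # latency spike inside the analysis window
```

A metric that spiked scores far higher than one that stayed flat, so each metric collapses into one number the classifier can consume.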

We use SHAP values to explain model predictions. SHAP values quantify how much impact a given feature had on a given classification decision. They basically answer the following questions:

  • Which feature values pulled the model towards the class it chose?
  • Which feature values pushed the model away from the class it chose?

By calculating SHAP values from our Random Forest features and output, we can create force diagrams like the one below, which help us visually explain results:
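To make the mechanics concrete, here is a minimal sketch that computes exact Shapley values for a small Random Forest by brute-force coalition enumeration. The feature names and synthetic data are assumptions for illustration; in practice a library such as `shap` (with its tree explainer) does this far more efficiently:

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["write_latency_anomaly", "cas_timeout_anomaly", "l0_sstable_anomaly"]
n = len(features)

# Synthetic training set: class 1 ("lwt_abuse") tends to have high
# write-latency and CAS-timeout anomaly scores; feature 2 is irrelevant.
X = rng.random((200, n))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

background = X.mean(axis=0)  # reference values for "absent" features

def f(x, subset):
    """Class-1 probability with only `subset` features taken from x."""
    z = background.copy()
    z[list(subset)] = x[list(subset)]
    return model.predict_proba(z.reshape(1, -1))[0, 1]

def shapley(x):
    """Exact Shapley value per feature: weighted average marginal
    contribution over all coalitions of the other features."""
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (f(x, S + (i,)) - f(x, S))
    return phi

x = np.array([0.9, 0.8, 0.1])  # anomalous writes and CAS ops, quiet L0
phi = shapley(x)
```

Positive values pull the model toward the predicted class and negative values push it away, which is exactly what the red and blue chevrons in a force diagram encode.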


Red chevrons show us what feature values pulled the model towards its prediction. Blue chevrons tell us what pushed the model away from its prediction. We can also plot our most impactful features in impact plots like this:


Let’s look at two examples of performance issues in Cassandra, both of which presented as latency spikes and throughput dips, but were classified and explained differently by the model:

Case 1: Excessive use of lightweight transactions


This plot indicates that the following information is causing the model to classify this performance problem as excessive use of lightweight transactions:

  • Sudden spikes in write latency
  • Timeouts and failures in compare-and-set operations

Case 2: A bad case for leveled compaction


This plot indicates that the following information is causing the model to classify this performance problem as a poor use of leveled compaction:

  • How often memtables flush to disk
  • How many sstables are in L0
  • Average partition size vs p95 partition size

Areas for enhancement

If you’ve run Cassandra (or any distributed system) before, this should all make a lot of sense. An ML approach can filter out what doesn't matter, highlight what does, and map that information to actionable advice. Given this short list of important metrics, you can jump straight to detailed visualizations in your dashboarding tools to verify the diagnosis and see more nuanced details. (We do this ourselves when we train our model on situations it hasn't seen before; that's how it learns from every new situation, and how we engineer new features.) Overall, this is a lot easier than parsing through hundreds of graphs like the ones below, 90% of which won’t be relevant to your problem:


Of course, this approach has a lot of areas for improvement. Here are a few enhancements we plan to make to help people grok our model’s output more easily:

  • Making feature values understandable. Lots of information is hidden in the anomaly score for a metric.
  • Visualization of important metrics. We currently take all the features that are impactful, find the corresponding metric, and plot it over the time period of interest. This quickly shows you where the source of problems occurred. Highlighting the specific portions of the time series that were relevant to the model could save operators even more time.
  • We’re always adding new root cause labels and improving accuracy, but those are perpetual areas for model improvement.

Subscribe for weekly updates, or reach out to us if you’re interested in seeing what your dashboards aren’t telling you.

Subscribe to our newsletter