Problems in distributed systems like Cassandra usually present themselves as externally visible symptoms: latency spikes, throughput dips, or anomalies in some other key indicator.
It’s up to us to dig into our alerting and dashboards, parse out the relevant signals, and figure out what’s wrong. In previous posts we’ve talked about using ML to help with certain ops tasks like debugging. Here we’ll show how we approach explainability of model output so operators can go beyond alerting and understand why a problem is happening.
The goal of our model is to take an externally observable problem (like a latency spike, a dip in throughput, or any anomaly in an indicator) and turn it into a root cause paired with actionable advice.
What we supply as operators, domain experts and model trainers are:
To be useful, our model has to do the following well: Coverage, Accuracy, and Explainability.
Coverage and Accuracy are mostly about having enough training data. The more situations we are able to observe in customers’ clusters (or simulate on our own), the more root cause labels we can save (Coverage), and the more certain the model becomes in its decision making (Accuracy). The rest of this post is about how we handle Explainability, specifically for classification models.
We want to take the inputs (features) and outputs (labeled classes) of a classification model, and explain which inputs influenced a specific classification decision. In our example we have one feature per reported metric, and that feature has a value representing how anomalous that metric’s behavior was over the period of analysis.
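To make that concrete, here is a minimal sketch of the setup. The metric names, anomaly scores, and root cause labels below are hypothetical stand-ins for illustration; in practice the anomaly scores come from whatever anomaly detection produces them, and the labels from incidents that have already been diagnosed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One row per diagnosed incident, one column per reported metric.
# Each value is an anomaly score for that metric over the analysis window
# (hypothetical metric names and numbers, for illustration only).
X = pd.DataFrame(
    {
        "cas_contention":      [0.92, 0.05, 0.11],
        "sstables_per_read":   [0.10, 0.88, 0.07],
        "pending_compactions": [0.15, 0.81, 0.09],
        "client_latency_p99":  [0.95, 0.90, 0.12],
    }
)

# Root cause label assigned by an operator for each incident (hypothetical).
y = ["lwt_misuse", "lcs_misuse", "healthy"]

# The classifier we want to explain later with SHAP.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
```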
We use SHAP values to explain model predictions. SHAP values quantify how much impact each feature had on a given classification decision. In practice they answer questions like: which feature values pushed the model toward this prediction, which pushed it away, and by how much?
By calculating SHAP values from our Random Forest features and output, we can create force diagrams like this which help us visually explain results:
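As a rough sketch of how a force diagram like that can be generated with the shap package, continuing the toy model above (this assumes a shap version where `shap_values` returns one array per class for multi-class models; the class and row chosen are arbitrary):

```python
import shap

# TreeExplainer works directly on tree ensembles like Random Forests.
explainer = shap.TreeExplainer(model)

# For a multi-class forest this returns one array of SHAP values per class.
shap_values = explainer.shap_values(X)

# Explain the first incident's score for the "lwt_misuse" class.
class_idx = list(model.classes_).index("lwt_misuse")
shap.force_plot(
    explainer.expected_value[class_idx],  # the model's base rate for this class
    shap_values[class_idx][0],            # per-feature contributions for row 0
    X.iloc[0],                            # the feature values being explained
    matplotlib=True,                      # render with matplotlib instead of JS
)
```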
Red chevrons show us which feature values pulled the model towards its prediction. Blue chevrons tell us what pushed the model away from its prediction. We can also plot our most impactful features in impact plots like this:
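The impact plot comes from the same SHAP values; a one-line sketch under the same assumptions as above:

```python
# Global view: which features have the largest impact on the "lwt_misuse"
# class across all incidents in the (toy) training set.
shap.summary_plot(shap_values[class_idx], X)
```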
Let’s look at two examples of performance issues in Cassandra, both of which presented as latency spikes and throughput dips, but were classified and explained differently by the model:
This plot indicates that the following information is causing the model to classify this performance problem as a poor use of lightweight transactions:
This plot indicates that the following information is causing the model to classify this performance problem as a poor use of leveled compaction:
If you’ve run Cassandra (or any distributed system) before, this should all make a lot of sense. An ML approach can filter out what doesn’t matter, highlight what does, and map that information to actionable advice. Given this list of important metrics, you can even jump straight to detailed visualizations in your dashboarding tools to verify and see more nuanced details (we do this when we train our model on situations it hasn’t seen before; that’s how it learns from every new situation we see, and how we engineer new features). Overall, it’s a lot easier than parsing through hundreds of graphs like the ones below, 90% of which won’t be relevant to your problem:
Of course, this approach still leaves plenty of room for improvement. Here are a few enhancements we plan to make to help people grok our model’s output more easily:
Subscribe for weekly updates, or reach out to us if you’re interested in seeing what your dashboards aren’t telling you.