We’ve spent our careers helping enterprises adopt, run, and scale open-source tech like Cassandra, Kafka, Spark, and Elasticsearch to make the most of their data. We’re taking our years of domain knowledge and combining it with machine learning to create an AI assistant that helps operators run data platform technologies at scale. This post breaks down the opportunities that exist to apply AI to the management of distributed systems, with a focus on root cause analysis.
As an illustration, let’s dig into what it takes to manage a distributed system, and why central teams have a hard time scaling to support demand. When we’re managing a distributed system deployment we’re given a few things up front:
We’re supposed to use our knowledge and the tools we’re given to meet that deployment’s performance and availability targets. This inevitably involves the following critical steps:
Going from raw metrics to useful signals can be really easy (basic thresholds and simple rules) or really complicated (dynamic anomaly detection on thousands of time series features). Most of us use our monitoring tools to codify our past knowledge about what metrics mean and what they indicate. We create alerts and graphs and use those to interpret situations. But we don’t consider the signals or situations we haven’t seen before, so we’re constantly walking the line between ignoring possibly critical events and drowning in alert noise.
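To make the two ends of that spectrum concrete, here is a minimal sketch that contrasts a fixed threshold with a simple trailing-window z-score check. The function names, window size, limits, and sample latency values are all assumptions made for illustration. The z-score check is a deliberately basic stand-in for "dynamic anomaly detection", not a description of any particular production detector.

```python
from statistics import mean, stdev

def threshold_alerts(series, limit):
    """Return indices where the value exceeds a fixed limit (a basic rule)."""
    return [i for i, v in enumerate(series) if v > limit]

def zscore_alerts(series, window=5, z_limit=3.0):
    """Return indices where a value deviates strongly from its trailing window."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows to avoid dividing by zero.
        if sigma > 0 and abs(series[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts

# Hypothetical read latencies in milliseconds, with one spike at index 7.
latency_ms = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10]

static = threshold_alerts(latency_ms, limit=50)  # the codified rule never fires
dynamic = zscore_alerts(latency_ms)              # the spike stands out from recent history
```

With this sample data the fixed 50 ms threshold stays silent, while the trailing-window check flags the spike. Lowering the threshold would catch the spike too, at the cost of more noise on busier systems, which is the trade-off described above.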
Next, we have to figure out how to interpret combinations of events to form opinions. This becomes even more urgent when our performance and availability targets are at risk. Without deep expertise in operating these systems, every issue can be a showstopper. It takes years to learn enough to easily correlate all the signals into a well-formed opinion. And when you have to do this for dozens of deployments across several technology stacks, you quickly get overwhelmed.
The approach we take to help operators manage the variety and complexity of their data platforms involves combining domain knowledge with machine learning on telemetry data. It’s impossible to separate the signal from the noise manually, and our systems aim to surface what’s important and impactful to your SLA targets, so you can avoid digging through dashboards and get right to the information you need to resolve the issue.
One of the things we use ML for is helping determine the root cause of an SLA violation (also known as “fault localization”). In this section we’ll focus on this problem and go over the phases of data transformation and the types of algorithms we use at each stage. A simplified breakdown of the flow of information is shown below:
Imagine you get a phone call from a developer or an alert from your APM telling you that a database you’re responsible for has violated the established read response SLA. Now let’s walk through how an AI assistant would turn that alert into an explanation of what’s happening, using the pipeline pictured above.
The first thing an expert would do is look at the alerts generated, key indicators, and important “changes” around the time the issue occurred. This essentially creates a list of interesting things that happened in the system around the time responses slowed down.
Extracting events from raw time series data is a common problem, with extensive literature and applications in industries from APM to medical devices. By feeding time series data into anomaly detection algorithms, we can:
We run this feature extraction continuously on thousands of time series metrics, scanning for several types of anomalies or strange behaviors. When we see an event like an SLA violation, we pull all events surrounding the violation into a vector representing the state of the system. (We apply dimensionality reduction to the resulting vector, but that’s beyond the scope of this post.) The vector of events associated with the SLA violation is the input to our next processing layer: root cause classification.
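As a rough sketch of that step, the snippet below collects the events that fall inside a time window around a violation and counts them into a fixed-order vector. The event names, the window sizes, and the `(timestamp, name)` event shape are hypothetical, chosen only to illustrate the idea. Dimensionality reduction is left out.

```python
from collections import Counter

# Hypothetical event types, in a fixed order so every vector has the same layout.
EVENT_TYPES = [
    "disk_latency_spike",
    "gc_pause_long",
    "thread_pool_blocked",
    "read_rate_shift",
]

def event_vector(events, violation_ts, before=300, after=60):
    """Count events per type within [violation_ts - before, violation_ts + after]."""
    counts = Counter(
        name
        for ts, name in events
        if violation_ts - before <= ts <= violation_ts + after
    )
    return [counts[name] for name in EVENT_TYPES]

# Hypothetical extracted events as (timestamp in seconds, event name).
events = [
    (100, "gc_pause_long"),         # too early, outside the window
    (880, "gc_pause_long"),
    (900, "thread_pool_blocked"),
    (950, "gc_pause_long"),
    (1500, "disk_latency_spike"),   # too late, outside the window
]

vector = event_vector(events, violation_ts=1000)
```

Here `vector` has one slot per event type, so vectors built for different violations can be compared directly in the classification step.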
An SLA violation in a system like Cassandra can have dozens of potential root causes: disk saturation, GC pressure, thread pool saturation, changes in workload behavior, incorrect table settings, and more. Mapping the vector of events we created in the feature extraction step to a suspected root cause that we’ve seen before is what classification algorithms are good at.
At this point, we have a vector of events triggered by some known “bad event” like an SLA violation. This goes through a classification algorithm (many of the classification approaches in the diagram can be effective; which one works best depends on the nature of the data). Every time we see a new issue and label it, the classification algorithms learn and improve, enhancing our ability to respond to similar production issues in the future. We can see when our models get “confused” between two scenarios, and we can adjust the feature engineering process accordingly to create more separation between the vector representations. With all the scenarios we’ve seen, we’re also training predictive models that can warn you when impactful indicators start surfacing, allowing you to respond before your target SLAs are impacted.
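The idea of mapping an event vector to a previously labeled root cause can be sketched with a nearest-centroid classifier. This is only one simple option, picked here because it fits in a few lines. It is not presented as the algorithm used in production, and the labels and training vectors below are invented for the example.

```python
from math import dist

def fit_centroids(labeled_vectors):
    """Average the event vectors seen for each root cause label."""
    sums, counts = {}, {}
    for vector, label in labeled_vectors:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, vector):
    """Return the label whose centroid is closest (Euclidean) to the vector."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))

# Hypothetical labeled incidents. Vector layout (assumed):
# [disk_latency_spike, gc_pause_long, thread_pool_blocked, read_rate_shift]
history = [
    ([0, 3, 1, 0], "gc_pressure"),
    ([0, 2, 2, 0], "gc_pressure"),
    ([4, 0, 1, 0], "disk_saturation"),
    ([3, 0, 0, 1], "disk_saturation"),
]

centroids = fit_centroids(history)
suspected = classify(centroids, [0, 2, 1, 0])
```

Each newly labeled incident appended to `history` shifts the centroids, which is a small-scale version of the models improving as more issues are seen. Two labels whose centroids sit close together are the ones most likely to be confused, which is where extra feature engineering helps.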
At Vorstella, we’re building an AI agent that helps teams manage their data platforms in-house, at scale. We apply a machine learning approach, using classification algorithms, regressions, decision trees, and ensemble learning to help with debugging, tuning, best practices enforcement, and capacity planning. Email us at email@example.com for more information.