3 reasons distributed data systems are a perfect use case for AIOps

Avinash Mandava


We’ve been writing about automating infrastructure operations with machine learning (ML). Out of all the possible infrastructure components, we’ve focused mostly on distributed data systems in our examples. In this post we identify the 3 key properties of distributed data systems that make them ideal targets for machine learning models, and survey which ML approaches can cut through the hype and tangibly improve operator workflows.

AI and ML are not magic. Fully autonomous infrastructure systems aren’t here yet, and you should be very skeptical of anyone claiming to have figured this out already. It’s going to be a journey to get there, but for a lot of operators, there are tangible benefits that machine learning can provide today, like alert noise reduction, fault localization, config recommendations, workload classification, and more.

Different approaches use different levels of machine learning, some applying simple statistical models, and some using more complex approaches like reinforcement learning, bayesian optimization, and neural nets. That means not every system is going to be a great fit for machine learning, and for some applications it’s overkill or underwhelming. But if you’re managing dozens of servers, cross-system communications, and multiple component technologies, there are obvious and tangible benefits. Distributed systems have all of these properties and more.

3 properties of distributed data systems

1. State management brings risk

Managing state creates risk. A data management system is the quintessential example of a stateful system. Data is inserted in, and data is read out. The system has to accept that data and serve it in a timely, highly-available, and accurate manner to be considered stable. The fact that data systems maintain state mean they’re riskier.

This is why enterprises spend so much money on backup software, and why data safety is the foundation of the database reliability hierarchy of needs. In an ideal world, we’d get warned ahead of time before issues cause data loss or downtime.

[Note: The topic of state management is so rich that we’re writing a full post about how different distributed systems can be bucketed by how complex their management of state (the application data) is. Subscribe in the form on the right and we’ll shoot you an email when that post is published.]

2. Distributed clusters bring overwhelming amounts of data

Distribution of processing creates an overwhelming amount of observability data. Lots of servers means lots of activity that needs to be tracked and correlated. As if that weren’t enough, all these servers need to work in concert to maintain data accuracy and consistency. This means you have a lot to keep track of.

Look at a system like Cassandra, that throws off thousands of metrics. Not only do you have to sift through them by node to figure out what’s happening on any given server, but you also need to know how to map each node’s behavior to a higher-order understanding of what the cluster as a whole is doing. One node might be bogged down in compactions, but the request are serving fine. How do you know if this is something to worry about or if you can safely ignore it and let it resolve? In an ideal world we’d cut through the noise and get shown only what we need to know to fix an issue or improve performance and availability.

3. Access pattern variety makes troubleshooting harder

Support for flexible data models leads to difficult debugging scenarios. The more complicated an access pattern is (think the open queries that bring down your Elasticsearch clusters or the lightweight transactions that cripple your Cassandra deployments), the harder the system has to work to either ingest the data safely (e.g. distributed transactions in Cassandra), maintain it safely in the background (e.g. intensive repairs in Cassandra), or serve it effectively (e.g. heavy index scans for an open Elasticsearch query). The work is done somewhere, and the only way to reduce the work is to simplify the access patterns (like in Kafka) so the system doesn’t have to do a ton of work to coordinate operations.

Some systems are highly specialized, but many of the most popular, like databases and search engines, allow for a variety of access patterns. This can be good if you know what you’re doing, but get you in trouble if you don’t (every project has “supported features” experienced users know to be very careful with). And since almost all problems present themselves as latency spikes, throughput dips, or downtime, it’s up to you to go digging into lower level indicators to figure out what’s wrong. In an ideal world we’d get shown a list of possible problems based on past experience, and jump straight to trying out recommended fixes.

So where are the ML opportunities?

So how can we take all these problems distributed systems present and apply ML to create this “ideal world”, where all the complex and tedious stuff is handled and we can focus on making decisions? Well, it’s not a straight execution path. You can’t build some all-knowing model that can do everything. And stitching together a bunch of specific models into a single autonomous operator is both harder than it looks, and premature for most use cases. The right approach is to break down systems operations into a set of problems that operators are responsible for addressing, and choose actual pain points to address with acute models.

Going back the 3 of properties we defined above, we can break down varying levels of ML approaches for each that can benefit operations teams in the near term:

Dealing with information overload - Separating signal from noise

  • Anomaly detection (simple approach): Simple anomaly detection can tell you things like “metric A jumped 50% over the last 30 minutes”, or “variance across nodes for metric B has risen above the 30-day p99 value”. This saves a lot of time setting up alerts and maintaining the set of indicators that you think are relevant. It takes years to build confidence in your approach, and you have to relearn this process every time you use a new, unfamiliar technology.
  • Correlation (medium approach): Correlation of events and metric data can help us filter out noise and This requires a little bit more learning, as we can look over history to build an understanding of which metrics should be correlated with each-other, and then alerting when that correlation deviates from known patterns.
  • Forecasting (complex approach): If we have enough history of a metric we can start to build models to forecast the values we expect to see in the future based on our past data, accounting for noise and outliers.

Dealing with debugging - classifying issues based on past knowledge

  • Decision trees (simple/complex approach): Decision trees can be quite complex, but they can also be simple logical flows that might incorporate small models in nodes of the tree, but mostly look like rule-based systems. For example, a series of if-then statements in a tree, where some of the tree nodes are replaced by lightweight classification algorithms, rather than rules.
  • Classification (medium/complex approach): Again, this can be relatively simple or complicated. We wrote a longer post about this here, check it out for more details. The idea is to create a vector representation of system behavior, and run it through a classifier to map it to the closest previously known/described system state.

Dealing with risk and uncertainty - prediction and optimization

This is the hardest part to automate, but also the most painful and stressful part of system operations, especially for highly customizable systems with a ton of metrics and configuration options. But once we’ve seen enough patterns and behavior, we can start learning about which metrics are leading indicators of bad situations, and which configuration options are optimal for different workloads. All of these are complex approaches, because learning and automating at this level requires complex analysis of massive amounts of information.

  • Neural networks: With all the raw data these systems throw off, we can send raw data into a multi-layer neural network, and over time allow it to learn how different calculated features of inputs create different results.
  • Reinforcement learning: This is perhaps the most obvious approach to control engineering, which as we’ve written is a great abstraction for software systems management. We allow an agent to make changes to a system, observe the state of that system and a “reward” to the agent, and the agent optimize its behavior to maximise its reward. An example might be identifying a set of “stable” system states, and providing a very low reward when the agent puts the system in an unstable state, but a very high reward when the agent puts the system in a more stable state. As it tries to maximize its reward, the agent gets more confident in its understanding of the system’s behavior under different inputs.
  • Bayesian optimization: If we want to offset risk and optimize performance, we need to understand how our system behaves under different conditions like config options, workload patterns, etc. Bayesian optimization allows us to use past observations of system behavior under different conditions to map out the expected performance of that system given a combination of config options (or any input variables). We wrote about doing this for Cassandra here, see that post for more details.

Want to learn more?

If you’d like to see what we’ve got in the works and how we’re using the above ML approaches to make distributed systems operations easier, subscribe to our newsletter for updates, request a demo, or contact us for more information.

Subscribe to our newsletter