We’ve been writing about automating infrastructure operations with machine learning (ML). Out of all the possible infrastructure components, we’ve focused mostly on distributed data systems in our examples. In this post we identify the 3 key properties of distributed data systems that make them ideal targets for machine learning models, and survey which ML approaches can cut through the hype and tangibly improve operator workflows.
AI and ML are not magic. Fully autonomous infrastructure systems aren’t here yet, and you should be very skeptical of anyone claiming to have figured this out already. It’s going to be a journey to get there, but for a lot of operators, there are tangible benefits that machine learning can provide today, like alert noise reduction, fault localization, config recommendations, workload classification, and more.
Different approaches apply different levels of machine learning sophistication, some relying on simple statistical models, and some using more complex techniques like reinforcement learning, Bayesian optimization, and neural nets. Not every system is going to be a great fit for machine learning, and for some applications it’s overkill or underwhelming. But if you’re managing dozens of servers, cross-system communications, and multiple component technologies, there are obvious and tangible benefits. Distributed systems have all of these properties and more.
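To make "simple statistical models" concrete, here is a minimal sketch of alert noise reduction: suppress an alert unless the latest value of a metric deviates sharply from its own recent history. The function name, window, and threshold are all illustrative, not from any real product.

```python
# Hypothetical sketch: suppress alerts on a metric stream unless the latest
# value is a clear outlier against its own recent history (rolling z-score).
# Names and thresholds are illustrative, not from any real product.
from statistics import mean, stdev

def should_alert(history, latest, window=30, threshold=3.0):
    """Alert only when `latest` is more than `threshold` standard
    deviations from the mean of the trailing `window` samples."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# A steady metric with a sudden spike:
steady = [100.0 + (i % 3) for i in range(60)]
print(should_alert(steady, 101.0))  # ordinary value -> False
print(should_alert(steady, 500.0))  # large spike -> True
```

Even a model this simple can cut a surprising amount of pager noise, which is part of why "levels" of ML matter: you don't need a neural net to stop alerting on every threshold blip.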
Managing state creates risk. A data management system is the quintessential example of a stateful system. Data is inserted, and data is read out. The system has to accept that data and serve it in a timely, highly-available, and accurate manner to be considered stable. The fact that data systems maintain state means they’re riskier.
This is why enterprises spend so much money on backup software, and why data safety is the foundation of the database reliability hierarchy of needs. In an ideal world, we’d get warned ahead of time before issues cause data loss or downtime.
[Note: The topic of state management is so rich that we’re writing a full post about how different distributed systems can be bucketed by how complex their management of state (the application data) is. Subscribe in the form on the right and we’ll shoot you an email when that post is published.]
Distribution of processing creates an overwhelming amount of observability data. Lots of servers means lots of activity that needs to be tracked and correlated. As if that weren’t enough, all these servers need to work in concert to maintain data accuracy and consistency. This means you have a lot to keep track of.
Look at a system like Cassandra, which throws off thousands of metrics. Not only do you have to sift through them by node to figure out what’s happening on any given server, but you also need to know how to map each node’s behavior to a higher-order understanding of what the cluster as a whole is doing. One node might be bogged down in compactions while its requests are still being served fine. How do you know if this is something to worry about or if you can safely ignore it and let it resolve? In an ideal world we’d cut through the noise and get shown only what we need to know to fix an issue or improve performance and availability.
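One cheap way to separate "this node is struggling" from "the whole cluster is busy" is to compare each node against its peers rather than against a fixed threshold. The sketch below is illustrative: the metric name and cutoff ratio are invented, and a real pipeline would cross-check against request latency before deciding whether anyone needs to be paged.

```python
# Hypothetical sketch: flag nodes whose metric (say, pending compactions)
# is far above the cluster median -- a node-local anomaly rather than
# cluster-wide load. Metric names and thresholds are illustrative.
from statistics import median

def outlier_nodes(per_node, ratio=3.0, floor=1.0):
    """Return nodes whose value exceeds `ratio` times the cluster median.
    `floor` avoids flagging tiny absolute values when the median is ~0."""
    med = median(per_node.values())
    baseline = max(med, floor)
    return sorted(n for n, v in per_node.items() if v > ratio * baseline)

pending_compactions = {
    "node-a": 4, "node-b": 5, "node-c": 6, "node-d": 48,  # node-d lags badly
}
print(outlier_nodes(pending_compactions))  # ['node-d']
```

Peer comparison scales naturally: if every node's backlog rises together, nothing is flagged, which matches the intuition that cluster-wide load is a capacity question, not a fault.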
Support for flexible data models leads to difficult debugging scenarios. The more complicated an access pattern is (think the open queries that bring down your Elasticsearch clusters or the lightweight transactions that cripple your Cassandra deployments), the harder the system has to work to either ingest the data safely (e.g. distributed transactions in Cassandra), maintain it safely in the background (e.g. intensive repairs in Cassandra), or serve it effectively (e.g. heavy index scans for an open Elasticsearch query). The work is done somewhere, and the only way to reduce the work is to simplify the access patterns (like in Kafka) so the system doesn’t have to do a ton of work to coordinate operations.
Some systems are highly specialized, but many of the most popular, like databases and search engines, allow for a variety of access patterns. This can be good if you know what you’re doing, but get you in trouble if you don’t (every project has “supported features” experienced users know to be very careful with). And since almost all problems present themselves as latency spikes, throughput dips, or downtime, it’s up to you to go digging into lower level indicators to figure out what’s wrong. In an ideal world we’d get shown a list of possible problems based on past experience, and jump straight to trying out recommended fixes.
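That "list of possible problems based on past experience" can be sketched as nearest-neighbor matching: encode the observed symptoms as a vector and rank previously diagnosed issues by distance. The issue catalog and symptom features below are entirely invented for illustration.

```python
# Hypothetical sketch: rank known issues by similarity to observed symptoms.
# The catalog entries and feature encoding are invented for illustration.
import math

KNOWN_ISSUES = {
    # (latency_spike, throughput_dip, disk_io, gc_pause) -> likely cause
    (1.0, 0.2, 0.9, 0.1): "compaction backlog saturating disk I/O",
    (1.0, 0.8, 0.1, 0.9): "JVM GC pauses from oversized heap",
    (0.3, 1.0, 0.1, 0.1): "hot partition overloading one replica",
}

def diagnose(symptoms, top=2):
    """Rank known issues by Euclidean distance to the observed symptoms."""
    ranked = sorted(
        KNOWN_ISSUES.items(),
        key=lambda kv: math.dist(kv[0], symptoms),
    )
    return [cause for _, cause in ranked[:top]]

# Latency up, disk I/O high, throughput and GC mostly normal:
print(diagnose((0.9, 0.3, 0.8, 0.2)))
```

The hard part in practice isn't the distance function, it's building the labeled catalog of past incidents, which is why operator experience stays in the loop.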
So how can we take all these problems distributed systems present and apply ML to create this “ideal world”, where all the complex and tedious stuff is handled and we can focus on making decisions? Well, it’s not a straight execution path. You can’t build some all-knowing model that can do everything. And stitching together a bunch of specific models into a single autonomous operator is both harder than it looks, and premature for most use cases. The right approach is to break down systems operations into a set of problems that operators are responsible for addressing, and choose actual pain points to address with targeted models.
Going back to the 3 properties we defined above, we can break down varying levels of ML approaches for each that can benefit operations teams in the near term:
This is the hardest part to automate, but also the most painful and stressful part of system operations, especially for highly customizable systems with a ton of metrics and configuration options. But once we’ve seen enough patterns and behavior, we can start learning about which metrics are leading indicators of bad situations, and which configuration options are optimal for different workloads. All of these are complex approaches, because learning and automating at this level requires complex analysis of massive amounts of information.
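One way to start "learning which metrics are leading indicators" is simple lagged correlation: shift each candidate metric back in time and correlate it with an incident flag. A high score means the metric tends to move before trouble shows up. The data and metric name below are invented for illustration, and a production version would need far more data and care about confounders.

```python
# Hypothetical sketch: score a metric as a leading indicator of incidents
# by correlating metric[t] with incident[t + lag]. Data is invented.
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def leading_score(metric, incidents, lag=3):
    """High score: the metric moves `lag` steps *before* incidents do."""
    return pearson(metric[:-lag], incidents[lag:])

# pending_repairs spikes exactly 3 steps before each incident:
pending = [1, 1, 9, 1, 1, 1, 9, 1, 1, 1]
incident = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
print(round(leading_score(pending, incident, lag=3), 2))  # 1.0
```

Ranking many metrics by this score (across several lags) is a crude but honest first pass at the "leading indicators" problem, and a sanity baseline before reaching for the complex approaches mentioned above.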
If you’d like to see what we’ve got in the works and how we’re using the above ML approaches to make distributed systems operations easier, subscribe to our newsletter for updates, request a demo, or contact us for more information.