Engineering teams are being asked to manage a growing amount of infrastructure software, such as databases, search engines, and message queues. To operators, these systems often feel like black boxes that are hard to observe and manage. In this post we draw on control theory to propose a framework that operators can apply to simplify infrastructure software management.
This post builds on both control theory and ongoing work from the software observability community. Some of the more prolific writers in the observability community are Cindy Sridharan, Charity Majors, and Jaana Dogan. See their work for a more detailed background on the observability space.
Control theory is a field that deals with the control of continuously operating dynamical systems. Our goal in applying control theory is to come up with a control model we can use to keep a system stable. Central to control theory are the observability and controllability of the system being controlled.
If a system is sufficiently observable and controllable, our control model can take a target system state, continuously observe the system’s actual state, and apply any corrections needed to keep that system in its target state.
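As a rough sketch, this feedback loop looks like the following in code; `observe_state`, `compute_correction`, and `apply_correction` are hypothetical placeholders for system-specific logic, not a real library API.

```python
import time

def control_loop(target_state, interval_seconds=30):
    """Drive a system toward its target state (illustrative sketch).

    observe_state(), compute_correction(), and apply_correction() are
    hypothetical, system-specific helpers.
    """
    while True:
        actual_state = observe_state()                       # observability
        if actual_state != target_state:
            correction = compute_correction(target_state, actual_state)
            apply_correction(correction)                     # controllability
        time.sleep(interval_seconds)
```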
Control theory is a useful abstraction for managing software systems. First let’s rephrase the key components of control theory in software-specific terms.
Imagine our job is to keep a running infrastructure deployment, such as a database, available and performant. Essentially, given availability and performance targets, we need to develop a control model that ensures our database hits those targets.
Observability and controllability are simply properties of the system. A control model needs to make use of the system’s observability and controllability properties to maintain stability. This involves:
Interpretation: deriving the system’s actual state from its observable outputs (observability).
Remediation: applying corrective inputs to move the system back toward its target state (controllability).
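Here is a sketch of how these two pieces might fit together for a database; the state fields, thresholds, and helpers like `restart_unhealthy_nodes` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DatabaseState:
    availability: float    # fraction of requests that succeed, e.g. 0.999
    p99_latency_ms: float  # 99th percentile request latency

def interpret(metrics: dict) -> DatabaseState:
    """Interpretation: derive actual state from observable outputs."""
    requests = max(metrics["requests"], 1)
    return DatabaseState(
        availability=1.0 - metrics["errors"] / requests,
        p99_latency_ms=metrics["p99_latency_ms"],
    )

def remediate(target: DatabaseState, actual: DatabaseState) -> None:
    """Remediation: apply corrective inputs (hypothetical actions)."""
    if actual.availability < target.availability:
        restart_unhealthy_nodes()  # hypothetical helper
    if actual.p99_latency_ms > target.p99_latency_ms:
        add_capacity()             # hypothetical helper
```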
In your bespoke applications you have complete control over observability and controllability, so with enough effort you can confidently own your operations strategy (i.e., build a good control model). But with infrastructure software, observability tends to break down.
It will help to describe more precisely what we mean by an infrastructure software system. In an infrastructure software system:
The software is off-the-shelf (OTS), developed and maintained by a third party.
The software is operated by teams who did not write it and rarely read its source.
The software exposes a fixed set of built-in metrics and logs that operators cannot easily change.
To operators, infrastructure often feels like a black box. You don’t control the metrics and logs the systems use to express themselves, and you need deep expertise and knowledge of software internals and source code to interpret a system’s state from the metrics and logs you can observe.
This often creates a ceiling for observability in your application stack - you can instrument your applications and services all you want, but you’ll eventually run into a black box OTS system down the line (e.g. when a trace shows a slow database query, but doesn’t give you insight into the database itself).
We hope the following framework can help operators more easily interpret system state from the external outputs of their infrastructure. This framework can also serve as a guide for infrastructure software developers to build better observability into the systems they create. We will draw examples from our experience managing Apache Cassandra clusters. Future posts will apply this framework to other popular infrastructure tech.
We can build an understanding of a system’s state by answering the following questions:
What’s going into and out of the system?
What’s happening inside the system?
What’s happening between the system and the operating environment?
We propose a simple interpretation approach, categorizing all available external outputs into four Golden Indicators (throughput, latency, errors, saturation) that help us answer the above questions. What follows is a simplified illustration of how this framework applies to Apache Cassandra. A future post will apply the same framework in greater depth.
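As a lightweight illustration, each external output can be tagged with the question it answers and the Golden Indicator it represents; the metric names below are generic placeholders, not any system's real metric names.

```python
# Map each external output to (question it answers, Golden Indicator).
OUTPUT_CATEGORIES = {
    "requests_per_second": ("in/out of the system",    "throughput"),
    "p99_latency_ms":      ("in/out of the system",    "latency"),
    "request_timeouts":    ("in/out of the system",    "errors"),
    "pending_tasks":       ("inside the system",       "saturation"),
    "dropped_messages":    ("inside the system",       "errors"),
    "disk_util_percent":   ("system vs. environment",  "saturation"),
}
```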
Cassandra’s inputs and outputs are write and read operations, respectively. We need to observe the following metrics to get a clear understanding of this layer:
Throughput: read and write operations per second.
Latency: read and write latency percentiles (e.g., p50, p99).
Errors: timeouts and unavailable errors for reads and writes.
Saturation: pending and blocked client requests.
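These indicators map onto Cassandra’s ClientRequest metrics, which are exposed over JMX. The sketch below just groups the relevant MBean names; the `read_jmx` helper is a hypothetical wrapper around whatever JMX client you use.

```python
# Cassandra publishes request metrics over JMX under
# org.apache.cassandra.metrics:type=ClientRequest.
CLIENT_REQUEST_METRICS = {
    "latency (and throughput, via its count)": [
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
    ],
    "errors": [
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts",
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Timeouts",
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Unavailables",
        "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Unavailables",
    ],
}

def snapshot(read_jmx):
    """Collect a point-in-time view of the request layer.

    read_jmx is a hypothetical callable (mbean_name -> value) wrapping
    your JMX client of choice.
    """
    return {name: read_jmx(name)
            for names in CLIENT_REQUEST_METRICS.values()
            for name in names}
```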
Cassandra has multiple internal subsystems including:
The write path (commit log, memtables, and flushing to SSTables).
The read path (key/row caches, bloom filters, and SSTable reads).
Compaction.
Streaming and repair.
Gossip and failure detection.
Each of these subsystems has its own set of indicators that provide a picture of its state. For example, here are some of the available Golden Indicators for the write subsystem:
Throughput: mutations processed per second.
Latency: local write latency.
Errors: dropped mutations and write timeouts.
Saturation: pending and blocked tasks in the MutationStage and MemtableFlushWriter thread pools, and commit log backlog.
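One convenient, if coarse, source of the saturation indicators is `nodetool tpstats`. A minimal parsing sketch follows; it assumes the "Pool Name / Active / Pending / ..." column layout of recent Cassandra versions.

```python
import subprocess

def write_path_saturation():
    """Pull write-path saturation signals from `nodetool tpstats`.

    MutationStage processes incoming writes; MemtableFlushWriter flushes
    memtables to disk. Pending tasks in either pool signal saturation.
    """
    out = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
    signals = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] in ("MutationStage", "MemtableFlushWriter"):
            signals[fields[0]] = {"active": int(fields[1]),
                                  "pending": int(fields[2])}
    return signals
```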
Standard operating system metrics can tell us a lot, especially when correlated with events happening at the software level. Knowing that disk saturation coincides with blocked write threads in Cassandra gives us a good indicator that our system doesn’t have enough I/O capacity to handle our workload. Details about how to improve operating system visibility are beyond the scope of this discussion. Our focus should remain on:
Observing saturation and errors for the underlying resources (CPU, memory, disk, network).
Correlating resource-level indicators with state changes inside the software.
Using the Cassandra write path example above, we may want to correlate saturation in Cassandra’s write path with saturation of the underlying resources. Saturation in the write path could be caused by excessive CPU utilization, making it difficult to accept new in-memory mutations; it could alternatively be caused by saturation at the disk level, blocking memtable flushes.
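A crude version of this correlation might combine the write-path signals from the earlier sketch with OS counters from psutil; the threshold and the attribution logic are illustrative, not a production heuristic.

```python
import psutil

def diagnose_write_saturation(signals, cpu_threshold=85.0):
    """Attribute write-path saturation to CPU vs. disk (illustrative).

    signals is the dict returned by write_path_saturation() above.
    """
    cpu = psutil.cpu_percent(interval=1)   # system-wide CPU utilization %
    disk = psutil.disk_io_counters()       # cumulative disk I/O counters
    pending_writes = signals.get("MutationStage", {}).get("pending", 0)
    pending_flushes = signals.get("MemtableFlushWriter", {}).get("pending", 0)

    if pending_writes > 0 and cpu > cpu_threshold:
        return "likely CPU-bound: mutations backing up while CPU is saturated"
    if pending_flushes > 0:
        return ("likely I/O-bound: %d flushes pending, %d ms spent on writes"
                % (pending_flushes, disk.write_time))
    return "no clear write-path bottleneck"
```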
In future posts we will apply the above framework to a variety of distributed data systems including Cassandra, Kafka, Elasticsearch and more. We will also explore topics relating to the second half of our control theory model, Controllability and Remediation. We hope this starts an earnest conversation between the communities and vendors behind these systems and the operators that run them, with a goal of improving usability and adoption and giving operators the tools and information they need to avoid operational bottlenecks.