Control theory and infrastructure management

Avinash Mandava


Engineering teams are being asked to manage a growing amount of infrastructure software like databases, search engines, and message queues. To operators, these systems often feel like black boxes that are hard to observe and manage. In this post we draw from control theory to propose a framework that operators can apply to simplify infrastructure software management.

This post builds on both control theory and ongoing work from the software observability community. Some of the more prolific writers in the observability community are Cindy Sridharan, Charity Majors, and Jaana Dogan. See their work for a more detailed background on the observability space.

Control theory - basic concepts

Control theory is a field that deals with the control of continuously operating dynamical systems. Our goal in applying control theory is to come up with a control model we can use to keep a system stable. Central to control theory are the observability and controllability of the system being controlled.

  • Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  • Controllability is a measure of the ability of an external input to move the internal state of a system from any initial state to any other final state in a finite time interval.


If a system is sufficiently observable and controllable, our control model can take a target system state, continuously observe the system’s actual state, and apply any corrections needed to keep that system in its target state.
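The loop described above can be sketched in code. This is a minimal illustration, not a real operations system: the `observe`, `interpret`, and `remediate` hooks are hypothetical placeholders standing in for monitoring, diagnosis, and operator action.

```python
# A minimal control-loop sketch. observe(), interpret(), and
# remediate() are hypothetical hooks, not a real API.

def control_loop(target_state, observe, interpret, remediate):
    """One iteration: read outputs, infer state, correct if needed."""
    outputs = observe()                  # external outputs (metrics, logs)
    state = interpret(outputs)           # infer internal state from outputs
    if state != target_state:
        remediate(state, target_state)   # apply external inputs (corrections)
    return state

# Toy example: keep a queue's depth below a threshold.
queue = {"depth": 120}
state = control_loop(
    target_state="stable",
    observe=lambda: queue,
    interpret=lambda o: "stable" if o["depth"] < 100 else "saturated",
    remediate=lambda s, t: queue.update(depth=50),  # e.g. add capacity
)
```

In a real deployment each hook is far richer, but the shape is the same: observe, interpret, compare to target, correct.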

Applying control theory to infrastructure management

Control theory is a useful abstraction for managing software systems. First let’s rephrase the key components of control theory in software-specific terms.

  • Internal states are the set of possible behaviors we may see in a system. A system’s state can be expressed in a number of ways, such as in relation to a threshold (“service level indicator above target threshold”) or through a custom definition (“healthy enough that nobody should complain”).
  • External outputs are metrics, logs, events or other information the system outwardly expresses about itself.
  • External inputs are actions operators can take that affect system behavior, like allocating more system resources, tuning settings, or changing the workload against the system.
  • Our control model is the operational model we use, encompassing how we monitor and manage our systems.
  • Stability occurs any time our system state is within the set of states we deem desirable.

Observability & Controllability != Interpretation & Remediation


Imagine our job is to keep a running infrastructure deployment like a database up and performant. Essentially, given availability and performance targets, we need to develop a control model to ensure that our database hits those targets.

Observability and controllability are simply properties of the system. A control model needs to make use of the system’s observability and controllability properties to maintain stability. This involves:

  • Interpretation: Translating an observable system’s external outputs to known system states
  • Remediation: Identifying the right external inputs to adjust to return a system to a stable state
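The two steps can be pictured as lookups: interpretation maps observed outputs to a known state, and remediation maps that state to an input worth adjusting. The metric names, thresholds, and suggested actions below are hypothetical illustrations, not Cassandra defaults.

```python
# Sketch of interpretation and remediation as two mapping steps.
# Metric names and thresholds are illustrative assumptions.

def interpret(outputs):
    """Translate external outputs into a known system state."""
    if outputs["dropped_mutations"] > 0:
        return "write_path_unhealthy"
    if outputs["pending_mutations"] > 1000:
        return "write_path_saturated"
    return "stable"

REMEDIATIONS = {
    # known state -> external input an operator could adjust
    "write_path_unhealthy": "check disk i/o; throttle the write workload",
    "write_path_saturated": "add capacity or tune flush settings",
}

state = interpret({"dropped_mutations": 0, "pending_mutations": 4200})
action = REMEDIATIONS.get(state)  # None when the system is stable
```

Building these two mappings is exactly where the deep expertise lives; the rest of this post is about making the interpretation half more tractable.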

Building control models for infrastructure is hard

In your bespoke applications you have complete control over observability and controllability, so with enough effort you can confidently own your operations strategy (i.e., build a good control model). But with infrastructure software, observability tends to break down.

It will help to provide a more detailed description of what we mean by an infrastructure software system. In an infrastructure software system:

  • You download, configure, and run a binary created by someone else, usually a vendor or open-source community.
  • You didn’t write the code, so you are limited to the metrics, logs, and events that the authors and maintainers decided to expose.
  • Work you do to manage and maintain these systems is not differentiated.
  • Examples: Databases, message queues, load balancers, any daemon process you configure and deploy.

To operators, infrastructure often feels like a black box. You don’t control the metrics and logs the systems use to express themselves, and you need deep expertise and knowledge of software internals and source code to interpret a system’s state from the metrics and logs you can observe.


This often creates a ceiling for observability in your application stack - you can instrument your applications and services all you want, but you’ll eventually run into a black-box off-the-shelf (OTS) system down the line (e.g. when a trace shows a slow database query, but doesn’t give you insight into the database itself).

A framework for infrastructure observability and interpretation

We hope the following framework can help operators more easily interpret system state from the external outputs of their infrastructure. This framework can also serve as a guide for infrastructure software developers to build better observability into the systems they create. We will draw examples from our experience managing Apache Cassandra clusters. Future posts will apply this framework to other popular infrastructure tech.

Questions we want to answer about system state

We can build an understanding of a system’s state by answering the following questions:

  1. What’s going into and out of the system?

    • State inquiry example: “What type of workload is my cluster servicing?”
    • External output example: Histogram of mutation sizes
  2. What’s happening inside the system?

    • State inquiry example: “Is my write path healthy?”
    • External output example: Dropped mutation count
  3. What’s happening between the system and the operating environment?

    • State inquiry example: “Does my cluster have enough resources?”
    • External output example: CPU %

We propose a simple interpretation approach: categorize all available external outputs into the four Golden Indicators (throughput, latency, errors, saturation) that help us answer the above questions. What follows is a simplified illustration of how this framework can apply to Apache Cassandra. A future post will apply the same framework in greater depth.
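This categorization can be kept as a simple catalog: each metric is tagged with the question it answers and the Golden Indicator it belongs to. The metric names below mirror the Cassandra examples in this post, but the catalog itself is an illustrative sketch, not an exhaustive mapping.

```python
# A sketch of cataloging external outputs by question and Golden Indicator.
# Metric names are illustrative, loosely based on Cassandra's outputs.

CATALOG = {
    # metric name: (question it answers, golden indicator)
    "request_throughput": ("inputs/outputs", "throughput"),
    "request_latency":    ("inputs/outputs", "latency"),
    "dropped_mutations":  ("inside",         "errors"),
    "pending_mutations":  ("inside",         "saturation"),
    "cpu_percent":        ("environment",    "saturation"),
}

def indicators_for(question):
    """All (metric, indicator) pairs relevant to one question."""
    return sorted(
        (name, ind) for name, (q, ind) in CATALOG.items() if q == question
    )

inside = indicators_for("inside")
```

A catalog like this turns an undifferentiated wall of metrics into a checklist per question, which is most of the battle in interpretation.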

What’s going into and out of the system?

Cassandra’s inputs and outputs are write and read operations, respectively. We need to observe the following metrics to get a clear understanding of this layer:

  • Throughput: Volume and types of requests
  • Latency: Request response duration
  • Errors: Volume and types of request failures
  • Saturation: Size of request queues

What’s happening inside the system?

Cassandra has multiple internal subsystems including:

  • Write: Processes that handle write requests, replication, memtable management, and flushes to disk
  • Read: Processes that handle read requests and disk reads
  • Background: Processes that compact files together and perform repairs.
  • Inter-node communication: Processes that handle node health checks and communication of cluster status updates.

Each of these subsystems has its own set of indicators that provide a picture of its state. For example, here are some of the available Golden Indicators for the write subsystem:

  • Throughput: completed mutations, completed flush writers, memtable live data size
  • Latency: write request latency
  • Errors: write request failures, blocked flush writers, dropped mutations
  • Saturation: pending mutations, pending flush writers
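These indicators compose naturally into a health check for the subsystem: check error indicators first, then saturation, and only call the write path healthy when both are clear. The thresholds and metric names in this sketch are illustrative assumptions, not Cassandra defaults.

```python
# Sketch of a write-subsystem health check built from the indicators
# above. Thresholds are illustrative assumptions, not Cassandra defaults.

def write_path_state(metrics):
    """Check from most to least severe; return the first state that matches."""
    if metrics["dropped_mutations"] > 0 or metrics["blocked_flush_writers"] > 0:
        return "errors"        # work is being dropped or blocked
    if metrics["pending_mutations"] > 10_000 or metrics["pending_flush_writers"] > 2:
        return "saturated"     # backlog is building up
    return "healthy"

state = write_path_state({
    "dropped_mutations": 0,
    "blocked_flush_writers": 0,
    "pending_mutations": 25_000,
    "pending_flush_writers": 1,
})
```

The same shape repeats for the read, background, and inter-node subsystems, each with its own indicators and thresholds.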

What’s happening between the system and the operating environment?

Standard operating system metrics can tell us a lot, especially when correlated with events happening at the software level. Knowing that disk saturation coincides with blocked write threads in Cassandra gives us a good indicator that our system doesn’t have enough i/o to handle our workload. Details about how to improve operating system visibility are beyond the scope of this discussion. Our focus should remain on:

  1. How to correlate the external output of infrastructure with the external output of its operating environment
  2. How to improve external output of infrastructure to give operators better signals about the interaction with the operating environment.

Using the Cassandra write path example above, we may want to correlate saturation in Cassandra’s write path to saturation of underlying resources. Saturation in the write path could be caused by excessive CPU utilization, making it difficult to accept new in-memory mutations. Write path saturation can alternatively be caused by saturation at the disk level, blocking flushes to disk.
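One simple way to attack that correlation question is to count, across metric samples, how often write-path saturation co-occurs with saturation of each underlying resource. The sample data and the co-occurrence heuristic below are illustrative assumptions; real correlation analysis would use longer windows and proper statistics.

```python
# Sketch: correlate write-path saturation with OS-level saturation to
# suggest a likely bottleneck. Sample data and thresholds are invented
# for illustration.

def likely_bottleneck(samples):
    """Count co-occurrence of write-path saturation with each resource."""
    counts = {"cpu": 0, "disk": 0}
    for s in samples:
        if s["pending_mutations"] > 1000:     # write path is saturated
            if s["cpu_percent"] > 90:
                counts["cpu"] += 1
            if s["disk_util_percent"] > 90:
                counts["disk"] += 1
    return max(counts, key=counts.get) if any(counts.values()) else None

samples = [
    {"pending_mutations": 5000, "cpu_percent": 40, "disk_util_percent": 98},
    {"pending_mutations": 3000, "cpu_percent": 35, "disk_util_percent": 95},
    {"pending_mutations": 200,  "cpu_percent": 20, "disk_util_percent": 30},
]
bottleneck = likely_bottleneck(samples)
```

Here the write-path backlog lines up with disk saturation rather than CPU, pointing the operator at i/o capacity first.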

Future work

In future posts we will apply the above framework to a variety of distributed data systems including Cassandra, Kafka, Elasticsearch and more. We will also explore topics relating to the second half of our control theory model, Controllability and Remediation. We hope this starts an earnest conversation between the communities and vendors behind these systems and the operators that run them, with a goal of improving usability and adoption and giving operators the tools and information they need to avoid operational bottlenecks.
