Data platform operations done right

We’ve seen dozens of companies struggle to manage open-source data platform technologies like Cassandra, Kafka, and Elasticsearch in-house, at scale. This post discusses the challenges we’ve seen and the approaches teams take to deal with them.

More tech = more problems

Managing distributed applications is hard for lots of reasons. Two big ones:

  1. There are way more moving parts in a single application deployment than ever before.
  2. A lot of these moving parts are “component software”, or software someone else (like an open-source community or a vendor) packaged and released. You just download it, install it, and put it in production.

It’s one thing to instrument your bespoke applications for observability. You write the code and control what visibility means. If you want to report on something like request latency from your application, just emit an event to your monitoring infrastructure from your code. You always understand what it means when you see it show up in your dashboard. Distributed applications make this tougher (a distributed trace is a bit more involved than a simple log message), but it’s all within the developer’s control.
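As a rough illustration of that kind of in-code instrumentation, here is a minimal sketch in Python. The names emit_event, timed, and handle_request are hypothetical, and the "monitoring infrastructure" is stood in for by any callable that accepts a JSON string (print by default); a real setup would swap in its own client.

```python
import json
import time
from functools import wraps


def emit_event(name, fields, sink=print):
    """Hypothetical stand-in for a monitoring client: sends one event as a JSON string."""
    sink(json.dumps({"event": name, **fields}))


def timed(event_name, sink=print):
    """Decorator that measures a call's wall-clock latency and emits it as an event."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Emit even if the handler raises, so failed requests are still measured.
                latency_ms = (time.perf_counter() - start) * 1000.0
                emit_event(
                    event_name,
                    {"handler": func.__name__, "latency_ms": round(latency_ms, 3)},
                    sink,
                )
        return wrapper
    return decorator


@timed("request_latency")
def handle_request(payload):
    # Hypothetical application handler; the real work would happen here.
    return {"status": "ok", "echo": payload}
```

Calling handle_request({"id": 1}) returns the handler’s result and sends one request_latency event to the sink. Because you chose the event name and fields yourself, there is no ambiguity about what the number on the dashboard means.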

But when you get into the component technologies themselves (databases, message queues, and compute frameworks that back the services your application calls), things get tricky. Now you have to understand all the internals of component technologies like Cassandra, Kafka, and Elasticsearch.

The expertise tax

Nobody has the time to learn the ins and outs of every component tech their application touches, especially when dozens of different components can contribute to a single application feature. We can’t spend months planning, tuning, and sizing every part of our stack to perfection before we release. Instead, we ship fast, react to issues as they happen, read docs when things really go wrong, and learn as we go.

But the level of acceptable risk differs from company to company, and from team to team. Not everyone has the knowledgeable staff or the time to roll the dice on “figuring it out as they go”. Many businesses need to reach a certain level of proficiency before they’re comfortable running a new tech in production at scale, because running it without that proficiency introduces unacceptable production risk. We call the effort to reach that level of proficiency the “expertise tax”. If you’re in this position, there are two possible approaches:

  1. Centralize expertise: You pick a few technologies that your central expert teams can handle, and force-fit your developers’ problems to those systems. Inevitably, your central expert teams will drown in support requests as a gap opens between how the tech should be used and how it is actually being used to achieve business goals.
  2. Make developers responsible for everything: You completely decentralize, and developers use whatever tech they want, however they want. You make them responsible for their own mistakes and accept the huge risks that come with such a free-wheeling approach. No doubt, this method gets your projects released faster, but it also inevitably leads to inefficiencies and instability that can hurt the reputation of your business.

Lowering the expertise tax bill

The best organizational approach we’ve seen to manage data platforms is a combination of Site Reliability Engineering and Database Engineering. The role of Database Reliability Engineer (DBRE) is growing popular with companies that run open-source tech in-house and offer it as a service to their development teams.

One goal of the DBRE approach is to cut out the repetitive work individual DevOps teams had been doing, like building deployment automation and setting up monitoring and alerting infrastructure. Standardizing these to one set of tools makes it easy for developers to provision and modify clusters in development.

Responsibilities are split, with developers being responsible for their application’s usage of a service like a Cassandra cluster, and the DBRE team available to enforce best practices, consult on architecture, and help resolve production issues in a pinch. These are a few important things a DBRE has to do:

  • Build deployment automation
  • Plan capacity
  • Enforce best practices and debug production issues
  • Tune performance
  • Integrate systems with monitoring and alerting infrastructure

This approach is promising, but not all of a DBRE’s work can be built once and then maintained the way deployment automation can. Debugging, tuning, and capacity planning all require deep human expertise. A single expert can only scale so far, and how many DBREs can a company really find and keep? They’re hard to hire, especially when you’re priced out of the market by tech giants with endless cash. And doing this for every backend technology your team wants to use multiplies the complexity. Fortunately there’s a solution, and in a later blog post we will discuss how to use AI to help teams manage any system at scale.

If you’re looking to learn more…

At Vorstella, we’re building AI that helps your team run component technologies like databases, message queues, and search engines in-house, at scale. We apply various machine learning methods, including classification algorithms, regression, decision trees, and ensemble learning, to help you debug and tune your clusters, enforce best practices, and plan for scale.

Email us at [email protected] for more information.

Written by Avinash Mandava