We’ve seen dozens of companies struggle to manage open-source data platform technologies like Cassandra, Kafka, and Elasticsearch in-house, at scale. This post discusses the challenges we’ve seen and the approaches teams take to deal with them.
Managing distributed applications is hard for lots of reasons. Two big ones:
It’s one thing to instrument your bespoke applications for observability. You write the code and control what visibility means. If you want to report on something like request latency from your application, just emit an event to your monitoring infrastructure from your code. You always understand what it means when you see it show up in your dashboard. Distributed applications make this tougher (a distributed trace is a bit more involved than a simple log message), but it’s all within the developer’s control.
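To make the point concrete, here's a minimal sketch of what "just emit an event" looks like for request latency. Everything here is illustrative: `timed_request` and the event shape are hypothetical, and in production the `emit` callback would be a client for your actual monitoring pipeline (StatsD, Prometheus, etc.).

```python
import json
import time

def timed_request(handler, emit):
    """Run a request handler and emit a latency event to monitoring.

    `emit` is a stand-in for whatever ships events to your monitoring
    infrastructure; here it's just a callback so the sketch is runnable.
    """
    start = time.monotonic()
    result = handler()
    latency_ms = (time.monotonic() - start) * 1000.0
    # The developer controls the event's name and meaning end to end,
    # so there's no ambiguity when it shows up on a dashboard.
    emit(json.dumps({"metric": "request.latency_ms", "value": round(latency_ms, 2)}))
    return result

events = []
timed_request(lambda: "ok", events.append)
print(len(events))  # one latency event was emitted
```

Because you wrote both the instrumentation and the code it measures, interpreting the metric is trivial; that's exactly the property you lose once the latency originates deep inside a component technology you didn't write.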
But when you get into the component technologies themselves (databases, message queues, and compute frameworks that back the services your application calls), things get tricky. Now you have to understand all the internals of component technologies like Cassandra, Kafka, and Elasticsearch.
Nobody has the time to learn the ins and outs of every component tech their application touches, especially when dozens of different components can contribute to a single application feature. We can’t spend months planning, tuning, and sizing every part of our stack to perfection before we release. Instead, we ship fast, react to issues as they happen, read the docs when things really go wrong, and learn as we go.
But the level of acceptable risk differs from company to company, and from team to team. Not everyone has the knowledgeable staff or the time to roll the dice on “figuring it out as they go”. Many businesses need to reach a certain level of proficiency before they’re comfortable running a new technology in production at scale; failing to do so introduces unacceptable production risk. We call the effort to reach that level of proficiency the “expertise tax”. If you’re in this position, there are two possible approaches:
The best organizational approach we’ve seen to managing data platforms is a combination of Site Reliability Engineering and Database Engineering. The role of Database Reliability Engineer (DBRE) is growing in popularity at companies that run open-source tech in-house and offer it as a service to their development teams.
One goal of the DBRE approach is to cut out the repetitive work individual DevOps teams have been doing, like building deployment automation and setting up monitoring and alerting infrastructure. Standardizing on one set of tools makes it easy for developers to provision and modify clusters in development.
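One way that standardization shows up in practice is a shared provisioning layer that merges a developer's cluster request with team-wide defaults. The sketch below is hypothetical (the field names and defaults are invented for illustration), but it captures the idea: developers describe what they need, and the DBRE team's baseline for durability and monitoring is applied automatically.

```python
# Team-wide baseline every cluster gets, regardless of who requests it.
# These field names and values are illustrative, not from any real tool.
STANDARD_DEFAULTS = {
    "replication_factor": 3,
    "monitoring_enabled": True,
    "backup_schedule": "daily",
}

def apply_standards(requested_spec):
    """Merge a developer's cluster request with the team's defaults."""
    spec = dict(STANDARD_DEFAULTS)
    spec.update(requested_spec)
    # Monitoring is non-negotiable: re-assert it even if the request
    # tried to turn it off.
    spec["monitoring_enabled"] = True
    return spec

print(apply_standards({"name": "orders-cassandra", "nodes": 6}))
```

A developer only has to specify what's unique to their cluster; everything else comes from one vetted configuration instead of being rebuilt per team.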
Responsibilities are split, with developers being responsible for their application’s usage of a service like a Cassandra cluster, and the DBRE team available to enforce best practices, consult on architecture, and help resolve production issues in a pinch. These are a few important things a DBRE has to do:
This approach is promising, but not all of the work a DBRE does can be built once and maintained like deployment automation. Debugging, tuning, and capacity planning all require deep human expertise. A single expert can only scale so far, and how many DBREs can a company really find and keep? They’re hard to hire, especially when you’re priced out of the market by tech giants with endless cash. And doing this for every backend technology your team wants to use multiplies the complexity. Fortunately, there’s a way out, and in a later blog post we’ll discuss how to use AI to help teams manage any system at scale.
At Vorstella, we’re building AI that helps your team run component technologies like databases, message queues, and search engines in-house, at scale. We apply machine learning methods, including classification algorithms, regressions, decision trees, and ensemble learning, to help you debug and tune your clusters, enforce best practices, and plan for scale.
Email us at [email protected] for more information.