Software ownership is a huge part of successful development. When done right, DevOps makes developers ultimately responsible for the uptime and performance of their applications. This makes sense: who better to own the results of code than the person who created it? But the code a developer writes from scratch is just the tip of the iceberg for production operations. Underneath the application is every system, server, cloud service, or in-house service that the application calls to do its work. We broadly refer to all of these components as “infrastructure”.
If we want to encourage software ownership without interfering with developer productivity, we need to make it easier for developers to use all the infrastructure available to them without adding to their operational burden. Making developers more “ops aware” doesn’t mean making ops dominate their workday. It means finding ways to automate workflows, abstract away unnecessary complexity, and allow for greater speed without limiting flexibility. If we do our jobs right as platform engineers, developers will spend very little time considering and maintaining the infrastructure they use, and spend all their time writing the code that helps grow the business.
Infrastructure is getting more complicated and diverse. In the old days of a clear Dev/Ops split, we used to consider infrastructure to be just the compute, network, and storage resources our applications ran on. Over time, things like databases and object storage were added to the list. Post-cloud and post-DevOps, where there’s a specialized system for everything, the list of components we can categorize as “infrastructure” has become very long:
We asked both a VC community (who see a wide scope of entrepreneurs in the infrastructure space) and the software observability community (who monitor the full gamut of infrastructure across the industry) what they call all the infrastructure we work with: databases, search engines, message queues, streaming frameworks, anything that touches data. Almost everyone put it in the infrastructure bucket, but acknowledged a growing separation of infrastructure categories in the market.
We like to conceptually bucket infrastructure by how much work it takes for a developer to “own” it as part of their application. That way we can extrapolate what types of tools and organizational support systems we need to put in place to make it easy for devs to take ownership, but not too hard for central teams to support. We see three buckets, with varying levels of effort for developers to “own”:
Compute resources have been almost completely abstracted away, and we’re firmly past the point of developers having to care about hardware. We don’t even really notice the difference between instances from different cloud providers. Sure, we could say “provider X has better I/O on their SSDs than provider Y,” but for most developers, that’s pretty far down the stack of things they have to care about. Furthermore, you more or less take what you get at the compute/storage infrastructure level. The only cost-effective solutions require providers to develop huge economies of scale, so as consumers of this infrastructure, we’ve always been “price takers.” There isn’t really even an opportunity for developers to care about this layer of their stack.
Naturally, this is the easiest layer to outsource or centralize. Pure ops headcount has been on the decline since cloud adoption started picking up, and it’s obvious that this layer will continue to become more and more “invisible” to developers. We consider this largely a solved problem.
A lot of first-order tasks that used to require direct interaction with compute resources are now abstracted away by cloud primitives, opening up another category of infrastructure that sits a little closer to developer workflows. Object storage in cloud buckets is a good example, as are elastic load balancers. These were once very close to developer workflows, but they’ve found such a mass following that they have ended up influencing our behavior, and it won’t be long before these primitives become as much of a commodity as compute resources. (This is happening already: there’s not much difference between Google cloud buckets and Amazon cloud buckets, and you can use them interchangeably.)
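One way to see why these primitives are interchangeable: application code rarely needs more than put/get semantics, so it can be written against a thin, provider-agnostic interface. Below is a minimal sketch of that idea; the `ObjectStore` protocol, the in-memory backend, and `archive_report` are all hypothetical names, not any provider’s actual API. A real adapter would wrap an S3 or GCS client behind the same two methods.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal interface that S3-style and GCS-style buckets both satisfy."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend for illustration; a real one would wrap a cloud SDK client."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: ObjectStore, report: bytes) -> str:
    """Application code depends only on the interface, not the provider."""
    key = "reports/latest"
    store.put(key, report)
    return key
```

Because the application only sees `ObjectStore`, swapping one bucket provider for another is a configuration change rather than a code change, which is exactly what commoditization looks like from the developer’s seat.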
These systems don’t need a ton of attention either. They’re almost the same as compute/storage/networking. Even if you do need in-house experts to help developers use these components, you only need a few and they can scale to the entire org.
Some infrastructure, like specialized databases and search engines, is very sensitive to access patterns. How developers model and code their applications has a strong effect on how these systems behave. A one-size-fits-all approach is impossible, and developers always have to have some awareness of what they’re doing and how it impacts the underlying systems. These systems are the hardest to fully abstract away because they’re so closely tied to application code. Even fully managed databases like Amazon DynamoDB and RDS expose a number of tuning knobs precisely because developers need to customize them for their workloads.
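A toy illustration of why access patterns matter (plain Python, not any particular database’s API): the same question, “find the user with this email,” costs wildly different amounts depending on whether the data model anticipated it. This mirrors the difference between a full scan and a keyed lookup against a secondary index in a system like DynamoDB.

```python
from typing import Optional

# Hypothetical dataset standing in for a table of user records.
users = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]


def find_by_scan(email: str) -> Optional[dict]:
    # Analogous to a full table scan: touches every item, O(n) per lookup.
    for u in users:
        if u["email"] == email:
            return u
    return None


# Analogous to maintaining a secondary index: O(1) per lookup, but only
# because we chose email as a lookup key when modeling the data.
email_index = {u["email"]: u for u in users}


def find_by_index(email: str) -> Optional[dict]:
    return email_index.get(email)
```

Both functions return identical answers; the database can’t tell the developer which one their data model has committed them to. That judgment is exactly the part central teams can’t automate away.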
Centralizing expertise is only moderately effective for this type of infrastructure. Economies of scale can help take away truly redundant work like deployment automation, auto-scaling, and the setup of canned monitoring dashboards. But interpreting what those systems are doing under different conditions and understanding how they should be modified to meet a specific workload is not simply automatable. Ownership over this infrastructure takes deep knowledge of both infrastructure system internals and application internals.
The above categorizations are not static, and will certainly change. We can probably always categorize any infrastructure as high, medium, or low effort to manage, but commoditization happens from the bottom up. What’s hard today naturally gets abstracted away and is easier tomorrow. This is a natural outcome of the growth in popularity of any solution. Everyone used to have to know exactly how their computers worked to use them correctly. Then computers got so popular that we had to abstract that stuff away to reach a wider audience and solve bigger problems. Once that was abstracted away, everyone had to know how servers worked to build web and mobile applications. Again, we’ve abstracted most of that away with the cloud because developers demanded it (even some data services have multi-tenant cloud offerings that are very narrow in use-case support but much easier to use, requiring almost no ops awareness from developers). This is continuing with tools that let you build apps without writing any code. They’re not a silver bullet for everyone, but you can bet they will highlight a growing market that forces standards and commoditization at lower levels.
The things that seem hard today, like distributed data management and tuning serverless functions for scale, will probably get further abstracted, and we’ll move on to higher-order problems. The closer software development gets to solving domain problems, the less time we’ll spend on minute complexities, even the ones that didn’t seem like such a big inconvenience yesterday.