Elastic recently launched new Kubernetes helm charts into alpha. While testing them we found seven missing config settings that would have caused problems for our production deployment. In this post, we identify and explain the missing settings and how to set them correctly. Steps to reproduce are at the end of the post.
What we found when testing Elastic’s helm charts
We’ve been running Elasticsearch in Kubernetes for ourselves and our customers for years. When we saw that Elastic had announced an alpha release of their helm charts, we were curious to see how they chose to address several common challenges that arise from running Elasticsearch in a containerized environment. We followed the instructions in the README (our steps to reproduce are at the end of this post) and deployed.
Config validation is just part of the work we’re doing to help database operators level the playing field when it comes to running complex distributed systems. The first thing we did was plug in our collector to validate that the topology and configuration settings adhere to best practices.
As soon as we deployed the Elasticsearch cluster, we got the following alert from our system:
There’s a lot going on there so let’s break it down:
1. Reported Minimum Master Nodes inconsistent
A “split brain“, the accidental creation of separate clusters from the nodes you expected to form a single cluster, is one of the most disastrous scenarios an Elasticsearch user can encounter.
Setting discovery.zen.minimum_master_nodes to a quorum (a strict majority) of your master-eligible nodes is the way to avoid this. Having nodes in your cluster that don’t agree about what that number should be is a sign of problems in your configuration management that should be resolved right away.
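As a quick illustration of the quorum arithmetic, here is a minimal Python sketch. The function name is hypothetical and not part of any Elastic tooling; the formula (a strict majority of master-eligible nodes) is the one documented for discovery.zen.minimum_master_nodes.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Print the quorum for a few common cluster sizes.
for count in (1, 2, 3, 5):
    print(count, "master-eligible ->", minimum_master_nodes(count))
```

Note that a two-master cluster needs both masters present to form a quorum, which is why three master-eligible nodes is the usual minimum for a cluster that must survive losing one.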
2. Bootstrap Memory Lock Consistent
Performance in Elasticsearch depends heavily on memory management. If the operating system is allowed to swap Elasticsearch’s heap memory to disk, your search performance is going to stall while necessary information is retrieved back from disk. To ensure this doesn’t happen, your systems should be configured to disallow this swapping, so we alert on inconsistent setting of the memory lock across the cluster.
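One way to check this by hand is to ask each node whether it actually locked its memory. The sketch below assumes you have already fetched and parsed the JSON returned by `GET _nodes?filter_path=**.mlockall`; the helper name and the sample payload are hypothetical, and this is not how our collector is implemented.

```python
from typing import Any, Dict, List


def nodes_without_memory_lock(nodes_response: Dict[str, Any]) -> List[str]:
    """Return the IDs of nodes whose process did not report mlockall=true."""
    unlocked = []
    for node_id, info in nodes_response.get("nodes", {}).items():
        # A missing "process" or "mlockall" field is treated as "not locked".
        if not info.get("process", {}).get("mlockall", False):
            unlocked.append(node_id)
    return sorted(unlocked)


# Hypothetical response: one node locked its heap, the other did not.
sample = {
    "nodes": {
        "node-a": {"process": {"mlockall": True}},
        "node-b": {"process": {"mlockall": False}},
    }
}
print(nodes_without_memory_lock(sample))
```

An empty result means every node reported a locked heap; any node IDs returned are the ones to investigate.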
3-7. Gateway Expected Data Nodes, Gateway Expected Master Nodes, Recover After Data Nodes, Recover After Master Nodes, Recover After Time
Following a full cluster restart, it’s important that any master-eligible node (which may be elected to become the source of truth on cluster topology and index distribution) has the latest copy of that topology and index distribution information from before the shutdown. Collectively, these settings control when the cluster begins recovering indices from disk: how many master and data nodes are expected, the minimum number that must be present, and how long to wait for the rest. Configured appropriately, each node will attempt to recover those shards that the latest cluster state indicates were allocated to that node. For this reason, it’s best practice to provide these settings uniformly to each node via configuration files.
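To make the interaction between these five settings concrete, here is a simplified Python model of the decision they drive. It is a sketch of the documented behavior, not Elasticsearch's actual implementation; the class, the function, and the `waited_s` parameter (seconds elapsed since the recover_after thresholds were met) are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GatewaySettings:
    expected_master_nodes: int
    expected_data_nodes: int
    recover_after_master_nodes: int
    recover_after_data_nodes: int
    recover_after_time_s: int  # gateway.recover_after_time, in seconds


def should_start_recovery(s: GatewaySettings, masters: int, data: int,
                          waited_s: int) -> bool:
    """Decide whether index recovery may begin after a full restart."""
    # Hard floor: never recover with fewer nodes than recover_after_*.
    if masters < s.recover_after_master_nodes or data < s.recover_after_data_nodes:
        return False
    # Every expected node has joined: recover immediately.
    if masters >= s.expected_master_nodes and data >= s.expected_data_nodes:
        return True
    # Otherwise wait out recover_after_time for the stragglers.
    return waited_s >= s.recover_after_time_s


settings = GatewaySettings(3, 3, 2, 2, 300)
print(should_start_recovery(settings, masters=3, data=3, waited_s=0))
print(should_start_recovery(settings, masters=2, data=2, waited_s=10))
```

In this model a fully assembled cluster recovers at once, while a partially assembled one waits for the timeout, which gives late-starting nodes a chance to rejoin before shards are reallocated.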
With that information in hand, we added the missing configuration options to the helm chart and were confident that the resulting cluster was configured according to best practices.
Steps to reproduce
Deploying with helm
We used Elastic’s helm charts in alpha. We started with the README, configuring the initial values for three master nodes and three data nodes. That required two values override files, one for each set of node types (see the “multi” example for details):
accessModes: [ "ReadWriteOnce" ]
accessModes: [ "ReadWriteOnce" ]
Just a few helm commands later, we had a functioning ES cluster:
helm template elasticsearch --values master.yaml | kubectl apply -f -
helm template elasticsearch --values data.yaml | kubectl apply -f -
Installing the Vorstella Plugin
Below is the Dockerfile for our Elasticsearch container. Note how we’re installing the Vorstella collector (you need an API key to use Vorstella; just contact us and we’ll set you up ASAP).
RUN set -e && \
    elasticsearch-plugin install --batch https://downloads.vorstella.com/vorstella-es-metrics-plugin-2.3.2.zip
Setting the missing configuration values
We added the following configs to the helm chart to align the configuration with best practices:
- name: discovery.zen.minimum_master_nodes
- name: bootstrap.memory_lock
- name: gateway.expected_master_nodes
- name: gateway.expected_data_nodes
- name: gateway.recover_after_master_nodes
- name: gateway.recover_after_data_nodes
- name: gateway.recover_after_time
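As an illustration only, the Python sketch below derives one plausible set of values for these seven settings from the node counts and prints them as name/value entries. The values are assumptions for a three-master, three-data-node cluster (majority thresholds and a five-minute wait), not a transcript of our override files, and the function name is hypothetical; tune them for your own topology.

```python
from typing import Dict, List


def recommended_env(masters: int, data: int) -> List[Dict[str, str]]:
    """Build name/value entries for the seven settings (assumed values)."""
    settings = {
        # Strict majority of master-eligible nodes.
        "discovery.zen.minimum_master_nodes": masters // 2 + 1,
        "bootstrap.memory_lock": "true",
        # Recover immediately once every node is back.
        "gateway.expected_master_nodes": masters,
        "gateway.expected_data_nodes": data,
        # Assumed floor: a majority of each node type.
        "gateway.recover_after_master_nodes": masters // 2 + 1,
        "gateway.recover_after_data_nodes": data // 2 + 1,
        # Assumed wait for stragglers.
        "gateway.recover_after_time": "5m",
    }
    # Kubernetes env values must be strings, so convert everything.
    return [{"name": k, "value": str(v)} for k, v in settings.items()]


for entry in recommended_env(masters=3, data=3):
    print(f"- name: {entry['name']}")
    print(f"  value: \"{entry['value']}\"")
```

Generating the entries from the node counts keeps the master and data override files consistent with each other, which is the property the alerts above were checking.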