At the Post we have adopted a microservice architecture to keep our teams moving fast and focused on building exciting reader features. Unfortunately, microservices come with their own complexity, including an increase in servers and deployment frequency. These additional moving parts can exacerbate configuration drift in production, leading to a configuration management nightmare and ultimately reader-facing problems.

Configuration Drift

Configuration drift happens when servers are updated over long periods (months, years) and the configuration files and software packages diverge among them, resulting in a heterogeneous environment. Configuration management tools like Chef and Puppet can help, but they don’t completely eliminate the problem. To see why, consider what happens when a configuration management tool only applies additive changes, like adding a software package (sketched after the list below).

  1. Server #1 is launched
  2. Package B is installed by configuration management
  3. Package B is removed from configuration management
  4. Server #2 is launched
  5. Package B is missing on Server #2 but still installed on Server #1: the two servers have drifted
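
Sketched as a shell script standing in for a full Chef or Puppet run (the script and package names are just illustrative), the problem looks like this:

    # provision.sh, version 1 -- run on Server #1 at launch
    apt-get install -y package-a
    apt-get install -y package-b

    # provision.sh, version 2 -- the package-b line has simply been deleted
    apt-get install -y package-a

    # Server #1 still has package-b because nothing ever uninstalled it;
    # Server #2, launched with version 2, never gets it. The servers have drifted.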

Immutable Machine Images

One of the best methods for addressing configuration drift is the golden machine image strategy. The golden image is a complete operating system that includes both the system packages and the custom software packages. By versioning this machine image and never changing it after creation, it becomes an immutable building block that simplifies configuration management and software deployments. Every configuration change, from a security patch to a custom software update, is deployed in the same way: by building a new machine image. This machine image can then be rolled out gradually by booting up servers with the new image. Because each image contains a complete, immutable copy of all files, a rollback is quick and painless.

On AWS the standard process for applying golden machine images is to build a new AMI and then launch a new EC2 server from this AMI. While this strategy works well, it does have one major weakness: a single-line code change now takes 20-30 minutes to deploy to production. This effectively caps the velocity of change at about 10 deployments a day. While that is a significant improvement for most teams, it is both desirable and possible to do much better.
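
As a rough sketch, that AMI workflow looks something like this with the AWS CLI (the instance IDs, AMI IDs, image names, and instance type are placeholders, not our actual values):

    # Bake a new, versioned golden image from a fully configured builder instance
    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name "reader-app-golden-v42"

    # Roll it out by launching replacement servers from the new AMI
    aws ec2 run-instances \
        --image-id ami-0abc1234def567890 \
        --instance-type m4.large \
        --count 2

    # Rolling back is just launching servers from the previous image version
    aws ec2 run-instances \
        --image-id ami-0fed9876cba543210 \
        --instance-type m4.large \
        --count 2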

Docker to the Rescue

Docker takes the golden image concept and adds several key optimizations. First, Docker divides the image into immutable layers. Each time the image is changed, only the layers that actually contain new files are updated. This dramatically reduces the amount of data transferred and the time needed to deploy small changes. Second, Docker supports tags, which can be used to version images. By combining these two features, it is possible to deploy a single-line code change to all production servers in less than a minute. This allows teams to move faster with less risk.
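
A minimal sketch of that workflow (the registry and image name are hypothetical): the first build and push uploads every layer, while a later one-line code change only rebuilds and uploads the small layer that holds the application code.

    # Build and push the image under a version tag; every layer is uploaded
    docker build -t registry.example.com/reader-app:1.0.0 .
    docker push registry.example.com/reader-app:1.0.0

    # After a one-line code change, rebuild and bump the tag.
    # Unchanged layers (base OS, dependencies) come from the build cache,
    # so only the top layer containing the new code is rebuilt and pushed.
    docker build -t registry.example.com/reader-app:1.0.1 .
    docker push registry.example.com/reader-app:1.0.1

    # Production servers pull only the changed layer when fetching the new tag
    docker pull registry.example.com/reader-app:1.0.1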

At the Washington Post we use Docker as the backbone of our continuous deployment system. On a recent project one team deployed over 60 different Docker image versions over a 3-day period to multiple servers, with most deployments taking 2-3 minutes from commit to running in production. This type of agility shortened the project timeline and allowed the team to make smaller, lower-risk changes.

How We Do It

We have built two different systems which use Docker for deployment. The first system is used for stateful data services like Memcache, Kafka, and Elasticsearch. This system launches a base AMI and uses the EC2 user data feature to run a cloud-init script which installs Docker and a small number of static Docker images. Once the server is launched we never change any software or configuration. When updating packages or configuration, we build a new cloud-init script with new Docker image versions and use it to launch replacement EC2 instances. By launching new instances with every change we unify our deployment process and test our recovery procedures at the same time. If a server behaves strangely we can easily terminate it and replace it with a new, identical server.
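
A simplified sketch of what such a user data script might look like (the image, version, and ports are illustrative, not our actual configuration):

    #!/bin/bash
    # Passed as EC2 user data; cloud-init runs it once on first boot.

    # Install Docker on the base AMI
    curl -fsSL https://get.docker.com | sh

    # Start the pinned, versioned images for this server's role.
    # Changing a version means writing a new user data script and launching
    # replacement instances, never modifying this server in place.
    docker run -d --name elasticsearch --restart=always \
        -p 9200:9200 -p 9300:9300 \
        elasticsearch:1.7.3

Replacement instances are then launched with the new script passed as user data (for example, via the --user-data flag of aws ec2 run-instances).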

While this system works great, it suffers from the same slow startup problem as building new AMIs, although it takes about half as long (10-15 minutes) from start to finish. For slow-moving packages like Elasticsearch, this is generally not a problem since we don’t update them every day. For custom software that we want to deploy many times per day, even this approach is insufficient. For these types of deployments we built a custom Docker orchestration system called Nile. Nile deploys new Docker images to running EC2 servers, with an average single-line code change taking 2-3 minutes to reach production. We will cover Nile in more detail in another blog post.
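
The basic pattern such a system automates on each running host is roughly the following (the registry, image name, container name, and port are placeholders; this is an illustration of the general approach, not Nile’s actual implementation):

    # Fetch the new image version; only the changed layers are downloaded
    docker pull registry.example.com/reader-app:1.0.1

    # Swap the running container for one built from the new image
    docker stop reader-app && docker rm reader-app
    docker run -d --name reader-app -p 8080:8080 \
        registry.example.com/reader-app:1.0.1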