Zero Downtime Deployment with AWS ECS and ELB

As development teams push further toward continuous delivery, deploying updates to an application without disrupting users is becoming an increasingly sought-after practice. Amazon’s EC2 Container Service (ECS) makes that easier than ever with tight Elastic Load Balancer (ELB) integration.

Who Needs Zero Downtime Deployment?

The response to that question depends on who you ask. The most common answer used to be global websites with steady traffic twenty-four/seven or high-availability services with Service Level Agreements (SLA) that included guarantees about downtime. Everything that doesn’t fit in that box can theoretically just be deployed after hours with minimal user disruption.

As more teams move to continuous delivery/deployment with an emphasis on fast feedback, the desire grows to be able to deploy multiple times per day, in the middle of the day, and while users are active on the site. Other teams could just value their sleep. Regardless of the reason, deploying without a zero downtime process in the middle of the day will create noticeable outages for your users, damaging their confidence in your site or service. That is bad.

What Does Zero Downtime Deployment Look Like?

At a basic level, a zero downtime deploy involves swapping in servers running the new code for servers running the old code on a load balancer. Here is the general scripted process:

  1. Create a new Virtual Machine (VM) image with the new code.
  2. Start a number of VMs using that image, equal to the number currently running.
  3. Verify that each of these instances is running correctly and responding to checks.
  4. Add the new instances to the ELB while removing the old (with connection draining).
  5. Verify that everything is working properly. If not, swap the old instances back in for the new ones and diagnose the problem. If so, delete the old instances.
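
The steps above can be sketched as a small in-memory simulation. This is only an illustration of the control flow, not a real deployment script: `Instance`, `LoadBalancer`, `deploy`, and the `smoke_test` callback are hypothetical names, and no AWS calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    version: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    registered: list = field(default_factory=list)

    def swap(self, new, old):
        """Register the new instances, then deregister the old ones."""
        self.registered.extend(new)
        for instance in old:
            self.registered.remove(instance)


def deploy(lb, new_version, smoke_test):
    """Run steps 2-5: launch, verify, swap, then keep or roll back."""
    old = list(lb.registered)
    # Step 2: start as many new instances as are currently running.
    new = [Instance(f"{new_version}-{i}", new_version) for i in range(len(old))]
    # Step 3: abort before touching the load balancer if any check fails.
    if not all(instance.healthy for instance in new):
        return False
    # Step 4: put the new instances in service and take the old ones out.
    lb.swap(new, old)
    # Step 5: verify; on failure, swap the old instances back in.
    if not smoke_test(lb):
        lb.swap(old, new)
        return False
    return True
```

Note that the old instances are kept around until the final check passes, which is what makes the rollback in step 5 a quick swap rather than a rebuild.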

The process will look a little different if database changes are involved. Strict developer policies are needed to manage schema changes so that no deployment breaks the possibility of a rollback. As a general rule, never delete or change a column or table that is currently in use. If you do and there is a problem, you can’t roll back without restoring from backup. Hold off on that change until the following deploy.

Zero downtime database changes are a much more involved topic, and how much care a change needs can vary with something as simple as the load it puts on the database. But the general rule of enforcing backwards compatibility across deployments covers most of the bases. As long as nothing will break with old code and new code running side by side for a couple of minutes, you should be in the clear.
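
As a minimal sketch of that backwards-compatibility rule, the example below uses an in-memory SQLite database to show an additive change (a new nullable column) that lets the old and new releases write to the same table at once. The `users` table, the `email` column, and the two insert helpers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_code_insert(name):
    # The old release only knows about the original columns.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(name, email):
    # The new release also writes the column added by this deploy.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


# Additive migration shipped with the new release: a nullable column,
# so nothing the old code reads or writes is removed or changed.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# During the swap, both releases hit the same database.
old_code_insert("alice")
new_code_insert("bob", "bob@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

Dropping or renaming `name` in the same deploy would break `old_code_insert` while old instances are still draining, which is why that kind of change waits for a later release.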

How Is It Different with ECS?

ECS doesn’t deploy to individual virtual machines. It uses a cluster of a few instances and deploys Docker containers onto them via task definitions. However, the basic building blocks of a zero downtime deploy are the same. We need to start the new container, verify it’s running, and then swap it out on the load balancer. This matters for a cluster because it must have enough free resources to start the new containers while the old ones are still running. If the necessary resources aren’t available, you’ll see a note in the events console that looks like this:

service sample-webapp was unable to place a task because the resources could not be found.

If you do have those resources available, you’ll see a set of messages along these lines:

service sample-webapp has started 2 tasks: task TASK-ID-1 task TASK-ID-2.
service sample-webapp registered 2 instances in elb LOAD-BALANCER-NAME
service sample-webapp has begun draining connections on 2 tasks.
service sample-webapp stopped 2 running tasks.
service sample-webapp has reached a steady state.

The messages that you see are the work of ECS’s existing integration with Elastic Load Balancer, which executes those zero downtime deployments without you needing to intervene. All that’s necessary, if you don’t have the resources available, is to add instances to the cluster so that you do. That can be done by changing the desired instance count on an autoscaling group or by going directly to EC2 to add more instances to the cluster.
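
To get a rough idea of how many instances to add, you can estimate the cluster’s headroom before deploying. The sketch below is a simplification under stated assumptions: every task has the same CPU and memory reservation, a new instance offers its full capacity to tasks, and host port conflicts and placement constraints are ignored. The function name and parameters are hypothetical.

```python
import math


def instances_to_add(free, task_cpu, task_mem, new_tasks,
                     instance_cpu, instance_mem):
    """Estimate how many extra instances a deploy needs.

    free: list of (cpu_units, memory_mib) still unreserved on each
    existing instance in the cluster.
    """
    # How many new tasks fit into the gaps on the existing instances.
    placeable = sum(min(cpu // task_cpu, mem // task_mem) for cpu, mem in free)
    unplaced = max(0, new_tasks - placeable)
    if unplaced == 0:
        return 0
    # How many tasks one fresh, empty instance can hold.
    per_new_instance = min(instance_cpu // task_cpu, instance_mem // task_mem)
    if per_new_instance == 0:
        raise ValueError("task does not fit on this instance size")
    return math.ceil(unplaced / per_new_instance)
```

For example, `instances_to_add([(256, 300), (0, 1000)], 256, 256, 2, 1024, 1000)` finds room for only one of the two new tasks on the existing instances and reports that one more instance is needed.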

Try It Yourself

If you would like to step through this process yourself with Amazon’s sample-webapp, follow these steps:

  1. First, if you haven’t already, complete the Setting up with Amazon ECS process.
  2. Then step over to the ECS console’s first run wizard.
  3. Step through the ECS Getting Started guide to get the sample web app running.
  4. Hop over to the Task Definition and Create a Revision of the console-sample-app-static.
  5. Edit the JSON for “command” to change the HTML displayed in a noticeable way.
  6. Now go to Cluster, select your cluster.
  7. In the Cluster, click on your Service.
  8. In the Service, click Update and change the task definition to the revision you made in step 4. It will be indicated by the revision number next to it. Then click Update Service.
  9. In the Deployments tab you should be able to see the pending count and running counts change after a few seconds. Feel free to keep refreshing your browser tab that was pointed at the app to watch the transition.
  10. When you’re done, don’t forget to go through the Clean Up process.
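
Steps 4 through 8 can also be scripted. Here is a hedged sketch that assumes `ecs` behaves like `boto3.client("ecs")` and uses its `describe_task_definition`, `register_task_definition`, and `update_service` calls. The function name and its parameters are hypothetical, and only the container definitions and volumes are copied to the new revision, so a task definition that relies on other settings would need those passed along too.

```python
import copy


def deploy_new_revision(ecs, cluster, service, family, container_name, new_command):
    """Register a revision with a changed command and point the service at it.

    `ecs` is expected to behave like boto3.client("ecs").
    """
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    containers = copy.deepcopy(current["containerDefinitions"])
    # Change the command only on the named container (step 5).
    for container in containers:
        if container["name"] == container_name:
            container["command"] = new_command
    # Registering under the same family creates the next revision (step 4).
    registered = ecs.register_task_definition(
        family=current["family"],
        containerDefinitions=containers,
        volumes=current.get("volumes", []),
    )
    new_arn = registered["taskDefinition"]["taskDefinitionArn"]
    # Updating the service starts the rolling replacement (step 8).
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn
```

With real credentials you would pass `boto3.client("ecs")` as `ecs`, along with your own cluster, service, family, and container names. The Deployments tab from step 9 will show the same transition whether the update came from the console or from a script.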

That’s it!


Join the Discussion


  • William Olson

    So you are saying if I update the service to use a new task-definition-revision (and say I have desired-count=2, healthy-min-count=50%, and there are 2 currently running old-revision-tasks in my cluster) then it will kill the 2 old-revision-tasks and fire up 2 new-revision-tasks automatically?

    I thought you would have to scale your desired count up to start up the new-revision-tasks alongside the old-revision-tasks, then go manually kill the old-revision-tasks after removing them from the elb target groups and verifying new-revision-tasks are working properly.

    Can you clarify that updating a service with a new task-definition-revision will absolutely kill all old tasks and start new-revision-ones immediately upon clicking update?

    Great article but seems to be lacking in ecs specific stuff like how services auto add/remove tasks to and from elb target groups and what kind of desired / min / max count of tasks should be used before during and after a rolling deploy.

    • William Olson

      Okay I just tested it out and it looks like yes the service will boot new-revision-tasks and kill old-revision-tasks as long as the max desired count will allow it and there are matching nodes (that respect constraints etc.) with available cpu/mem requirements. (It will also remove the registered instance/port target of the old task and add the new one to elb’s target groups automatically as well all via service update — even if force new deployment checkbox is not ticked).

  • Frondor

    This is only with the ELB, but what if I’m using my own nginx container for the load balancing?
