Easy Blue-Green Deployments on Amazon EC2 Container Service


Amazon EC2 Container Service (ECS) is Amazon’s solution for running and orchestrating Docker containers. It provides an interface for defining and deploying Docker containers to run on clusters of EC2 instances.

The initial setup and configuration of an ECS cluster is not exactly trivial, but once configured it works well and makes running and scaling container-based applications relatively easy. ECS also has support for blue-green deployments built in, but first we’ll cover some basics about getting set up with ECS.

Within ECS, you create task definitions, which are very similar to a docker-compose.yml file. A task definition is a collection of container definitions, each of which has a name, the Docker image to run, and options to override the image’s entrypoint and command. The container definition is also where you define environment variables, port mappings, volumes to mount, memory and CPU allocation, and whether or not the specific container should be considered essential, which is how ECS knows whether the task is healthy or needs to be restarted.

You can set up multiple container definitions within the task definition for multi-container applications. ECS can pull from the official Docker Hub by default and can be configured to pull from private registries as well. Private registries, however, require additional configuration of the Docker client installed on the EC2 host instances.
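
For illustration, a minimal single-container task definition might look like the following (the family, image, and resource values here are hypothetical), registered with the AWS CLI:

$ cat > web-task-def.json <<'EOF'
{
  "family": "web-app",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "repo/web-app:latest",
      "essential": true,
      "memory": 256,
      "cpu": 256,
      "portMappings": [
        { "containerPort": 443, "hostPort": 443 }
      ],
      "environment": [
        { "name": "APP_ENV", "value": "production" }
      ]
    }
  ]
}
EOF
$ aws ecs register-task-definition --cli-input-json file://web-task-def.json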

Once you have a task definition, you can create a service from it. A service lets you define the number of task instances you want running and associate them with an Elastic Load Balancer (ELB). When a task maps to particular host ports, like 443, only one instance of that task can run per EC2 instance in the ECS cluster. Therefore, you cannot run more tasks than you have EC2 instances. In fact, you’ll want to run at least one fewer task than the number of EC2 instances in order to take advantage of blue-green deployments. Task definitions are versioned, and services are configured to use a specific version of a task definition.
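
As a sketch, creating such a service with the AWS CLI might look like the following (cluster, service, load balancer, and role names are hypothetical):

$ aws ecs create-service \
    --cluster my-cluster \
    --service-name web-app \
    --task-definition web-app:1 \
    --desired-count 2 \
    --load-balancers loadBalancerName=web-elb,containerName=web,containerPort=443 \
    --role ecsServiceRole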

What Are Blue-Green Deployments?

Blue-green, black-red, fuchsia-periwinkle: it really doesn’t matter what colors are used. The point is that there are two separate but equal environments.

At any given moment, your application is running on one of the environments while the other environment is either destroyed to conserve resources or sits idle waiting for the next update. This second environment allows you to deploy updates without interrupting the currently live environment. After the deployment is ready, the load balancer/router can be updated to route traffic to the new environment.

This concept is not new, but it has not been widely adopted due to the requirement of a second environment. Depending on the size and complexity of your application architecture, a second environment can be quite costly and difficult to manage. Utilizing Docker containers and a microservices architecture can help alleviate this challenge a bit. Using ECS for managing containers on EC2 can further ease this burden.

How Amazon ECS Handles Blue-Green Deployments

ECS facilitates blue-green deployments when a service is updated to use a newer version of a task definition. When you define a service, you set the number of tasks that should be running, and ECS ensures that number stays running, assuming the cluster has enough capacity. When the service is updated to use a new version of the task definition, ECS starts new tasks from the new definition, as long as there is spare capacity in the cluster. As the new tasks start, ECS drains connections from the old tasks and kills them off.

Looking at the most basic example, consider two EC2 instances in an ECS cluster and a service defined to run a single task instance. That task runs on just one of the EC2 instances. When the task definition is updated and the service is updated to use the new version, ECS starts a new task on the second EC2 instance, registers it with the ELB, drains connections from the first task, and then kills the old task.

As I mentioned earlier, you’ll want to have at least one more EC2 instance in the cluster than the number of tasks set in the service. If, in this basic example, we had two tasks running, one on each EC2 instance, there would be no spare capacity for ECS to start a new task, and a blue-green deployment could not happen. You would have to manually kill at least one of the tasks to start the process.

It is also worth noting that every time ECS starts a task, it pulls the Docker image specified in the task definition. So when you build a new version of your image and push it to a registry, the next task to start in ECS will pull that version. From a continuous integration and delivery standpoint, then, you just need to build your image, push it to the registry, and trigger a blue-green deployment on ECS for your updated application to go live.
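
As a rough sketch, that whole pipeline boils down to something like this (image, cluster, and service names are hypothetical, and the last command uses the ecs-deploy script introduced later in this article):

$ docker build -t registry.example.com/web-app:latest .
$ docker push registry.example.com/web-app:latest
$ ecs-deploy -c my-cluster -n web-app -i registry.example.com/web-app:latest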

Below is a series of diagrams that illustrate a simplified blue-green deployment process on ECS.

  1. To begin, we have a single service running two tasks, split between EC2 Instance 1 and EC2 Instance 2. [Diagram: tasks split between two EC2 instances]
  2. An updated task definition has been created, and the service has been updated to use it. ECS launches a new task on EC2 Instance 3 and begins draining connections from the previous tasks. [Diagram: ECS launches a new task]
  3. As connections drain from the existing tasks, ECS kills them one at a time and launches additional tasks until the desired count is met. [Diagram: ECS killing old tasks and launching new ones]
  4. When ECS has met the desired count of running tasks, it kills any remaining tasks still running the previous version of the task definition. [Diagram: desired task count met]

And that’s it. The updated version of the application is running in a new “green” environment. With ECS, the concept of separate blue and green environments is a bit virtual and fluid, but since containers are isolated, it really doesn’t matter.

ECS Deploy: A Simple and Elegant Way to Trigger Blue-Green Deployments

Triggering a blue-green deployment on ECS is quite simple: create a new version of the task definition (no changes required) and update the service to use the new version. Doing this manually every time you want to deploy is a bit of a nuisance, though, especially when nothing about the task definition needs to change.
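
To see why, here is roughly what the manual process looks like with the AWS CLI and jq (cluster and service names here are hypothetical):

# fetch the service's current task definition
$ TASK_DEF=$(aws ecs describe-services --cluster my-cluster --services web-app \
    | jq -r '.services[0].taskDefinition')
# keep only the fields register-task-definition accepts and register a new revision
$ aws ecs describe-task-definition --task-definition "$TASK_DEF" \
    | jq '.taskDefinition | {family, containerDefinitions, volumes}' > new-task-def.json
$ NEW_REVISION=$(aws ecs register-task-definition --cli-input-json file://new-task-def.json \
    | jq -r '.taskDefinition.taskDefinitionArn')
# point the service at the new revision to kick off the blue-green deployment
$ aws ecs update-service --cluster my-cluster --service web-app \
    --task-definition "$NEW_REVISION"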

As a development team we like to operate a continuous integration and delivery process that allows us to easily trigger deployments by merging code against appropriate branches in a git repo. A merge against develop means it should be deployed to our staging environment, and a merge against master means it should be deployed to production. We don’t want any further manual processing other than the merge and push to git.

Our continuous integration process clones our repo, builds our Docker images, executes unit tests against the image, pushes the image to our private registry (which runs on ECS), and finally triggers a blue-green deployment on ECS. When we looked for ways to trigger the update/deployment on ECS, the options were complicated. We could have used Amazon’s CodeDeploy or Elastic Beanstalk, but those required a different build process that did not match what we were already running in CI.

Since all that is required to trigger a blue-green deployment is an update to the task definition and service, we wrote a shell script that takes a few parameters and then works with the AWS command line tools to fetch the current task definition, create a new version from it, and update the service to use it. It works quite well and is very simple. After triggering the update, it monitors the service to be sure it is running the updated version before exiting. If it sees the new version running, it will exit with a normal zero status code; otherwise it exits with an error. This way, our CI/CD process knows whether or not deployment was successful, and we can be notified of failed deployments.
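
A simplified sketch of that verification step (not the script’s actual code) might poll the service until a task from the new revision is running, reusing the NEW_REVISION variable from the manual example above:

for attempt in $(seq 1 30); do
  # ask ECS how many tasks of the new revision are running
  RUNNING=$(aws ecs describe-services --cluster my-cluster --services web-app \
    | jq -r --arg td "$NEW_REVISION" \
      '.services[0].deployments[] | select(.taskDefinition == $td) | .runningCount')
  if [ "${RUNNING:-0}" -ge 1 ]; then
    echo "New task definition is running"
    exit 0
  fi
  sleep 10
done
echo "Deployment verification timed out"
exit 1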

By the way, our script is available as open source under the MIT license.

ecs-deploy is available both as a shell script and as a Docker image. The script relies on utilities like sed that do not behave the same on Linux and macOS, not to mention Windows, so using the Docker image may give you more consistency.
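
For example, assuming your AWS credentials and region are already exported in your shell, you can run the image directly; its entrypoint is the script itself, so the arguments are the same (values here are placeholders):

$ docker run --rm \
    -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION \
    silintl/ecs-deploy:latest \
    -c clusterName -n serviceName -i repo/name:tag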

Requirements for ecs-deploy

ecs-deploy makes use of other software to perform its work. Notably, it uses the AWS CLI, commonly installed via pip by running pip install awscli, and jq, a command-line JSON parser.

While the script does not require you to set any environment variables, it is highly recommended that the AWS API credentials be set this way in order to keep them out of your shell history and process list. The AWS region can also be set via environment variable in order to keep command line options to a minimum.
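
For example (placeholder values shown):

$ export AWS_ACCESS_KEY_ID=AKIA...
$ export AWS_SECRET_ACCESS_KEY=...
$ export AWS_DEFAULT_REGION=us-east-1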

Using the shell script

If you’ve cloned the repo or downloaded the ecs-deploy script into your path, you can run it to see the full usage options. Here’s an example:

$ ecs-deploy -c clusterName -n serviceName -i repo/name:tag

That example assumes you’ve configured environment variables for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.

Using the Docker image

If you don’t want to install jq and the AWS CLI (or dependent Python tooling like easy_install), you can just run the Docker image.

The best way to use the Docker image is to clone the ecs-deploy project repository and use the docker-compose.yml configuration provided. By using Docker Compose to run the image, you can provide the AWS-related environment variables via a file to keep them out of the command line arguments. When you clone the repository, copy the local.env.dist file to local.env and add your credentials into the file. Then you can use docker-compose run ecsdeploy to run the image.

The Docker image uses the entrypoint of the ecs-deploy script, so you just need to provide the arguments in the same way as you would for the shell script. Here is an example:

$ git clone https://github.com/silinternational/ecs-deploy.git
$ cd ecs-deploy/
$ cp local.env.dist local.env
# edit local.env to add your credentials and default region
$ docker pull silintl/ecs-deploy:latest
$ docker-compose run --rm ecsdeploy \
    -c clusterName -n serviceName -i repo/name:tag

If you want to incorporate ecs-deploy into a docker-compose project of your own, you can just add another service with this:

ecsdeploy: 
  image: silintl/ecs-deploy:latest 
  env_file: 
    - local.env

Be sure to have an env_file configured with your AWS credentials for safe operation.
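
Assuming local.env.dist uses the same variable names shown earlier, a filled-in local.env would look something like this (placeholder values):

AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=us-east-1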

In Conclusion

Blue-green deployments provide a great way to minimize production impact during a release, and Amazon’s EC2 Container Service simplifies many of the complexities involved. I recognize that our use case is relatively simple and that larger, more complex applications may not be as easy to deploy this way, but it is absolutely worth investigating. The comfort we have in automating deployments triggered by code changes has changed our behaviors and development processes for the better. It makes us much more agile, and our developers are happier without extensive build and release procedures.

We have found our ecs-deploy script to be very helpful, easy to use, and reliable for deployments, and I hope you can benefit from it too. We’d appreciate your input on improving it and welcome pull requests for new features. Post your comments and questions below to keep the conversation going.

Join the Discussion

  • Whilst it complicates things quite a bit, we can actually define tasks to use less than a 100% vCPU share: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html

    I completely agree that it’s a great idea to have tasks spread across multiple EC2s, but it doesn’t necessarily have to be 1 Task per 1 EC2.

    • pshipley

      @jokeyrhyme:disqus thank you for your comments, and you are right, it doesn’t have to be one task per EC2 instance. We run several tasks per instance, but only one task per service per instance, if that makes sense. With ELBs you configure listeners, which listen on a single port and can send traffic to a single port across multiple EC2 instances, so we can only have a single task bind to a given port per EC2 instance. For organization purposes we have decided to designate blocks of ports to services. For example, our service named Doorman has the 17000 block assigned, so its containers can bind to 17080, 17022, 17110, etc. as needed on any EC2 instance within our ECS cluster, and we can configure the ELB to listen on 443 and send traffic to 17080 on any instance where it is available. So in one ECS cluster we may have 5 EC2 instances and run 10 services across them. Some instances have several services represented with individual tasks for each, while other instances may only have a single service/task running when a task requires the full CPU/memory allocation. This all gets a bit complex and confusing though, so I kept the examples above as simple as possible.

  • Paul Seiffert

    Good read, thanks for writing this up!

    From my perspective, the approach you’re describing is more of a Rolling-Update deployment instead of Blue-Green. Most people (and literature) refer to Blue-Green deployments as a deployment process which first sets up a complete second instance of the service and then makes one switch on hardware / load-balancer level.
    We’re working with AWS ECS as well and I can tell that Blue-Green deployments aren’t that easy with ECS (because we tried it, failed, and decided to switch to Rolling update ;-) ). Our deployment process first involved creating a new ECS service for every new software version, wiring it with a new ELB (bad bad practice!) and then making the switch via DNS (even worse practice!). This approach wasn’t the right one because 1) we didn’t reuse ELBs and 2) switching via DNS is unpredictable.

    Making the switch via the load balancers is not really an option with AWS ECS because ELBs can’t balance the same incoming port to different instance ports which would be required if you don’t want to have twice as many EC2 instances as max(#tasks of any service).

    We ended up doing rolling updates as well. Our tool that deploys new versions also registers a new task definition and calls update on the ECS service. One difference you might want to think about is that we wait not only for one running task of the new task definition but for the whole update to be completed. This ensures that your cluster provides enough resources for the complete new deployment. You can check whether the update completed successfully by calling describe-service and counting the deployments included in the response. During an update, there are two deployments. Once it is finished, there should only be one deployment left.

    • pshipley

      That is great @Paul Seiffert:disqus, thanks for the suggestion on counting the tasks. For your comment about the ELBs balancing across different ports, I saw a great presentation at DockerCon from a guy using Interlock, https://github.com/ehazlett/interlock, to dynamically configure haproxy to distribute load over multiple tasks on the same EC2 instance bound to different ports. It depends on the use of Docker Swarm though so not directly useful with ECS, but quite interesting if that is of significant value for you.

    • Ari Pringle

      I’m curious about rolling upgrades and the implications of having old and new versions of the application running simultaneously. For example, if running 4 live instances normally, with 1 spare instance for rolling upgrades, you will have a period of time where some instances will be running the old version of the application, and some running the new. This may be fine for APIs where only a single request is made, but if the application has multiple resources that need to be pulled, the user could end up pulling mixed versions of resources and get a broken application during this window. Any suggestions on dealing with this? I’m wondering if enabling sticky sessions on ELB would take care of this by making sure a user doesn’t pull resources from old & new simultaneously.

      • Paul Seiffert

        Yes, sticky sessions will help you with that.

      • magheru_san

        What we do in addition to sticky sessions is we keep that kind of stuff outside the instances as much as possible.

        For example all our static content for all our previous versions is stored in S3, also cached through CloudFront, and each new software version only will point to the content it needs.

        • pshipley

          We do the same by pushing static assets to S3 and we also version them, so css files get named something like styles.aj23ca1.css or whatever as part of the build process.

          The main issue I see with multiple versions running simultaneously would be if you make backward compatibility breaking changes to your database and then the older version could not run. In that case you’d need to tweak the deployment process a bit to account for it.

  • @pshipley:disqus Thanks for the article. How do you define your different environments? I am new to ECS. I had previously been deploying my Docker containers to EC2 with Ansible, and I just used a different EC2 instance for each environment. In ECS, do you define a whole different cluster for each environment? My app is small (three containers: Rails, Postgres, nginx) and can run comfortably on one t2.small. Would I have a different cluster for each environment, with one box each? Seems not Docker-like. I’d be much obliged for any advice.

    • pshipley

      Hi @disqus_NjBeUOI5JP:disqus , I’m glad you found this helpful. We do have separate ECS clusters for staging and production and in some cases separate production clusters for specific apps where we want extra security. The general principle for containers and environments is to group “like-security” containers in the same cluster. So if you have several apps with similar security concerns it is probably okay to run them on the same ECS cluster. However if you have some high security applications you may want separate dedicated ECS clusters for them.

      So, to your question about your app, you could possibly run them both on the same single-box ECS cluster. You could even shut off your test/staging service (set desired tasks to zero) when you’re not actively developing/testing it. It IS possible for containers to affect each other, though, so you’ll have to decide if you can take the risk of a test/staging container taking resources away from a production container. On ECS, the CPU limit you specify is really just a scheduling number; if the container needs more and more is available, ECS will allow it to use more.

      Hope that helps, let me know if I’ve created more confusion :-)

      • Not at all, thanks for the reply. I may have misspoken above: I can run all three _containers_ on one box, but we would certainly only have one _environment_ per box, to avoid having staging and prod compete for resources.

        But that was where I was struggling. In my EC2-only setup, I just had one EC2 instance per environment (one for staging, one for prod, one for validation), and I could target one instance or another by specifying it in my Ansible call. But ECS only lets me target a cluster and not an individual box. So, if I want to keep deploying my app onto a single box, it looks like I need to create one cluster for each environment (prod, staging, etc.) and assign one box to each cluster.

        Does that sound like the correct next step? Thanks for all your help!

        • pshipley

          Yep, that sounds good. I believe, though, that with only a single box and a single task running, you will not be able to do blue/green, because there is no spare capacity to run the new version on before killing the old. So you’d have downtime with each update, because you’ll have to stop the current task before launching the new one, but that typically takes less than a minute, so maybe it’s acceptable for you.

          • Right, no blue/green, alas, which is why we’re here in the first place! :) Here’s a question though: if I did blue/green, how would I persist my postgres data from one instance to the other? Or would I only blue/green the app and not the database? Kind of wandering far afield of the original topic but I thought I’d ask. :)

          • pshipley

            Ahh, data persistence, the big one. It’ll cost a little more, but I’d move the DB to RDS and let them manage persistence for you, then downsize your EC2 instance to a nano, get a second one so you have redundancy, and configure your service with a minimum healthy percentage of 50% so that ECS can perform the blue/green deployment for you :-) I’m all about finding ways to scale down and save money, but DB persistence is important enough to pay a little more to have it managed well for you. Check out my other article about scaling down: https://blog.codeship.com/non-profit-case-docker/

          • You’re awesome, thanks so much!

  • Raul Guerrero

    This is a very nice article, with a good design and explanation of how to avoid downtime while updating your ECS services.

    Now, I’m using Elastic Beanstalk, which automatically sets up ECS plus a bunch of other stuff, how do I achieve the same thing but through EB instead of directly accessing ECS?

    Thanks!

  • Justin Menga

    This is not blue/green but rolling as has been stated in the comments below. The main reasons for blue/green deployments come from the traditional world of “mutable” infrastructure where you modify the state of a server whenever you deploy a new application, and history has shown it’s risky to roll back, so you stand up new servers with the new release and then cut over/rollback in a single operation.

    In an immutable infrastructure world of containers (and, given that ECS task definitions are immutable, the environment configuration is locked in as well), rolling deployments have virtually no risk. The only risk comes with ensuring you have sufficient resources to deal with temporary surges in containers, but ECS service deployment thresholds give you some control over this.

    Combine this with continuous delivery (i.e., smaller, more frequent releases = lower risk) and the requirement for the traditional blue/green switchover is removed; rolling deployments are a simpler and equally effective mechanism.

    All of the above assumes you are adopting n+1 backwards compatibility, which is required regardless of blue/green or rolling.

    • Sergej Jevsejev

      Yep, this is more like a rolling update, not a true blue-green deployment.

      Sadly, the title here sounds like clickbait :(