Deploying a new version of a product to production is one of the most critical moments in the Software Development Lifecycle. It can range from the sheer excitement of releasing the latest features to a nightmare of cascading failures and outages. In this series of posts, we will explore deployment techniques that can be used to deploy a new version of an application without causing disruption to end users.
It is time to deploy a new version of the product. All the development work is done, the tests have passed and the stakeholders have approved it. Now we need to schedule a maintenance window in the middle of the night on a weekend, stop traffic to all servers, update the software following a lengthy procedure with lots of manual steps, restore traffic to the servers and hope everything works.
Sounds risky, right? This has been the standard way of doing deployments for a long time. Fortunately, there are better ways. In this Zero Downtime Deployment Techniques series of posts, we highlight three of the most common ways of deploying software without downtime. Our last post covered the Rolling Updates technique. This post covers Blue-Green deployments, and the next one will cover Canary deployments.
A Blue-Green deployment is a relatively simple way to achieve zero-downtime deployments: you create a new, separate environment for the version being deployed and switch traffic to it. A rollback is just as easy, since it only requires switching traffic back to the old version. As with any deployment technique, it has advantages and disadvantages, which we cover in the next sections.
How it Works
In a Blue-Green deployment, the current version of an application, already in production, is running on top of an environment which we call “Blue”. When it is time to release a new version, it is first deployed in a different, separate environment called “Green”, as identical as possible to the Blue environment. After the new environment is validated, 100% of the traffic is redirected to it and the old environment eventually becomes idle. In case of critical errors, the deployment can be rolled back by redirecting traffic back to the Blue environment without major impediments. If no problems are identified, the Blue environment is terminated and the deployment procedure is complete.
A Blue-Green deployment with 3 instances on each environment
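The switch and rollback described above can be sketched as a toy router model. This is only an illustration, not a real load balancer: the `BlueGreenRouter` class, its method names, and the instance addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends 100% of traffic to one environment."""

    environments: dict[str, list[str]]  # environment name -> instance addresses (hypothetical)
    active: str = "blue"
    previous: Optional[str] = None

    def switch_to(self, target: str) -> None:
        """Redirect all traffic to `target` and remember where it came from."""
        if target not in self.environments:
            raise ValueError(f"unknown environment: {target}")
        if target == self.active:
            return
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """Send traffic back to the environment that was active before the last switch."""
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.switch_to(self.previous)

    def terminate_previous(self) -> None:
        """Drop the idle environment; after this, a rollback needs a fresh deployment."""
        if self.previous is not None:
            del self.environments[self.previous]
            self.previous = None

    def backends(self) -> list[str]:
        """Instances that currently receive traffic."""
        return self.environments[self.active]


router = BlueGreenRouter({
    "blue": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "green": ["10.0.1.1", "10.0.1.2", "10.0.1.3"],
})
router.switch_to("green")   # release: all traffic goes to Green
router.rollback()           # critical error found: all traffic goes back to Blue
```

The model keeps the old environment around until `terminate_previous()` is called, which mirrors the window in which a rollback is just a traffic switch.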
After traffic switches to the Green environment, there may still be sessions or keep-alive connections on the Blue environment. There are several approaches to draining them, but the best practice is to wait for the sessions to terminate naturally instead of abruptly closing them during the turnover.
One of the main reasons companies choose the Blue-Green deployment strategy is that it makes it easy to roll back to previous versions. If a problem is identified in the Green version before the Blue environment is terminated, traffic can simply be redirected back to Blue. If the infrastructure for the Blue version has already been terminated, a new deployment can be triggered using the old version.
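Waiting for sessions to end naturally usually still needs an upper bound. The following is a minimal sketch of such a drain loop, assuming the load balancer or application exposes some way to count open connections; the `active_connections` callable and the timeout values are hypothetical.

```python
import time
from typing import Callable


def drain(
    active_connections: Callable[[], int],
    timeout_s: float,
    poll_interval_s: float = 1.0,
) -> bool:
    """Wait until the old environment has no open connections.

    Returns True if it drained naturally, False if the timeout expired first.
    `active_connections` is an assumed hook that reports the current count.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if active_connections() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Simulated connection counts reported by the Blue environment over time.
samples = iter([3, 1, 0])
drained = drain(lambda: next(samples), timeout_s=5, poll_interval_s=0)
```

What to do when the function returns `False` is a policy decision: extend the window, or accept closing the remaining connections.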
Possible Challenges
When implementing this technique, there can be some challenging situations depending on the nature of your workload. In this section, we will list common problems and possible solutions.
When doing a Blue-Green deployment, you need to keep two separate copies of your infrastructure up at the same time, at least for the duration of the deployment and possibly longer, to allow old sessions to drain or to keep a quick rollback available. Having two copies of your production environment can be very costly, especially if you use Virtual Machines, since they can take more time to start and stop.
Similar to the challenge with Rolling Updates, if Blue-Green is used to deploy stateful applications, transient information that is stored in the instance (like user sessions, cached files, etc.) might be lost when traffic switches to the Green environment.
If it is not feasible to store this information outside of the instance, a possible solution is to keep the Blue environment up until it finishes processing ongoing requests or sessions. New requests and sessions will be routed to the Green environment.
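A minimal sketch of that routing rule follows, assuming requests carry a session identifier the router can inspect. The `SessionAwareRouter` class and its methods are hypothetical names used only for illustration.

```python
class SessionAwareRouter:
    """Keeps pre-switch sessions on Blue and sends every new session to Green."""

    def __init__(self) -> None:
        self.blue_sessions: set[str] = set()  # sessions that started on Blue
        self.switched = False

    def switch(self) -> None:
        """From now on, sessions not already known to Blue go to Green."""
        self.switched = True

    def route(self, session_id: str) -> str:
        if not self.switched:
            self.blue_sessions.add(session_id)
            return "blue"
        return "blue" if session_id in self.blue_sessions else "green"

    def end_session(self, session_id: str) -> None:
        self.blue_sessions.discard(session_id)

    def blue_is_idle(self) -> bool:
        """True once the switch happened and no Blue session is left."""
        return self.switched and not self.blue_sessions


router = SessionAwareRouter()
router.route("alice")        # starts on Blue
router.switch()
router.route("alice")        # stays on Blue, so in-instance state is preserved
router.route("bob")          # new session goes to Green
router.end_session("alice")  # Blue can now be terminated
```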
Although it is possible to apply Blue-Green deployments to database changes, this is usually not implemented due to the complexity of managing data migrations and the amount of data that may need to be transferred from one environment to another. Instead, database changes are usually handled with extra care to ensure that they work with the current and new versions of the application. It is very important to test the new schema with both versions to be prepared for an application version rollback or, if having a common schema is not possible, rollback scripts must be created to be executed if necessary. We will have a separate post addressing techniques to ensure database schema backward and forward compatibility.
In order to support rollbacks, the old version of the application must be able to work with data created by the new version. This usually means that the old version must be able to handle extra fields in a database table or event schema without crashing. For instance, if an API changes and includes an extra field, it is important to ensure that the previous version will still work with the new API format. Also consider using feature flags to decouple the release of a feature from the deployment time.
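One common way to make the old version tolerate extra fields is to ignore anything it does not recognize. The sketch below shows that idea together with a very simple feature flag; the `OrderV1` record, the `discount_code` field, and the `show_discount` flag are made-up examples, not part of any real API.

```python
from dataclasses import dataclass, fields


@dataclass
class OrderV1:
    """Record shape known to the old version of the application (hypothetical)."""

    order_id: str
    amount_cents: int


def parse_order_v1(payload: dict) -> OrderV1:
    """Old-version parser: keeps the fields it knows and ignores the rest."""
    known = {f.name for f in fields(OrderV1)}
    return OrderV1(**{k: v for k, v in payload.items() if k in known})


def order_summary(payload: dict, flags: dict[str, bool]) -> str:
    """New-version code path: the new field is only used when the flag is on."""
    order = parse_order_v1(payload)
    summary = f"{order.order_id}: {order.amount_cents} cents"
    if flags.get("show_discount", False) and "discount_code" in payload:
        summary += f" (discount {payload['discount_code']})"
    return summary


# Data written by the new version, carrying a field the old version never saw.
new_payload = {"order_id": "A1", "amount_cents": 500, "discount_code": "SPRING"}
old_view = parse_order_v1(new_payload)  # does not crash on the extra field
```

With the flag off, the new version can be deployed and validated while behaving like the old one; turning the flag on later releases the feature without another deployment.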
When to Use
Blue-Green deployment is a risk-mitigating technique, as it provides a fast way to roll back in case of critical errors and a safe environment to test the new version. It also does not require changes to the application architecture. With that in mind, it is a good option for projects that currently have low deployment maturity, and a good first step toward more advanced techniques like Canary deployments.
You should use this technique when you wish to have no downtime when deploying a new version and you can tolerate a hard switch of traffic from one version to the other. You should also consider this technique when you want to test the new version in the production environment before sending real traffic to it.
If you have data changes that are self-contained in the new version, you can make the code significantly simpler when using Blue-Green deployments because you do not need to worry about backward compatibility. This is only true when the data is not shared between the versions.
Compared to other deployment strategies like Rolling Updates, the main difference is that with Blue-Green you only send production traffic to the new version after validating it in the production environment, which helps catch bugs before they affect user requests. Compared with Canary deployments, the difference is that only one version (either Blue or Green) receives user traffic at a time, while in Canary deployments a percentage of the traffic goes to one version and the rest goes to the other.
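The difference in traffic distribution can be expressed as routing weights. This is a small illustrative sketch; the function names and the `stable`/`canary` labels are assumptions, not a real load balancer configuration.

```python
def blue_green_weights(green_live: bool) -> dict[str, int]:
    """Blue-Green: exactly one environment receives 100% of user traffic."""
    return {"blue": 0, "green": 100} if green_live else {"blue": 100, "green": 0}


def canary_weights(canary_percent: int) -> dict[str, int]:
    """Canary: traffic is split between the stable and the new version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {"stable": 100 - canary_percent, "canary": canary_percent}


before_switch = blue_green_weights(green_live=False)
after_switch = blue_green_weights(green_live=True)
partial_rollout = canary_weights(10)
```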
The Blue-Green deployment strategy is a relatively simple technique that provides zero downtime deployment and a safe way to test the code before sending production traffic to it. It can be implemented as the final deployment solution or as a first step towards more complex techniques like Canary deployments. While implementing this technique, keep in mind these key points:
- Introduce tests to ensure the application is forward compatible. It may be necessary to change code to support extra parameters on database tables and event schemas.
- When switching traffic from one environment to another, be careful with DNS changes since they can take some time to propagate, both internally and externally.
- Implement a test procedure to validate the new version in the Green environment before switching traffic to it. Take extra care to ensure that the tests will not affect the active version of the application.
- Pay attention to infrastructure costs and try to minimize the time that both environments are up at the same time.
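The validation step from the key points above can be sketched as a gate that only switches traffic when every Green instance passes a check. This is a minimal sketch under assumptions: the `/health` path, the instance URLs, and the `switch` callback are hypothetical, and a real validation suite would typically go well beyond a single health endpoint.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_health_check(base_url: str, path: str = "/health", timeout_s: float = 2.0) -> bool:
    """Return True if the instance answers the (assumed) health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def promote_if_healthy(
    green_instances: Iterable[str],
    check: Callable[[str], bool],
    switch: Callable[[], None],
) -> bool:
    """Switch traffic to Green only if there are instances and all of them pass `check`."""
    instances = list(green_instances)
    if not instances or not all(check(instance) for instance in instances):
        return False  # Blue keeps serving traffic; nothing changes for users
    switch()
    return True


# Usage with a stubbed check, so the example runs without a network.
switched = []
promoted = promote_if_healthy(
    ["http://green-1.internal", "http://green-2.internal"],  # hypothetical addresses
    check=lambda url: True,                # replace with http_health_check in practice
    switch=lambda: switched.append("green"),
)
```

Because the checks talk to Green instances directly and the switch only happens afterward, the active Blue environment is not touched by the validation itself.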
This is the second article in a series being written by Isac Sacchi Souza, Principal DevOps Specialist, Systems Architect & member of the Daitan Technology Council. Thanks to João Augusto Caleffi, João Sávio Ceregatti Longo and the SRE/DevOps Community of Practice for reviews and insights.