Thanks for popping by. I’m Tariq Khurshid, and I lead on the website (www.tfl.gov.uk), Service Desk, and Change & Release Management. In this blog I’d like to share the success we have enjoyed using the “blue/green” approach for software release and deployment on the new website.
This is a summary of our experience doing blue/green software deployments and releases to tfl.gov.uk using Amazon Web Services (AWS) cloud infrastructure.
Blue/green deployment is the process we use to safely release new versions of www.tfl.gov.uk without any downtime or outages for customers.
The key to success is to maintain two identical production environments to switch between. As www.tfl.gov.uk is now hosted on virtual servers in the cloud, this is relatively easy and cost effective.
Blue/green deployment allows us to develop software to a high standard, test independently of the live site, and easily package and then deploy to live. This means we can rapidly, reliably and repeatedly push out enhancements and bug fixes to www.tfl.gov.uk at low risk, with minimal overheads and, best of all, no outages for customers.
To achieve a single-click, automated release solution in a blue/green continuous delivery environment, we decoupled monolithic CloudFormation stacks into a modular format and deployed them in parallel with a scalable Puppet Enterprise infrastructure. This was not easy: the design is advanced and based on an agile architecture, so we had to trail-blaze.
By our cloud provider’s own admission, www.tfl.gov.uk is currently one of the most technically advanced cloud formations for a public transport website in the world.
One of the challenges with blue/green is the cut-over: taking software from the final stage of testing to live production. We do this with a Domain Name System (DNS) switch, using Route 53, between our two identical production environments: Pre-production (blue) and Production (green).
As we prepare a new release of software, we do our final stage of testing in the pre-prod environment. When the software has passed all regression and quality tests and checks in pre-prod, we switch the DNS so that all incoming requests go to the pre-prod (blue) environment. The old production (green) environment is now idle and ready to be used as a back-up or roll-back if there are any issues.
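To make the cut-over concrete, here is a minimal sketch of the kind of change request a DNS switch involves. It only builds the change batch structure that Route 53 accepts for a record update; the hostnames, the CNAME record type and the 60-second TTL are illustrative assumptions, not our actual configuration.

```python
from typing import Any, Dict


def build_switch_change_batch(record_name: str,
                              target_dns_name: str,
                              ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 style change batch that repoints record_name.

    The record type (CNAME) and the TTL are assumptions for illustration;
    a real setup might use alias records instead.
    """
    return {
        "Comment": f"blue/green switch: {record_name} -> {target_dns_name}",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise updates it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_dns_name}],
                },
            }
        ],
    }


# Hypothetical hostnames: point the public name at the blue environment.
batch = build_switch_change_batch("www.example.org.", "blue.example.org.")
```

A dictionary like this could be handed to an AWS SDK call such as `change_resource_record_sets`. Rolling back is the same request with the green hostname as the target, and a short TTL helps the change take effect quickly for users.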
Blue/green switch and roll-backs
In terms of the actual switch, our experience has shown that DNS (Domain Name System) propagation between the two environments using Amazon’s Route 53 is very quick, with virtually no discernible impact on users. However, some users may experience transient anomalies as the network changes take effect; a refresh of the cache (F5) or a restart of the browser usually resolves these quickly.
Roll-backs are pretty much instant, but they do entail some additional DB synchronisation to capture any user-submitted data from the period between the switch and the switch back.
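The order of operations in a switch and a roll-back can be sketched as a small state tracker. This is a simplified model under stated assumptions, not our actual tooling: the class name, the log messages and the exact placement of the two synchronisation steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlueGreenState:
    """Tracks which environment is live and logs the steps taken."""
    live: str = "green"
    idle: str = "blue"
    log: List[str] = field(default_factory=list)

    def switch(self) -> None:
        old_live, new_live = self.live, self.idle
        # Sync user-submitted data before the cut-over.
        self.log.append(f"sync user data: {old_live} -> {new_live}")
        self.live, self.idle = new_live, old_live
        self.log.append(f"DNS switched to {new_live}")
        # Sync again to capture data submitted while the switch took effect.
        self.log.append(f"sync user data: {old_live} -> {new_live}")

    def roll_back(self) -> None:
        # A roll-back is the same procedure in the opposite direction.
        self.switch()
```

Calling `switch()` and then `roll_back()` on a fresh `BlueGreenState` returns the live role to green, with each direction recording its own pair of synchronisation steps.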
Cloud based hosting
Blue/green deployment is cost efficient for us: when we need more capacity, or a parallel pipeline, we can with a couple of clicks use the scalability and elasticity of AWS “on-demand” cloud infrastructure to spin up multiple test and integration environments. This provides the agility TfL needs, as we are no longer limited by our production website and systems being deployed in a physical data centre. When the production environment is no longer live and becomes pre-prod (“blue”), we immediately spin the infrastructure down to its default non-live size, which is very cost effective.
We currently have around 10 different on-demand cloud environments (e.g. blue/green, development, testing and projects), but we only really need four for blue/green, three of which can be scaled down to minimum infrastructure size as they are for development and test purposes. Cloud size is based on website traffic, and the live website can automatically scale up or down according to customer demand, so we only pay for infrastructure that we actually need.
We are going through a cultural shift with cloud hosting, where it is often more cost effective to throw away a poorly performing virtual server or system, spin up a new instance, load test it and quickly make it live. We are still in the mind-set of “let’s fix it, trouble-shoot and work it out”, or perhaps this is something more personal: “man vs code”?
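As a rough illustration of traffic-based sizing, the sketch below computes how many instances an environment would need for a given load, clamped between a minimum and a maximum. The function name, the per-instance capacity and the limits are hypothetical; the real scaling rules are configured in AWS rather than written like this.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      minimum: int,
                      maximum: int) -> int:
    """Return the instance count for the current load, within fixed limits."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up so that the load never exceeds the available capacity.
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp: never below the idle minimum, never above the agreed maximum.
    return max(minimum, min(maximum, needed))
```

With an assumed capacity of 100 requests per second per instance and limits of 2 and 20, a load of 950 requests per second needs 10 instances, while an idle non-live environment stays at the minimum of 2.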
In our context, blue/green deployment is an enabler for continuous delivery: we continually prove our ability to deliver new code or functionality by treating each release package for the website as if it could be deployed to the live website. We do this by progressing the release package through a parallel deployment pipeline and a series of build/test/deploy cycles that safely prove its suitability and optimise the release ready for deployment. At the end of the pipeline, barring a couple of manual steps, we can deploy automatically to our production website, i.e. continuous delivery, or we can make a business decision on next steps via a Change Advisory Board (CAB).
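The idea of a release package earning its way to live can be sketched as a sequence of gates followed by a final approval. This is a minimal illustration only; the stage names, the check functions and the `cab_approved` flag are hypothetical stand-ins for real build, test and CAB steps.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True when the package passes.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(package: str, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (all_passed, names_of_stages_that_passed).
    """
    passed: List[str] = []
    for name, check in stages:
        if not check(package):
            return False, passed
        passed.append(name)
    return True, passed


def ready_for_live(package: str, stages: List[Stage], cab_approved: bool) -> bool:
    """A package goes live only if every stage passed and the CAB said Go."""
    all_passed, _ = run_pipeline(package, stages)
    return all_passed and cab_approved
```

Because the pipeline stops at the first failing stage, a package that fails regression testing never reaches the later stages, and nothing reaches live without the final Go decision.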
Databases and blue/green
All schema changes are made during a full deployment, which loads all of the data as part of a batch loading process. This includes data such as tube status, bus predictions and bus routes. These feeds run independently in each environment, so the data does not need to be synchronised as part of the blue/green deployment. Data submitted by users via the TfL site is synchronised between the environments before and after the blue/green switch using automated scripts.
User-submitted data usually stays in table structures that rarely change. If a table has changed, any tables or columns that cannot be matched during synchronisation, because they have been added or removed, can be ignored. We plan to move soon to a new centralised RDS database (DB) design, so that we no longer have to synchronise DBs when we do blue/green, making the process smoother and more seamless.
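The column-matching rule described above can be sketched in a few lines: only columns present on both sides are copied, and anything added or removed is ignored. The function names and the example column names are hypothetical, and rows are modelled as plain dictionaries rather than real database tables.

```python
from typing import Any, Dict, List


def shared_columns(source_columns: List[str],
                   target_columns: List[str]) -> List[str]:
    """Columns present in both environments, in source order."""
    target = set(target_columns)
    return [column for column in source_columns if column in target]


def project_rows(rows: List[Dict[str, Any]],
                 columns: List[str]) -> List[Dict[str, Any]]:
    """Keep only the given columns from each row, ready to copy across."""
    return [{column: row.get(column) for column in columns} for row in rows]


# Hypothetical web form table: one column removed, one added on the target.
source = ["id", "email", "message", "legacy_flag"]
target = ["id", "email", "message", "category"]
rows = [{"id": 1, "email": "a@example.org", "message": "hi", "legacy_flag": True}]
to_copy = project_rows(rows, shared_columns(source, target))
```

Here `legacy_flag` is dropped because the target no longer has it, and `category` is left for the target to fill with its own default.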
The great thing about blue/green is that it takes away the traditional time pressure on a Release team to deploy code quickly in a limited maintenance window or outage, because there is none!
Currently we aim for a release cycle every two weeks for bug fixes and new code, and we also complete a weekly standard refresh of website reference data. Release packages are built in collaboration with the business stakeholders, project managers, and development and test teams. When end-to-end testing has been completed, right up to pre-prod, the release package is reviewed in a Change Advisory Board (CAB) for a Go/No-Go decision.
Technically complex, but worth it
Cloud-based hosting can be tough, and it has been technically complex and challenging. A real-life example was a blue/green deploy we did without pre-warming the Elastic Load Balancers (ELBs). After the DNS switch we immediately saw the maintenance holding page pop up. We tried trouble-shooting live, but decided to roll back straight away, which we did within minutes, and all services were immediately restored. We followed up with a root cause analysis conference call with our cloud partner (AWS). We have now reconfigured both ELBs (Production and Pre-prod) across three different AWS availability zones (AZs), with fixed auto-scale thresholds, to prevent this happening again.
Benefits of the blue/green approach using cloud-based hosting:
1. Reduces risk by allowing time for full regression testing prior to the release of a new version to production.
2. Near zero-downtime deployments.
3. Fast rollback should anything go wrong.
4. As new code is already loaded on to a parallel environment and the live site is unaffected, the Release and Test teams have no time pressures on quickly completing the push of new code to the website during a planned outage.
5. Allows us to test our disaster recovery procedure each and every time we do a blue/green switch.
6. Eco friendly, as we no longer have to keep IT hardware and infrastructure on stand-by in a duplicate data centre. We simply spin up a pre-prod environment on demand, using the cloud provided by Amazon Web Services (AWS).
7. Enables continuous incremental service improvement, so our website will always be evolving and is easier to change and update when required.
8. No more “big bang” changes like the launch of a whole new website, because the website will now theoretically never become out of date.
9. We have developed a process to synchronise databases before and after each switch to ensure no loss of customers’ transactions (web form data) during the cut-over.
10. Facilitates planned software releases based on a release cycle (currently every 2 weeks).
11. Reduces risk as we are able to regression, soak and load test in an exact replica of the production environment before deploying to live.
12. Reduces customer impact, as there are virtually no planned outages or downtime of the website.
13. Improves confidence levels in the release package and allows for easier, pressure-free trouble-shooting along the release pipeline.
14. Releases can be scheduled during office hours. Currently we schedule blue/green switches between the peak commute rush hours, so the release window can be anytime between 10am and 4pm.
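The release window in point 14 is simple enough to express as a check that a scheduling script might run before allowing a switch. This is a sketch only: the function name is hypothetical, and treating 4pm itself as outside the window is an assumption.

```python
from datetime import datetime, time

# Window between the morning and evening peaks (from point 14 above).
WINDOW_START = time(10, 0)
WINDOW_END = time(16, 0)


def in_release_window(moment: datetime) -> bool:
    """True if the given local time falls inside the 10am to 4pm window.

    The end of the window is exclusive, which is an assumption.
    """
    return WINDOW_START <= moment.time() < WINDOW_END
```

For example, a switch proposed for 11:00 would be allowed, while one proposed for 09:59 or 16:00 would be held back until the next window.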
Blue/green is a powerful technique to manage software releases, especially when using cloud infrastructure. Our cloud provider (AWS) enables us to easily create new on-demand environments at the push of a button and provides different cost-effective options to implement blue/green deployments.
Since go-live of the new website on 24/3/14, we have deployed numerous new releases of software as either bug fixes, new functionality, or weekly updates of reference data, using blue/green, which has worked seamlessly and customers have experienced virtually zero down-time.
Now that we have adopted a Continuous Integration and Deployment (CI/CD) pipeline using the blue/green approach, we will be continually improving, so the website will incrementally grow and evolve with customer needs, and we too will evolve, adapt and grow with all the cutting-edge technologies employed in our new website.
For further reading, see the links below to other articles and blogs on the blue/green approach: