Deploying versions with zero downtime

July 24, 2014

Two months, almost to the day, since we last hit problems with deployment, we are here again. Up until the day before, deployments worked fine, if a little slowly, but now they fail with a timeout. The site is running okay but remains on the previously deployed version:

“Update environment operation is complete, but with command timeouts. Try increasing the timeout period. For more information, see troubleshooting documentation.”

It is not possible to retrieve the logs either, again because of an execution timeout. We can SSH to the EC2 server, but looking through the logs there it appears everything was rolled back, so no clues there.

So we decided to try “Deploying Versions with Zero Downtime”.

Here is what we did:

STEP 1) Take a backup of the database, and take a snapshot.
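
The snapshot half of this step can be scripted too. Here is a minimal sketch, assuming the database is an RDS instance and that `rds` behaves like `boto3.client("rds")`. The instance identifier and the `-predeploy-` naming scheme are hypothetical, not what we actually used:

```python
from datetime import datetime, timezone


def snapshot_name(db_instance_id, now=None):
    """Build a timestamped snapshot identifier (the naming scheme is an assumption)."""
    now = now or datetime.now(timezone.utc)
    # Stick to letters, digits and single hyphens, which RDS identifiers accept.
    return f"{db_instance_id}-predeploy-{now:%Y%m%d-%H%M%S}"


def take_snapshot(rds, db_instance_id, now=None):
    """Request a manual snapshot and return its identifier.

    `rds` is expected to behave like boto3.client("rds").
    """
    name = snapshot_name(db_instance_id, now)
    rds.create_db_snapshot(
        DBSnapshotIdentifier=name,
        DBInstanceIdentifier=db_instance_id,
    )
    return name
```

The call only requests the snapshot, so check that it has finished before relying on it.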

STEP 2) Duplicate the environment in AWS management console.

This required creating a new DB, which we didn’t want to do, but the checkbox was checked and disabled so there was no choice. The new environment took 18 minutes to spin up, and came up with errors:

“Script /opt/elasticbeanstalk/hooks/appdeploy/pre/ failed with returncode 5”

This was expected, and pulling the logs confirmed it was the familiar Nokogiri issue. To fix this we committed an update to a config file in the .ebextensions folder, which now contains this:

patch: []
postgresql-devel: []

STEP 3) Enable a connection from my IP address:

Go to the EC2 dashboard and add an inbound Custom TCP Rule for “my ip address” to the security group.
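
The same rule can be added from a script instead of the dashboard. This is only a sketch: `ec2` is assumed to behave like `boto3.client("ec2")`, and the group ID, address and port are placeholders you would supply yourself (we don't record here which port we opened):

```python
def inbound_rule(my_ip, port):
    """Build one IpPermissions entry allowing TCP `port` from a single address."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # /32 restricts the rule to exactly one IPv4 address.
        "IpRanges": [{"CidrIp": f"{my_ip}/32"}],
    }


def allow_my_ip(ec2, group_id, my_ip, port):
    """Add the rule to a security group; `ec2` should act like boto3.client("ec2")."""
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[inbound_rule(my_ip, port)],
    )
```

For example, `allow_my_ip(ec2, "sg-...", "203.0.113.10", 5432)` would open a hypothetical PostgreSQL port to one documentation-range address.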

STEP 4) Configure the eb tool to deploy to the new environment name, and push the Nokogiri fix committed in step 2.

$ git aws.push

STEP 5) Verify the site is working ok in the new environment, then swap the CNAME.

We noticed the new environment’s database was out of date, so we had to restore it from the latest backup. The “CNAME swap” is two clicks and, after waiting a couple of minutes, SUCCESS! The site URL now responds with the latest version.
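
We did the swap by hand in the console, but it could be scripted along these lines. A rough sketch, assuming `eb` behaves like `boto3.client("elasticbeanstalk")`; the environment names, polling interval and retry count are made up for illustration:

```python
import time


def environment_status(eb, env_name):
    """Return the Status field ("Ready", "Updating", ...) for one environment."""
    response = eb.describe_environments(EnvironmentNames=[env_name])
    return response["Environments"][0]["Status"]


def swap_when_ready(eb, old_env, new_env, poll_seconds=15, max_polls=40,
                    sleep=time.sleep):
    """Swap CNAMEs once the new environment reports Ready.

    Returns True if the swap was requested, False if we gave up waiting.
    """
    for _ in range(max_polls):
        if environment_status(eb, new_env) == "Ready":
            eb.swap_environment_cnames(
                SourceEnvironmentName=old_env,
                DestinationEnvironmentName=new_env,
            )
            return True
        sleep(poll_seconds)
    return False
```

Waiting for "Ready" only tells you the environment finished launching; it says nothing about whether the database behind it is up to date, which is the check that caught us out above.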

STEP 6) Terminate your old environment.

From within the management console this took 20 minutes and then failed:

"Deleting security group named: awseb-e-...-stack-AWSEBSecurityGroup-... failed Reason: resource sg-... has a dependent object."

We manually edited and deleted the security groups for the old environment. After that, terminating the environment succeeded quickly.
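
One common reason for a “dependent object” error is that another security group still has a rule naming the group being deleted as its source. We can't say for certain that was the cause here, but if it is, a helper like this could find the culprits. It is a sketch that assumes `security_groups` has the shape of the `"SecurityGroups"` list returned by boto3's `ec2.describe_security_groups()`:

```python
def groups_referencing(security_groups, target_group_id):
    """Return IDs of other groups whose inbound rules name target_group_id as source."""
    referencing = []
    for group in security_groups:
        if group["GroupId"] == target_group_id:
            continue  # a group referencing itself does not block its own deletion
        for permission in group.get("IpPermissions", []):
            pairs = permission.get("UserIdGroupPairs", [])
            if any(pair.get("GroupId") == target_group_id for pair in pairs):
                referencing.append(group["GroupId"])
                break  # one matching rule is enough for this group
    return referencing
```

Each group it returns has a rule that would need editing or removing before the old environment's group can be deleted.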

In conclusion, SUCCESS! We are now back in business, happily deploying versions again. WOW: the first deployment to the new environment took just 1 minute 18 seconds rather than 10+ minutes (though the one after that took 4 minutes 10 seconds).

The whole procedure took about 3 hours of elapsed time, but of course the old environment remained up and running throughout.

There were still some errors and manual steps, though. Switching over to a new DB instance in production is a major drawback of this approach. What happens to live transactions while you back up the old database, restore it to the new instance, swap the CNAME and wait for the DNS change to propagate?

© 2018 Keith P