Designing a continuous deployment system: cautious deployment

This article will be about a deployment scheduling technique that I call operationally cautious deployment. (I didn't pick the name so the acronym would end up being OCD! Honest! I swear!)

Note: This isn't new. I didn't invent it. The fact that a large number of people around the globe with similar problems have come up with this idea independently, and that most of them are happy with the results, speaks for itself: this is mostly just common-sense thinking applied to fixing a real problem. It's one of those things that sound obvious when you have them explained to you, but that you probably wouldn't come up with unless you spent quite a bit of time pondering the problem.

So you have a wonderful continuous deployment system! Congratulations! You deploy a new version, monitor your cluster of hosts, and create wonderful graphs:

[[posterous-content:NBFS9fiUjfTCi8Xr5KIW]]

Except, of course, it's not that simple. Your test suite, no matter how much you've honed and polished it, is imperfect. It will contain subtle assumptions that just aren't true in deployment. (When I say "subtle assumptions", I mean the kind that you don't know are assumptions until everything blows up.) If you're going to be deploying software all the time, deploying something with a fatal mistake in it is a matter of when, not if. So one day, the real graph is going to look something like this:

[[posterous-content:5k8ymL4PxLDrYfbFgL8h]]

It would be much nicer if we could deploy more cautiously: try the new version out on a few machines first, and only go through with the complete deployment if it appears to work. A sort of live test, deploying with a feedback loop. That might look a bit like this:

[[posterous-content:apvQ27GHbg2rEh2CWgEa]]

That deploys the new version in batches of 2**N servers, with N starting at 0 and increasing by one each round until every server runs the target version. The batch size grows exponentially because that gives us the behavior we wanted: slow and cautious at first, and rapid as soon as we're confident it'll work.
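
To make that concrete, here's a minimal sketch of the schedule in Python. The deploy() and healthy() helpers are hypothetical stand-ins for whatever your own deployment tooling and monitoring provide; this isn't any particular tool's API, just an illustration of the batching and the feedback check.

    def rollout(hosts, version, deploy, healthy):
        """Deploy `version` in batches of 2**N hosts, N = 0, 1, 2, ...

        Returns the hosts converted so far and whether the cluster still
        looked healthy after the last batch.
        """
        converted = []
        remaining = list(hosts)
        batch_size = 1  # 2**0

        while remaining:
            # Convert the next batch of hosts to the new version.
            batch, remaining = remaining[:batch_size], remaining[batch_size:]
            deploy(batch, version)
            converted.extend(batch)

            # Negative feedback: stop rolling forward immediately.
            if not healthy(converted):
                return converted, False

            batch_size *= 2  # 1, 2, 4, 8, ... hosts per round

        return converted, True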

Of course, that picture doesn't show a failure mode. Failures look a bit like this:

[[posterous-content:tibBqAUzF8OCiIOWmnDZ]]

Just like in the previous picture, a roll-out of a new version starts, except this time the new version doesn't actually work. The system catches this and reverts the hosts that have already converted.
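
Continuing the sketch above, the failure path is just a thin wrapper that notices the negative feedback and puts the converted hosts back on the old version. Again, the names are hypothetical placeholders, not a real API.

    def cautious_deploy(hosts, old_version, new_version, deploy, healthy):
        converted, ok = rollout(hosts, new_version, deploy, healthy)
        if not ok:
            # The new version misbehaved: revert every host that already
            # converted, so the whole cluster runs a known-good version.
            deploy(converted, old_version)
        return ok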

Of course, the package is still broken! You haven't fixed the problem yet; you've just reduced its impact. Your dev room should still look roughly like a control room from a bad Cold War movie right after the other party has launched a bunch of nukes.

It's just another bug, albeit a particularly nasty, show-stopping one. Like all bugs, it's not so much a bug in the software as a bug in the test suite, which incorrectly specified the requirements for the software. That becomes painfully obvious here: something is broken, but your test suite didn't catch it. Hopefully it's something that can be tested at all.

The difference between this state of panic and the original state of panic is that, ideally, your users never noticed there was a problem.

There's one obvious problem with all of this. Being able to continue a deployment on positive feedback and revert it on negative feedback relies on actually having feedback. Useful health metrics for a cluster are halfway between art and science, and are definitely outside the scope of this article.
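
Just to pin down what "feedback" means in the sketches above, here is a deliberately naive placeholder: treat the converted hosts as healthy as long as each one stays under some error-rate threshold. The error_rate() helper and the threshold are made up for the example; choosing real metrics is the hard part.

    def make_health_check(error_rate, threshold=0.01):
        # error_rate(host) is a hypothetical helper that reads whatever
        # metric your monitoring system exposes; the threshold is an
        # arbitrary example value, not a recommendation.
        def healthy(converted_hosts):
            return all(error_rate(host) < threshold for host in converted_hosts)
        return healthy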

The next article will be about another benefit of cautious deployment: continuous availability. That one is probably going to be halfway between an article about continuous deployment and a rant about how people need to learn more statistics.