Angga.
devops, deployment, ci-cd

Zero-downtime deploys: blue-green and canary

How blue-green deployments and canary releases actually achieve zero downtime — the traffic mechanics, the database problem nobody mentions, the tooling that runs them, and when not to bother.

Every team I've worked on eventually has the same uncomfortable conversation: "when can we deploy without taking the site down?" For a side project, a few seconds of downtime during a release is nothing. For a B2B marketplace taking orders during a procurement window, it is lost revenue and a support ticket. Zero-downtime deployment is the set of techniques that make a release invisible to the people using your product.

Two strategies come up again and again — blue-green and canary. They're often presented as rivals. They're not; they solve the problem from different angles, and mature teams use both. This is my working understanding of how they actually behave in production, including the parts the tutorials skip.

The problem with a naive deploy

The simplest deploy (stop the old version, swap the binary, start the new one) has a window where the service is down. Even a rolling restart, if nothing gates traffic on readiness, has a moment where a request can land on a half-started instance and get a 502.

For a low-traffic app you can hide that behind a maintenance page at 3am. For anything people depend on during business hours, you need two things the naive deploy can't give you:

  • The new version must be fully ready before a single user touches it.
  • You need a fast way back if it turns out to be wrong.

Blue-green and canary are two ways to get both.

Blue-green: two identical environments, one switch

Blue-green keeps two production-grade environments side by side. One is live (Blue), serving all traffic; the other is idle (Green), where you stage the new version.

The flow:

  • Stage Green. Deploy v2 to the idle environment. It's isolated from users, so you can smoke-test it against real infrastructure without anyone noticing.
  • Flip. Once Green passes, point the load balancer at it. The cutover is effectively instant — one config change — so there's no downtime.
  • Watch, and keep Blue warm. Monitor v2 closely. If it misbehaves, you flip the load balancer back to Blue. Rollback is as fast as the cutover was.

What's good: the cutover and the rollback are both a single, instant operation. That "big red rollback button" is the real reason teams pick blue-green for critical systems — when a release goes wrong, you're back to the known-good version in seconds, not in however long a redeploy takes.

The catch: during the release window you're running two full environments, which roughly doubles that slice of your infra bill. And 100% of users move to v2 at once — if your testing missed something, everyone hits it simultaneously (you just recover fast).
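To make the flip and the rollback concrete, here's a minimal sketch of the switch as a toy in-memory router. It assumes nothing about a real load balancer: `BlueGreenRouter`, `deploy_to_idle`, and the `smoke_test` callback are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""
    environments: Dict[str, str]              # environment name -> deployed version
    live: str = "blue"
    previous: str = field(default="", init=False)

    def idle(self) -> str:
        # The environment that is not currently serving users.
        return next(name for name in self.environments if name != self.live)

    def deploy_to_idle(self, version: str) -> str:
        target = self.idle()
        self.environments[target] = version
        return target

    def flip(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if it passes the smoke test."""
        target = self.idle()
        if not smoke_test(self.environments[target]):
            return False                      # live environment is left untouched
        self.previous, self.live = self.live, target
        return True

    def rollback(self) -> None:
        """Point traffic back at the environment that was live before the flip."""
        if self.previous:
            self.live, self.previous = self.previous, self.live

    def serving(self) -> str:
        return self.environments[self.live]


router = BlueGreenRouter({"blue": "v1", "green": "v1"})
router.deploy_to_idle("v2")                   # stage Green
flipped = router.flip(smoke_test=lambda version: version == "v2")
after_flip = router.serving()                 # users now see v2
router.rollback()                             # the big red button
after_rollback = router.serving()             # users are back on v1
```

The point of the sketch is that `flip` and `rollback` are the same cheap operation in opposite directions, which is only true as long as Blue is kept warm.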

Canary: leak it out slowly

A canary release does the opposite. Instead of moving everyone at once, you expose the new version to a small slice of traffic first, watch, and ramp up only if it stays healthy. (The name is the coal-mine canary — a small early-warning signal.)

The flow:

  • Release to a sliver. Route ~5% of traffic to v2; the other 95% stays on v1.
  • Watch the metrics that matter. Error rate, p99 latency, saturation, and any business metric the change touches. Compare the canary's numbers against the stable version's, not against an absolute threshold.
  • Ramp or roll back. If the canary looks healthy, step it up — 5% → 25% → 50% → 100%. If a metric goes sideways, halt and pull that slice back to v1. Only the canary users were ever affected.

What's good: the blast radius is whatever percentage you've ramped to. A bug that would have hit everyone now hits 5% of users for the 90 seconds before your automation rolls it back. You also get to test against real production traffic — the messy inputs that never show up in staging.

The catch: canary only works if you have the observability and automation to make the promote-or-rollback decision quickly. Eyeballing a Grafana dashboard doesn't scale. And because v1 and v2 run at the same time, they have to be compatible with each other — which leads to the part most write-ups gloss over.
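The promote-or-rollback decision can be sketched in a few lines. This is a simplified illustration, not how any particular tool does it: the `Metrics` fields, the `max_error_delta` and `max_latency_ratio` thresholds, and the `observe` callback are all assumptions I've picked for the example. The one idea it tries to show faithfully is comparing the canary against the stable version instead of against a fixed number.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def canary_is_healthy(canary: Metrics, stable: Metrics,
                      max_error_delta: float = 0.01,
                      max_latency_ratio: float = 1.2) -> bool:
    """Judge the canary relative to the stable version, not an absolute threshold."""
    error_ok = canary.error_rate - stable.error_rate <= max_error_delta
    latency_ok = canary.p99_latency_ms <= stable.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok


def ramp(steps: List[int],
         observe: Callable[[int], Tuple[Metrics, Metrics]]) -> Tuple[int, List[int]]:
    """Walk the ramp; return (final canary weight, weights that were tried)."""
    tried: List[int] = []
    for weight in steps:
        tried.append(weight)
        canary, stable = observe(weight)
        if not canary_is_healthy(canary, stable):
            return 0, tried              # roll back: all traffic returns to v1
    return steps[-1], tried


def fake_observe(weight: int) -> Tuple[Metrics, Metrics]:
    # Hypothetical metrics source: the canary degrades once it takes 50% of traffic.
    stable = Metrics(error_rate=0.002, p99_latency_ms=180.0)
    if weight >= 50:
        return Metrics(error_rate=0.05, p99_latency_ms=190.0), stable
    return Metrics(error_rate=0.003, p99_latency_ms=185.0), stable


final_weight, tried = ramp([5, 25, 50, 100], fake_observe)
```

In this run the ramp stops at the 50% step and the canary weight goes back to 0, so the 100% step is never attempted.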

The part the tutorials skip: state and the database

Blue-green and canary are easy when your app is stateless. The moment a database is involved, reality bites: both versions almost always talk to the same database. You can't casually clone production data and keep two copies in sync mid-deploy.

That has a hard consequence: v1 and v2 must both work against the same schema at the same time. Which means you cannot ship a destructive migration — renaming a column, dropping a table, changing a type — in one step. Whichever version isn't expecting the change will break.

The fix is the expand/contract pattern (also called parallel change). Splitting a column rename across releases looks like:

1. Expand   add the new column; deploy code that writes BOTH old and new
2. Backfill copy existing rows into the new column
3. Migrate  deploy code that READS the new column
4. Contract stop writing the old column; once nothing reads or writes it, drop it (a later release)

Every individual step is backward-compatible, so a canary or a blue-green flip is safe at every point. It's more work than "just rename the column," and it's the actual cost of zero-downtime that nobody quotes you up front.
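Here's the same rename walked through against an in-memory SQLite database, as a rough sketch. The `users` table, the `fullname` and `display_name` columns, and the `v1_*`/`v2_*`/`v3_*` functions are made up for the example; a real migration would go through your migration tool and run across separate releases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")   # pre-existing row

# 1. Expand: add the new column. Old code ignores it and keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def v1_write(name: str) -> None:
    # Old release: only knows the old column.
    conn.execute("INSERT INTO users (fullname) VALUES (?)", (name,))


def v1_read() -> list:
    return [row[0] for row in conn.execute("SELECT fullname FROM users ORDER BY id")]


def v2_write(name: str) -> None:
    # New release: writes BOTH columns so either version can read the row.
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


v1_write("Grace Hopper")      # v1 and v2 run side by side during the rollout
v2_write("Alan Turing")

# 2. Backfill: once every writer is dual-writing, copy the rows v1 left behind.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


# 3. Migrate: a later release reads the new column.
def v3_read() -> list:
    return [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]


old_view = v1_read()          # the old column is still complete, so rollback is safe
new_view = v3_read()

# 4. Contract (a later release): once no running version reads or writes
#    fullname, drop it, e.g. ALTER TABLE users DROP COLUMN fullname.
```

Both `old_view` and `new_view` contain all three names, which is the property that makes a flip or a rollback safe at any step before the contract.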

There's a smaller cousin too: draining. When you flip the load balancer, requests already in flight need to finish before the old instances go away. Long-lived connections — websockets, server-sent events — don't migrate; you either wait them out or accept dropping them. Set a connection-drain timeout and respect it.
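A drain timeout boils down to "stop taking new work, then wait a bounded time for in-flight work to finish". Here's a minimal sketch of that using only the standard library; `DrainGate` and its method names are hypothetical, and a real server or load balancer would do this for you through its own shutdown and deregistration settings.

```python
import threading
import time


class DrainGate:
    """Tracks in-flight requests so an old instance can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def start_request(self) -> bool:
        with self._cond:
            if not self._accepting:
                return False             # draining: new requests belong elsewhere
            self._in_flight += 1
            return True

    def finish_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float) -> bool:
        """Stop accepting work, then wait up to timeout_s for in-flight requests.

        Returns True if everything finished, False if the timeout expired
        (whatever is still connected at that point gets dropped).
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)


gate = DrainGate()
gate.start_request()                     # one request is in flight at flip time


def slow_request() -> None:
    time.sleep(0.05)                     # simulated work
    gate.finish_request()


worker = threading.Thread(target=slow_request)
worker.start()
drained = gate.drain(timeout_s=2.0)      # waits for the slow request, then returns
worker.join()
accepted_after_drain = gate.start_request()
```

A websocket that never closes is the case where `drain` returns `False`: the timeout is what turns "wait them out" into "accept dropping them".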

Blue-green vs canary: when to reach for which

                  Blue-green                  Canary
----------------  --------------------------  --------------------------
Traffic cutover   All at once (instant)       Gradual (5% → 100%)
Infra cost        2x full environments        A few extra instances
Rollback          Flip back to Blue           Halt + revert the subset
Blast radius      100% if it's wrong          Only the % you've ramped to
Time to roll out  Seconds                     Minutes to hours
Hard requirement  A second environment        Real metrics + automation

The rough rule I use: blue-green when the release is well-tested and I want an instant rollback lever; canary when the change is risky and I'd rather discover problems on 5% of users than 100%. Critical, well-understood services lean blue-green. New features and uncertain changes lean canary. In a microservices setup you can mix per service — canary the risky one, blue-green the one that must never blink.

What actually implements this

These are concepts; here's what runs them in production:

  • AWS: an Application Load Balancer with weighted target groups splits traffic by percentage (that's your canary dial). CodeDeploy has a native blue/green mode for ECS and Lambda that manages the second environment and the cutover for you.
  • Kubernetes: Argo Rollouts and Flagger automate canary analysis: they watch Prometheus metrics and promote or roll back without a human in the loop.
  • Service mesh: Istio or Linkerd do request-level traffic splitting, which lets you canary by header or user cohort, not just by percentage.
  • Feature flags: LaunchDarkly, Unleash, or a homegrown flag table.
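The idea behind a weighted split is simple enough to sketch without any of those tools. This is an illustration of percentage-based routing in general, not of how an ALB or a mesh implements it; `pick_target` and the weight names are assumptions for the example.

```python
import random
from collections import Counter
from typing import Dict


def pick_target(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a target for one request in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)                  # seeded only to make the sketch repeatable
weights = {"v1": 95, "v2": 5}            # the canary dial
total_requests = 10_000
hits = Counter(pick_target(weights, rng) for _ in range(total_requests))
canary_share = hits["v2"] / total_requests

# Rolling back is just turning the dial to zero.
rolled_back = {"v1": 100, "v2": 0}
after_rollback = {pick_target(rolled_back, rng) for _ in range(1_000)}
```

Note that this splits per request, so the same user can bounce between v1 and v2. Routing by header or user cohort, as the mesh bullet describes, is what keeps a given user on one version.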

Feature flags: decouple deploy from release

The third leg people forget. Blue-green and canary control which version of the code is running. Feature flags control which features are switched on — independently of any deploy.

You ship the new code dark (flag off), deploy it through your normal pipeline, and then turn the feature on for 1% of users from a dashboard — no redeploy. If it misbehaves, you flip the flag off in seconds. In practice the mature setup combines all three: deploy with canary, keep blue-green's rollback lever for the critical services, and gate the genuinely scary changes behind a flag so "release" and "deploy" are separate decisions.
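A homegrown version of that can be quite small. Here's a sketch of a percentage rollout with a kill switch, assuming users are identified by a string ID; the `Flag` fields, `bucket`, and `is_on` are names I've invented, not the API of LaunchDarkly or Unleash.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Flag:
    name: str
    enabled: bool = False        # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0     # 0 to 100, share of users who see the feature


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket 0-99 so they get the same answer every time."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_on(flag: Flag, user_id: str) -> bool:
    return flag.enabled and bucket(flag.name, user_id) < flag.rollout_percent


flag = Flag("new-checkout", enabled=True, rollout_percent=1)
users = [f"user-{i}" for i in range(10_000)]
exposed_at_1 = {u for u in users if is_on(flag, u)}

flag.rollout_percent = 25                # ramp from the dashboard, no redeploy
exposed_at_25 = {u for u in users if is_on(flag, u)}

flag.enabled = False                     # something looks wrong: flip it off
exposed_after_kill = {u for u in users if is_on(flag, u)}
```

Hashing the flag name together with the user ID means a user who saw the feature at 1% still sees it at 25%, and different flags don't all pick the same unlucky users.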

When not to bother

Honest right-sizing, because cargo-culting Netflix's pipeline onto a weekend project is its own kind of mistake. If you're running a side project or a low-traffic internal tool, a rolling deploy with a health check is plenty — the orchestrator waits for the new instance to pass health checks before retiring the old one, and that already gets you most of the way to invisible.
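For comparison, the rolling deploy described above is roughly this loop. It's a toy model under my own assumptions: the fleet is a list of version strings and `passes_health_check` is a hypothetical probe, whereas a real orchestrator handles the starting, probing, and retiring for you.

```python
from typing import Callable, List


def rolling_deploy(fleet: List[str], new_version: str,
                   passes_health_check: Callable[[int, str], bool]) -> List[str]:
    """Replace instances one slot at a time; halt at the first failed health check."""
    result = list(fleet)
    for slot in range(len(result)):
        # The replacement is started next to the old instance and probed first.
        if not passes_health_check(slot, new_version):
            break                        # halt: remaining slots keep the old version
        result[slot] = new_version       # only now is the old instance retired
    return result


fleet = ["v1", "v1", "v1"]
all_good = rolling_deploy(fleet, "v2", lambda slot, version: True)
halted = rolling_deploy(fleet, "v2", lambda slot, version: slot < 1)
```

When the second replacement fails its check, the rollout stops with one instance on v2 and two still on v1, and no slot was ever left empty.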

Reach for blue-green and canary when downtime has a real, quantifiable cost: revenue, an SLA with penalties, or a user base big enough that someone always notices. Blue-green doubles your infra spend during the window; canary needs observability you may not have built yet. The technique should match the stakes.

Closing

Both strategies are really about one thing: never let your users be the ones who discover the deploy went wrong. Blue-green gives you a big red rollback button. Canary makes sure that if you do break something, you break it for 5% of people for ninety seconds instead of everyone for an hour. Feature flags let you separate the act of shipping code from the act of turning it on.

The systems I've worked on that took uptime seriously never picked just one. They canaried the risky services, kept blue-green for the ones that couldn't blink, and flagged the changes that genuinely scared them — and they treated backward-compatible database migrations as non-negotiable, because that's the part that actually makes any of it safe.
