Rollbacks: Undo a Bad Deploy in 30 Seconds
Production Ready -- Part 23 of 30
It was a Friday at 6:47 PM. Marcus, a solo developer running a small SaaS tool for freelancers, had been chasing a nasty bug in his invoicing flow all week. He finally had it. He pushed to main, Vercel deployed in 90 seconds, and he closed his laptop feeling like a hero.
By 7:15 PM, his support inbox had 14 messages. The invoice generation was broken -- not just for new invoices, but for anyone who tried to view old ones. Turns out the database query refactor he thought was contained had silently broken a shared utility function used everywhere.
Marcus did not have a rollback strategy. He spent the next two hours pushing hot fixes, each one creating new problems. He lost three paying customers that weekend.
This is not a story about incompetence. This is a story about skipping the safety net that every serious production system needs: a fast, practiced rollback procedure.
Every Deploy Is a Bet. Rollback Is Your Hedge.
You are never 100% certain a deploy is clean. You tested it. You reviewed it. Your staging environment passed. But production has real users, real data, and real edge cases that staging never surfaces. The Cockroach Labs State of Resilience 2025 report, which surveyed 1,000 senior technology executives worldwide, found that 100% of organizations experienced outage-related revenue loss in a single year, with per-incident losses reaching as high as $1 million. Fifty-five percent report outages at least weekly.
Even the hyperscalers get it wrong. On October 29, 2025, a single inadvertent configuration change in Azure Front Door propagated across Microsoft's global edge network, taking down Azure services, Microsoft 365, Outlook, and Xbox simultaneously. Microsoft's fix? Roll back to the last known good configuration. That is it. Detection, freeze, rollback. Recovery took hours because of DNS propagation -- but the fix itself was simple.
If Microsoft relies on rollback as its primary incident response tool, so should you.
Vercel Instant Rollback: Seconds, Not Minutes
If you deploy on Vercel, you have the best rollback story in the business. Vercel keeps every previous production deployment alive and aliased. When something breaks, you are not rebuilding from scratch -- you are flipping a pointer.
From the dashboard:
- Open your project on vercel.com
- Click Instant Rollback on the production deployment tile
- Select the deployment you want to restore
- Click Confirm Rollback
Traffic redirects immediately. The old deployment was never torn down; Vercel just routes your domain back to it.
From the CLI, it is one command:
vercel rollback
Or target a specific deployment by URL or ID:
vercel rollback https://your-app-abc123.vercel.app
Check rollback status:
vercel rollback status
One important detail per Vercel's documentation: after a rollback, Vercel disables auto-deployment. New pushes to your main branch will not go live automatically until you explicitly undo the rollback or promote a new deployment. This is intentional -- it prevents a broken CI pipeline from immediately overwriting your recovery.
Pro and Enterprise plans can roll back to any previous production deployment. Hobby plans can roll back to the immediately previous one. If you are running anything serious, that alone is worth the upgrade.
Git-Based Rollback: When You Need to Go Deeper
Not every project runs on Vercel. For self-hosted setups or custom CI pipelines, your rollback tool is git.
Option 1: Revert the commit (clean, auditable)
git revert HEAD
git push origin main
git revert creates a new commit that undoes the changes from the last commit. It preserves history, which is what you want in production. Your pipeline picks up the new commit, rebuilds, and deploys. This is slower than a Vercel instant rollback because you are triggering a full build.
Option 2: Force-push to a previous commit (fast, but destructive)
git log --oneline -10 # find the good commit hash
git reset --hard abc1234 # reset to that commit
git push --force origin main # overwrite remote history
Use force-push only when speed matters more than history. It rewrites your branch history, which can cause confusion for anyone else on the team. Always communicate before force-pushing to a shared branch.
Check your GitHub Actions or CI logs immediately after -- confirm the deploy job ran and passed health checks before you declare the incident resolved.
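That post-rollback confirmation can be scripted so you are not refreshing a dashboard by hand. Here is a minimal sketch in Node; the HEALTH_URL value and the { status: "ok" } response shape are assumptions, so match them to your own health endpoint:

```javascript
// Sketch: confirm a rollback actually restored service before closing the incident.
// HEALTH_URL and the { status: "ok" } body shape are assumptions -- match your app.
const HEALTH_URL = process.env.HEALTH_URL || "https://yourapp.com/health";

// Pure predicate so the logic is easy to test in isolation.
function isHealthy(statusCode, body) {
  return statusCode === 200 && body != null && body.status === "ok";
}

// Poll until healthy or give up: 6 attempts, 10 seconds apart by default.
async function confirmRecovery(attempts = 6, delayMs = 10_000) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetch(HEALTH_URL);
      const body = await res.json().catch(() => null);
      if (isHealthy(res.status, body)) return true;
    } catch (err) {
      console.error(`attempt ${i} failed: ${err.message}`);
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return false;
}
```

Run it right after the rollback command returns; if it resolves false, the rollback did not fix the problem and you are looking at a deeper issue than the last deploy.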
Database Migrations: The Hard Part Nobody Talks About
Code rollbacks are easy. Database rollbacks are where developers get burned.
Here is the problem: when you deploy version 2 of your app, you often also run a database migration -- adding a column, renaming a field, changing a constraint. If you roll the code back to version 1 but the database is still at the version 2 schema, your rolled-back app may fail in completely new ways.
The solution is down migrations. Every migration you write should have a corresponding rollback:
-- up: add user timezone column
ALTER TABLE users ADD COLUMN timezone VARCHAR(50) DEFAULT 'UTC';
-- down: remove it
ALTER TABLE users DROP COLUMN timezone;
In practice, most migration tools (Flyway, Liquibase, Prisma, Rails) support up and down directions. The discipline is actually writing the down migration at the same time you write the up. Do not leave it for later. Later is never.
The safer approach for production is expand-contract migrations:
- Expand: Add the new column but keep the old one. Deploy. Both versions of the app work.
- Contract: After the new code has been stable for a few days, remove the old column.
This eliminates the need for emergency database rollbacks entirely. Your code can always roll back because the database schema supports both versions simultaneously.
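The two phases can be sketched as two ordinary up/down migrations shipped days apart. The table and column names below are illustrative, and `db.query` stands in for whatever client your migration tool uses:

```javascript
// Phase 1 -- expand: add the new column, keep the old one. Both app versions work.
const expand = {
  up: "ALTER TABLE users ADD COLUMN timezone VARCHAR(50) DEFAULT 'UTC'",
  down: "ALTER TABLE users DROP COLUMN timezone",
};

// Phase 2 -- contract: only after the new code has been stable for days.
const contract = {
  up: "ALTER TABLE users DROP COLUMN legacy_tz",
  down: "ALTER TABLE users ADD COLUMN legacy_tz VARCHAR(50)",
};

// Rolling back is just running the same migration in the other direction.
async function migrate(db, migration, direction = "up") {
  await db.query(migration[direction]);
}
```

The key property: between the two phases, either app version can run against the live schema, so a code rollback never requires an emergency schema change.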
Feature Flags: The Rollback That Is Not a Rollback
Sometimes the cleanest rollback is not a rollback at all -- it is a feature flag flip.
Feature flags let you deploy code that is dormant until you activate it. If something goes wrong, you disable the flag. No rebuild. No git revert. No deployment pipeline. Just a config change that takes effect in seconds.
if (featureFlags.isEnabled('new-invoicing-flow', userId)) {
return newInvoicingFlow(invoice);
} else {
return legacyInvoicingFlow(invoice);
}
Tools like LaunchDarkly, Flagsmith, and Unleash are purpose-built for this. At its simplest, you can even use an environment variable or a row in your database.
Feature flags shine when you are rolling out to a subset of users first -- catching problems before they affect everyone. They are not a replacement for deployment rollbacks; they are a complementary layer that reduces how often you need one.
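The environment-variable version really is this small. A minimal sketch, assuming a FEATURE_FLAGS variable holding a comma-separated list (the variable name is an assumption, not a standard):

```javascript
// Simplest possible flag store: FEATURE_FLAGS="new-invoicing-flow,dark-mode".
// Disabling a feature is then a config change, not a rebuild or redeploy.
function isEnabled(flag, env = process.env) {
  return (env.FEATURE_FLAGS || "")
    .split(",")
    .map((f) => f.trim())
    .includes(flag);
}
```

This lacks per-user targeting and runtime updates, which is exactly what the dedicated tools add; but for a solo project it covers the "turn it off without deploying" case.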
Zero-Downtime Rollback Strategy
A properly executed rollback should be invisible to users. Here is how to achieve that:
Keep the previous deployment warm. Vercel does this automatically. On self-hosted infrastructure, blue-green deployments maintain two live environments so traffic can switch instantly.
Use a load balancer as the traffic controller. The rollback is just a routing change, not a process restart. Users with in-flight requests complete them against the old version; new requests go to the restored version.
Drain gracefully. If your app handles long-running requests (file uploads, payment processing), give the current version time to finish before pulling the plug.
Confirm health before declaring done. Hit your health check endpoint, check your error rate in Sentry or Datadog, and watch your logs for 60 seconds before you close the incident.
The 30-Second Incident Response Checklist
Something broke. The clock is running. Here is exactly what to do:
0-10 seconds:
- Open your monitoring dashboard (Sentry, Datadog, Vercel Analytics)
- Confirm the error spike correlates with your deploy timestamp
- If yes: trigger rollback immediately, ask questions later
10-20 seconds:
- On Vercel: click Instant Rollback
- On CLI: vercel rollback or git revert HEAD && git push
- Post in your team Slack/Discord: "Rolling back deploy from [time], investigating"
20-30 seconds:
- Confirm the rollback completed (check the dashboard or vercel rollback status)
- Watch the error rate for 60 seconds to confirm it drops
- Check the health endpoint:
curl https://yourapp.com/health
After stabilization:
- Open an incident doc. Write down what happened, even if it is just you.
- Reproduce the bug in staging before you attempt another fix
- Do not redeploy until you understand root cause
Monitoring: Know Fast When Something Breaks
You cannot roll back what you cannot detect. The faster your alerting, the less damage a bad deploy causes.
The non-negotiables:
- Error rate monitoring: Sentry, Rollbar, or Datadog should alert you within 60 seconds of an error rate spike. Set a threshold, such as a 1% error rate, that triggers a page.
- Synthetic health checks: A dead-simple endpoint that returns 200 OK and pings a database. UptimeRobot, Better Uptime, or Vercel's built-in checks hit it every minute.
- Deploy markers: Tag every deploy in your monitoring tool so error spikes are instantly correlated with a specific commit. Sentry does this automatically if you integrate with GitHub.
If you are not alerted within two minutes of a broken deploy, your monitoring is not doing its job.
Practice Rollbacks Before You Need Them
The worst time to figure out your rollback procedure is during an actual incident. The Cockroach Labs report found that fewer than one-third of organizations conduct regular failover testing, which explains why so many teams are caught flat-footed when things go wrong.
Add a rollback drill to your staging routine:
- Deploy a known-bad commit to staging
- Trigger a rollback using your exact production procedure
- Confirm the rollback restored the previous behavior
- Time yourself
If your rollback takes more than five minutes in staging, it will take longer under pressure in production. Drill until it is muscle memory.
Action Items
- Confirm your Vercel plan allows rollbacks to previous deployments (not just the immediately prior one)
- Run vercel rollback --help and verify you know the command before you need it
- Write a down migration for every database migration you author from now on
- Add a /health endpoint to your production app if you do not have one
- Set up a Sentry alert that fires when the error rate exceeds 1% within 5 minutes of a deploy
- Perform one rollback drill in staging this week -- time it
- Identify one feature in your roadmap that could ship behind a feature flag instead of a direct deploy
- Create a one-page incident response runbook, even if it is just a Notion doc
Ask The Guild
What is the worst deployment incident you have recovered from, and what was the tool or technique that saved you? Share in the community -- specifics welcome. Timestamps, error messages, the command that fixed it. The more concrete, the more useful for everyone building alongside you.