Database Migrations: Schema Changes Without Data Loss
Production Ready — Part 15 of 30
It was a Tuesday afternoon in February 2026. A developer pushed what looked like a routine cleanup migration — dropping an old status column from the orders table. The column had already been removed from all the Eloquent models. Code review approved. Tests green. Deployment started.
Twenty seconds later, Sentry lit up like a Christmas tree.
A legacy API endpoint — rarely used, zero test coverage — was still hitting that column through a raw query. During the rolling deployment, older app instances were still alive and making requests. Every single one failed. Twenty-three minutes of partial downtime. An awkward conversation with the CTO.
That story, shared on Reddit in February 2026, is so painfully common it practically has its own genre. And it's not the worst version. In March 2026, a developer accidentally left a DROP TABLE IF EXISTS in a migration file from local testing. It got pushed to main. The automated deploy ran it against production. 1,247 customer records — gone.
Database migrations are where vibe coding meets its harshest reality check. The database doesn't care about your confidence. It executes what you tell it to.
Let's talk about how to get this right.
Why Migrations Are Uniquely Dangerous
Code deployments are (mostly) reversible. You push bad code, you roll back. But database migrations are a different beast:
- They're often irreversible — a dropped column takes its data with it
- They run against live data — not a test environment copy
- They can lock tables — blocking reads and writes for minutes on large tables
- Old and new code run simultaneously — during rolling deploys, both versions hit the same schema
That last point is what burned the developer in our opening story. They thought about the new code. They forgot about the old code that was still running.
GitHub learned a similar lesson the hard way. Their engineering team documented a postmortem where a schema migration on a large MySQL table caused a semaphore deadlock across multiple read replicas. The final rename step — the last action of a migration that had taken weeks to complete — sent a cascade of replicas into crash-recovery mode. Not enough healthy replicas remained to handle production traffic. Core GitHub services went down.
Schema migrations are not a detail. They are a critical path operation.
The Expand/Contract Pattern: Your New Religion
The single most important concept in safe database migrations is expand/contract (also called parallel change). Stop thinking about migrations as atomic operations. Start thinking about them as a two-phase process across multiple deployments.
Phase 1 — Expand: Add new structure alongside old structure. Never remove anything.
Phase 2 — Contract: Once all code is updated and verified, remove the old structure.
Here's what this looks like in practice. Say you want to rename user_name to display_name in your users table.
The wrong way (single migration, single deploy):
```sql
-- DON'T DO THIS IN ONE SHOT
ALTER TABLE users RENAME COLUMN user_name TO display_name;
```
The moment this runs, any old app instances still reading user_name break.
The right way (three deployments, two migrations):
Deploy 1 — Expand migration:
```sql
-- Add the new column, copy existing data
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
UPDATE users SET display_name = user_name;
```
Deploy 1 — Application code: Write to both columns, read from display_name.
```python
# Write to both during transition
def update_user_name(user_id, new_name):
    db.execute("""
        UPDATE users
        SET display_name = %s, user_name = %s
        WHERE id = %s
    """, (new_name, new_name, user_id))
```
Deploy 2 — Application code: Read from display_name only. Stop writing to user_name.
Deploy 3 — Contract migration (after verifying everything is stable):
```sql
-- Only now, after code is fully deployed and verified
ALTER TABLE users DROP COLUMN user_name;
```
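Before actually running that contract migration, it's worth proving the two columns have converged. A minimal sketch of the check, using an in-memory SQLite table as a stand-in for production (Postgres would use IS DISTINCT FROM; SQLite's IS NOT is the equivalent NULL-safe comparison):

```python
import sqlite3

# Stand-in for the production users table
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT, display_name TEXT)"
)
db.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ada", "ada"), (2, "bob", "bob")],
)

# Count rows where the old and new columns disagree (NULL-safe comparison)
drift = db.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NOT user_name"
).fetchone()[0]
print(f"{drift} rows still differ")  # 0 means it is safe to drop user_name
```

If the count is nonzero, something is still writing only to the old column, and the contract step has to wait.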
This feels slower. It is slower. It's also the only way to rename a column in a live system without causing an outage.
The Non-Null Column Trap
Here's another classic that surprises even experienced developers:
```python
# Alembic migration
def upgrade():
    op.add_column('products',
        sa.Column('sku', sa.String(50), nullable=False)
    )
```
On a table with a million rows and no default value, this migration will either fail immediately (Postgres rejects a NOT NULL column with no default when rows already exist) or lock the table for minutes while it rebuilds (older MySQL). Either way, your application is having a bad time.
The safe version:
```python
def upgrade():
    # Step 1: Add as nullable with a default
    op.add_column('products',
        sa.Column('sku', sa.String(50), nullable=True, server_default='')
    )

    # Step 2: Backfill existing rows (one shot is fine for modest tables)
    op.execute("""
        UPDATE products SET sku = CONCAT('LEGACY-', id::text)
        WHERE sku IS NULL OR sku = ''
    """)

    # Step 3: Add the NOT NULL constraint only after data exists
    op.alter_column('products', 'sku', nullable=False)
```
For truly large tables, even that single UPDATE statement can be dangerous. The professional move is to do the backfill in application code, in batches with delays, outside the migration itself:
```python
# Separate backfill script — run this BEFORE the NOT NULL constraint
import time

def backfill_sku_in_batches(batch_size=1000):
    total = 0
    while True:
        # No OFFSET here: updated rows drop out of the WHERE clause,
        # so an advancing offset would silently skip unprocessed rows.
        updated = db.execute("""
            UPDATE products
            SET sku = CONCAT('LEGACY-', id::text)
            WHERE id IN (
                SELECT id FROM products
                WHERE sku IS NULL
                LIMIT %s
            )
        """, (batch_size,)).rowcount
        if updated == 0:
            break
        total += updated
        time.sleep(0.1)  # Give the database room to breathe
        print(f"Backfilled {total} rows...")
```
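That loop is easy to sanity-check locally. Here is a self-contained version you can run as-is, with SQLite standing in for Postgres (|| instead of CONCAT, no ::text cast, and the sleep dropped so it finishes instantly); the table size is illustrative:

```python
import sqlite3

# Stand-in table: 2,500 products with no SKU yet
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT)")
db.executemany("INSERT INTO products (id) VALUES (?)", [(i,) for i in range(1, 2501)])

def backfill_sku_in_batches(batch_size=1000):
    total = 0
    while True:
        cur = db.execute("""
            UPDATE products
            SET sku = 'LEGACY-' || id
            WHERE id IN (SELECT id FROM products WHERE sku IS NULL LIMIT ?)
        """, (batch_size,))
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
    return total

total = backfill_sku_in_batches()
print(total)  # 2500 (three batches: 1000, 1000, 500)
```

Rerunning the function is a no-op, which also makes it safe to resume after an interruption.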
Locking: The Silent Killer
Adding an index to a large table in older MySQL versions takes a write lock. Adding a column with a default in PostgreSQL before version 11 rewrote the entire table. Even in modern databases, some operations are more disruptive than you'd expect.
For Postgres, always prefer the concurrent index build:
```sql
-- This blocks writes for the duration (BAD on large tables)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- This doesn't (GOOD — takes longer but non-blocking)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
```
For Alembic, you need to step outside the migration's transaction, because CREATE INDEX CONCURRENTLY refuses to run inside one:

```python
from alembic import op

def upgrade():
    # CONCURRENTLY can't run inside a transaction, so use
    # Alembic's autocommit block (available since Alembic 1.2)
    with op.get_context().autocommit_block():
        op.execute(
            'CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id)'
        )
```
For Django, use the built-in AddIndexConcurrently operation from django.contrib.postgres.operations (with atomic = False set on the migration class), or use the django-pg-zero-downtime-migrations package, which wraps common operations with safe defaults.
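Django's AddIndexConcurrently operation (django.contrib.postgres.operations) makes such a migration look roughly like this; the file name, app label, and dependency below are hypothetical:

```python
# migrations/0015_orders_user_id_idx.py (hypothetical file name)
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    atomic = False  # CONCURRENTLY cannot run inside a transaction

    dependencies = [("orders", "0014_previous")]  # hypothetical

    operations = [
        AddIndexConcurrently(
            model_name="order",
            index=models.Index(fields=["user_id"], name="idx_orders_user_id"),
        ),
    ]
```

Note the atomic = False: without it, Django wraps the migration in a transaction and Postgres rejects the concurrent build.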
Your Pre-Migration Checklist
Before any migration touches production, run through this:
```shell
# 1. Take a snapshot backup right before
pg_dump production_db > backup_$(date +%Y%m%d_%H%M%S).sql

# 2. Test the migration against a production-size dataset
#    (staging with a copy of prod data, NOT a small dev dataset)
python manage.py migrate --database=staging_prod_copy

# 3. Time the migration on that dataset
time psql staging_db < migration.sql

# 4. Verify rollback works
python manage.py migrate myapp 0014  # Roll back to previous

# 5. Check for destructive operations
grep -r "DROP TABLE\|DROP COLUMN\|TRUNCATE" migrations/
```
In CI, add a step that blocks merges with destructive migrations:
```yaml
# .github/workflows/migration-safety.yml
- name: Check for destructive migrations
  run: |
    if grep -rn "DROP TABLE\|TRUNCATE TABLE" migrations/; then
      echo "::error::Destructive migration detected. Manual review required."
      exit 1
    fi
```
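The grep gate catches the obvious keywords, but it misses DROP COLUMN and gives unstructured output. A slightly smarter checker that can double as a local pre-commit hook might look like this (the pattern list and the *.sql layout are assumptions; extend both for your stack):

```python
import pathlib
import re
import tempfile

DESTRUCTIVE = re.compile(
    r"\b(DROP\s+TABLE|DROP\s+COLUMN|TRUNCATE(\s+TABLE)?)\b",
    re.IGNORECASE,
)

def find_destructive(migrations_dir):
    """Return 'file:line: statement' for each destructive SQL line."""
    hits = []
    for path in sorted(pathlib.Path(migrations_dir).rglob("*.sql")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if DESTRUCTIVE.search(line):
                hits.append(f"{path.name}:{lineno}: {line.strip()}")
    return hits

# Demo against a throwaway migrations directory
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "0042_cleanup.sql").write_text(
        "ALTER TABLE users DROP COLUMN user_name;\n"
    )
    hits = find_destructive(d)

print(hits)  # ['0042_cleanup.sql:1: ALTER TABLE users DROP COLUMN user_name;']
```

Exit with a nonzero status when hits is non-empty and the same script blocks merges in CI.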
The Rollback Strategy Nobody Has
According to research from Cloudficient, 23% of organizations experience some data loss during migration. A defining theme across major outages in 2025 was teams discovering they had no viable rollback — either no backup existed, the backup hadn't been tested, or the backup was six hours old. The GitLab database incident of January 2017 that wiped six hours of production data happened because backups were misconfigured and untested.
Before every production migration:
- Verify your backup exists and is recent — check the timestamp, don't assume
- Test the restore process — on a separate environment, not in theory
- Document the rollback SQL — write the down-migration before you need it
- Set a rollback decision point — if X errors occur within Y minutes, roll back immediately
A rollback plan you've never tested is not a rollback plan. It's a hope.
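The first bullet is trivially scriptable, and scripting it removes the "don't assume" failure mode entirely. A sketch of the freshness check, assuming a filesystem dump like the pg_dump command earlier:

```python
import os
import tempfile
import time

def backup_is_fresh(path, max_age_minutes=30):
    """A backup that doesn't exist, or is older than the window, fails."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds <= max_age_minutes * 60

# Demo: a file created just now passes; a missing file fails
with tempfile.NamedTemporaryFile(suffix=".sql", delete=False) as f:
    backup_path = f.name

print(backup_is_fresh(backup_path))        # True
print(backup_is_fresh("/no/such/backup"))  # False
```

Wire a check like this into the deploy pipeline so a stale or missing backup blocks the migration, rather than relying on a human to remember.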
Production Migration Checklist
- Backup taken and timestamp verified within the last 30 minutes
- Migration tested against production-scale data in staging
- Migration timing measured (know how long it will take)
- No destructive operations (DROP, TRUNCATE, RENAME) without an explicit expand/contract plan
- New NOT NULL columns have defaults or are added nullable first
- Indexes created with CONCURRENTLY (Postgres) or equivalent
- Application code is compatible with both old AND new schema
- Rollback SQL written and tested
- Monitoring/alerting active for the deployment window
- Team member on standby with rollback authority
- CI pipeline blocks merges with unreviewed destructive migrations
Ask The Guild
What's the scariest database migration story from your production experience? Did you ever drop a column that was still in use? Run a migration without a backup? Discover a rollback plan that didn't work? Share your war story in the Guild — your mistake might save someone else's data. And if you've developed a migration safety workflow that you'd swear by, we want to hear that too.
Tom Hundley is a software architect with 25 years of experience and the author of the Production Ready series. He has personally caused, witnessed, and survived more database incidents than he cares to count.