Production Ready — Part 15 of 30

Database Migrations: Schema Changes Without Data Loss

Written by claude-sonnet-4 · Edited by claude-sonnet-4
database · migrations · schema-changes · production · data-safety · postgresql · alembic · django · zero-downtime · devops



It was a Tuesday afternoon in February 2026. A developer pushed what looked like a routine cleanup migration — dropping an old status column from the orders table. The column had already been removed from all the Eloquent models. Code review approved. Tests green. Deployment started.

Twenty seconds later, Sentry lit up like a Christmas tree.

A legacy API endpoint — rarely used, zero test coverage — was still hitting that column through a raw query. During the rolling deployment, older app instances were still alive and making requests. Every single one failed. Twenty-three minutes of partial downtime. An awkward conversation with the CTO.

That story, shared on Reddit in February 2026, is so painfully common it practically has its own genre. And it's not the worst version. In March 2026, a developer accidentally left a DROP TABLE IF EXISTS in a migration file from local testing. It got pushed to main. The automated deploy ran it against production. 1,247 customer records — gone.

Database migrations are where vibe coding meets its harshest reality check. The database doesn't care about your confidence. It executes what you tell it to.

Let's talk about how to get this right.


Why Migrations Are Uniquely Dangerous

Code deployments are (mostly) reversible. You push bad code, you roll back. But database migrations are a different beast:

  1. They're often irreversible — a dropped column takes its data with it
  2. They run against live data — not a test environment copy
  3. They can lock tables — blocking reads and writes for minutes on large tables
  4. Old and new code run simultaneously — during rolling deploys, both versions hit the same schema

That last point is what sank the deploy in our opening story. The developer thought about the new code. They forgot about the old code that was still running.

GitHub learned a similar lesson the hard way. Their engineering team documented a postmortem where a schema migration on a large MySQL table caused a semaphore deadlock across multiple read replicas. The final rename step — the last action of a migration that had taken weeks to complete — sent a cascade of replicas into crash-recovery mode. Not enough healthy replicas remained to handle production traffic. Core GitHub services went down.

Schema migrations are not a detail. They are a critical path operation.


The Expand/Contract Pattern: Your New Religion

The single most important concept in safe database migrations is expand/contract (also called parallel change). Stop thinking about migrations as atomic operations. Start thinking about them as a two-phase process across multiple deployments.

Phase 1 — Expand: Add new structure alongside old structure. Never remove anything.

Phase 2 — Contract: Once all code is updated and verified, remove the old structure.

Here's what this looks like in practice. Say you want to rename user_name to display_name in your users table.

The wrong way (single migration, single deploy):

-- DON'T DO THIS IN ONE SHOT
ALTER TABLE users RENAME COLUMN user_name TO display_name;

The moment this runs, any old app instances still reading user_name break.

The right way (three deployments, two migrations):

Deploy 1 — Expand migration:

-- Add the new column, copy existing data
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
UPDATE users SET display_name = user_name;

Deploy 1 — Application code: Write to both columns, read from display_name.

# Write to both during transition
def update_user_name(user_id, new_name):
    db.execute("""
        UPDATE users 
        SET display_name = %s, user_name = %s 
        WHERE id = %s
    """, (new_name, new_name, user_id))

Deploy 2 — Application code: Read from display_name only. Stop writing to user_name.
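The flip from dual-write to single-write is easier to audit if it's driven by a flag rather than scattered edits. Here's a minimal sketch of that idea — `build_user_name_update` and `WRITE_LEGACY_COLUMN` are hypothetical names, not part of any framework — where the helper returns the SQL and parameters so the transition is one toggle:

```python
WRITE_LEGACY_COLUMN = True  # flip to False at Deploy 2

def build_user_name_update(user_id, new_name, write_legacy=WRITE_LEGACY_COLUMN):
    """Return (sql, params) for the user_name -> display_name transition."""
    if write_legacy:
        # Deploy 1: dual-write so old instances reading user_name keep working
        sql = "UPDATE users SET display_name = %s, user_name = %s WHERE id = %s"
        params = (new_name, new_name, user_id)
    else:
        # Deploy 2: write only the new column; user_name is now dead weight
        sql = "UPDATE users SET display_name = %s WHERE id = %s"
        params = (new_name, user_id)
    return sql, params
```

The payoff is that Deploy 2 becomes a one-line config change you can revert instantly if anything old is still reading `user_name`.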

Deploy 3 — Contract migration (after verifying everything is stable):

-- Only now, after code is fully deployed and verified
ALTER TABLE users DROP COLUMN user_name;

This feels slower. It is slower. It's also the only way to rename a column in a live system without causing an outage.


The Non-Null Column Trap

Here's another classic that surprises even experienced developers:

# Alembic migration
def upgrade():
    op.add_column('products', 
        sa.Column('sku', sa.String(50), nullable=False)
    )

On a table with a million rows and no default value, this migration will either fail immediately (Postgres) or lock the table for minutes while it backfills (older MySQL). Either way, your application is having a bad time.

The safe version:

def upgrade():
    # Step 1: Add as nullable with a default
    op.add_column('products',
        sa.Column('sku', sa.String(50), nullable=True, server_default='')
    )
    
    # Step 2: Backfill in batches (don't do it all at once)
    op.execute("""
        UPDATE products SET sku = CONCAT('LEGACY-', id::text)
        WHERE sku IS NULL OR sku = ''
    """)
    
    # Step 3: Add the NOT NULL constraint only after data exists
    op.alter_column('products', 'sku', nullable=False)

For truly large tables, even that batch UPDATE can be dangerous. The professional move is to do the backfill in application code, in batches with delays, outside the migration itself:

# Separate backfill script — run this BEFORE adding the NOT NULL constraint
import time

def backfill_sku_in_batches(batch_size=1000):
    total = 0
    while True:
        # Always select from the front of the remaining rows: once a row is
        # updated it no longer matches sku IS NULL, so using OFFSET here
        # would silently skip unprocessed rows.
        updated = db.execute("""
            UPDATE products
            SET sku = CONCAT('LEGACY-', id::text)
            WHERE id IN (
                SELECT id FROM products
                WHERE sku IS NULL
                ORDER BY id
                LIMIT %s
            )
        """, (batch_size,)).rowcount

        if updated == 0:
            break

        total += updated
        time.sleep(0.1)  # Give the database room to breathe
        print(f"Backfilled {total} rows...")

Locking: The Silent Killer

Adding an index to a large table takes a write lock in older MySQL versions. Adding a column with a DEFAULT in PostgreSQL before version 11 rewrote the entire table under an exclusive lock. Even in modern databases, some operations are more disruptive than you'd expect.

For Postgres, always prefer the concurrent index build:

-- This blocks writes for the duration (BAD on large tables)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- This doesn't (GOOD — takes longer but non-blocking)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

For Alembic, you'll need a custom operation:

from alembic import op

def upgrade():
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
    # Alembic's autocommit_block (1.2+) handles this cleanly,
    # instead of the old op.execute('COMMIT') hack.
    with op.get_context().autocommit_block():
        op.execute(
            'CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id)'
        )

For Django, use AddIndexConcurrently from django.contrib.postgres.operations (available since Django 3.0) in a migration with atomic = False, or use the django-pg-zero-downtime-migrations package, which wraps common operations with safe defaults.
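For illustration, a Django migration using AddIndexConcurrently might look like the sketch below. The app name, model name, and dependency are hypothetical — adapt them to your project. The key details are real: AddIndexConcurrently lives in django.contrib.postgres.operations, and the migration must declare atomic = False because CONCURRENTLY cannot run inside a transaction.

```python
# migrations/0015_orders_user_id_idx.py — a sketch, names are placeholders
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    # Required: CREATE INDEX CONCURRENTLY cannot run in a transaction
    atomic = False

    dependencies = [("orders", "0014_previous")]

    operations = [
        AddIndexConcurrently(
            model_name="order",
            index=models.Index(fields=["user_id"], name="idx_orders_user_id"),
        ),
    ]
```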


Your Pre-Migration Checklist

Before any migration touches production, run through this:

# 1. Take a snapshot backup right before
pg_dump production_db > backup_$(date +%Y%m%d_%H%M%S).sql

# 2. Test the migration against a production-size dataset
# (staging with a copy of prod data, NOT a small dev dataset)
python manage.py migrate --database=staging_prod_copy

# 3. Time the migration on that dataset
time psql staging_db < migration.sql

# 4. Verify rollback works
python manage.py migrate myapp 0014  # Roll back to previous

# 5. Check for destructive operations
grep -r "DROP TABLE\|DROP COLUMN\|TRUNCATE" migrations/

In CI, add a step that blocks merges with destructive migrations:

# .github/workflows/migration-safety.yml
- name: Check for destructive migrations
  run: |
    if grep -rn "DROP TABLE\|TRUNCATE TABLE" migrations/; then
      echo "::error::Destructive migration detected. Manual review required."
      exit 1
    fi

The Rollback Strategy Nobody Has

According to research from Cloudficient, 23% of organizations experience some data loss during migration. A defining theme across major outages in 2025 was teams discovering they had no viable rollback — either no backup existed, the backup hadn't been tested, or the backup was six hours old. The GitLab database incident that wiped six hours of production data happened because backups were misconfigured and untested.

Before every production migration:

  1. Verify your backup exists and is recent — check the timestamp, don't assume
  2. Test the restore process — on a separate environment, not in theory
  3. Document the rollback SQL — write the down-migration before you need it
  4. Set a rollback decision point — if X errors occur within Y minutes, roll back immediately
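That last item works best when it's encoded up front rather than decided under pressure mid-incident. Here's a minimal sketch of a deployment-window watcher — `get_error_count` is a hypothetical hook into whatever monitoring system you use (Sentry, Datadog, etc.), not a real API:

```python
import time

def watch_deploy(get_error_count, threshold=50, window_seconds=300, poll=10):
    """Poll error counts during the deployment window.

    Returns True the moment the rollback threshold is crossed
    ("X errors within Y minutes"), False if the window passes cleanly.
    get_error_count is a caller-supplied hook into your monitoring system.
    """
    deadline = time.monotonic() + window_seconds
    total = 0
    while time.monotonic() < deadline:
        total += get_error_count()
        if total >= threshold:
            return True  # roll back now, investigate later
        time.sleep(poll)
    return False  # window passed cleanly
```

Wiring this into your deploy script turns "someone should watch the dashboards" into an automatic, pre-agreed decision.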

A rollback plan you've never tested is not a rollback plan. It's a hope.


Production Migration Checklist

  • Backup taken and timestamp verified within the last 30 minutes
  • Migration tested against production-scale data in staging
  • Migration timing measured (know how long it will take)
  • No destructive operations (DROP, TRUNCATE, RENAME) without explicit expand/contract plan
  • New NOT NULL columns have defaults or are added nullable first
  • Indexes created with CONCURRENTLY (Postgres) or equivalent
  • Application code is compatible with both old AND new schema
  • Rollback SQL written and tested
  • Monitoring/alerting active for the deployment window
  • Team member on standby with rollback authority
  • CI pipeline blocks merges with unreviewed destructive migrations

Ask The Guild

What's the scariest database migration story from your production experience? Did you ever drop a column that was still in use? Run a migration without a backup? Discover a rollback plan that didn't work? Share your war story in the Guild — your mistake might save someone else's data. And if you've developed a migration safety workflow that you'd swear by, we want to hear that too.


Tom Hundley is a software architect with 25 years of experience and the author of the Production Ready series. He has personally caused, witnessed, and survived more database incidents than he cares to count.
