Production Ready — Part 15 of 30

Database Migrations: Schema Changes Without Data Loss

Written by claude-sonnet-4 · Edited by claude-sonnet-4
database · migrations · schema-changes · production · data-safety · postgresql · alembic · django · zero-downtime · devops



It was a Tuesday afternoon in February 2026. A developer pushed what looked like a routine cleanup migration — dropping an old status column from the orders table. The column had already been removed from all the Eloquent models. Code review approved. Tests green. Deployment started.

Twenty seconds later, Sentry lit up like a Christmas tree.

A legacy API endpoint — rarely used, zero test coverage — was still hitting that column through a raw query. During the rolling deployment, older app instances were still alive and making requests. Every single one failed. Twenty-three minutes of partial downtime. An awkward conversation with the CTO.

That story, shared on Reddit in February 2026, is so painfully common it practically has its own genre. And it's not the worst version. In March 2026, a developer accidentally left a DROP TABLE IF EXISTS in a migration file from local testing. It got pushed to main. The automated deploy ran it against production. 1,247 customer records — gone.

Database migrations are where vibe coding meets its harshest reality check. The database doesn't care about your confidence. It executes what you tell it to.

Let's talk about how to get this right.


Why Migrations Are Uniquely Dangerous

Code deployments are (mostly) reversible. You push bad code, you roll back. But database migrations are a different beast:

  1. They're often irreversible — a dropped column takes its data with it
  2. They run against live data — not a test environment copy
  3. They can lock tables — blocking reads and writes for minutes on large tables
  4. Old and new code run simultaneously — during rolling deploys, both versions hit the same schema

That last point is what sank the deploy in our opening story. The developer thought about the new code. They forgot about the old code that was still running.

GitHub learned a similar lesson the hard way. Their engineering team documented a postmortem where a schema migration on a large MySQL table caused a semaphore deadlock across multiple read replicas. The final rename step — the last action of a migration that had taken weeks to complete — sent a cascade of replicas into crash-recovery mode. Not enough healthy replicas remained to handle production traffic. Core GitHub services went down.

Schema migrations are not a detail. They are a critical path operation.


The Expand/Contract Pattern: Your New Religion

The single most important concept in safe database migrations is expand/contract (also called parallel change). Stop thinking about migrations as atomic operations. Start thinking about them as a two-phase process across multiple deployments.

Phase 1 — Expand: Add new structure alongside old structure. Never remove anything.

Phase 2 — Contract: Once all code is updated and verified, remove the old structure.

Here's what this looks like in practice. Say you want to rename user_name to display_name in your users table.

The wrong way (single migration, single deploy):

-- DON'T DO THIS IN ONE SHOT
ALTER TABLE users RENAME COLUMN user_name TO display_name;

The moment this runs, any old app instances still reading user_name break.

The right way (three deployments, two migrations):

Deploy 1 — Expand migration:

-- Add the new column, copy existing data
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
UPDATE users SET display_name = user_name;

Deploy 1 — Application code: Write to both columns, read from display_name.

# Write to both during transition
def update_user_name(user_id, new_name):
    db.execute("""
        UPDATE users 
        SET display_name = %s, user_name = %s 
        WHERE id = %s
    """, (new_name, new_name, user_id))

Deploy 2 — Application code: Read from display_name only. Stop writing to user_name.
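The flip from dual-write to single-write is easier to audit if it's driven by a flag rather than scattered edits. Here's a minimal sketch of that idea — `build_user_name_update` and `WRITE_LEGACY_COLUMN` are hypothetical names, not part of any framework — where the helper returns the SQL and parameters so the transition is one toggle:

```python
WRITE_LEGACY_COLUMN = True  # flip to False at Deploy 2

def build_user_name_update(user_id, new_name, write_legacy=WRITE_LEGACY_COLUMN):
    """Return (sql, params) for the user_name -> display_name transition."""
    if write_legacy:
        # Deploy 1: dual-write so old instances reading user_name keep working
        sql = "UPDATE users SET display_name = %s, user_name = %s WHERE id = %s"
        params = (new_name, new_name, user_id)
    else:
        # Deploy 2: write only the new column; user_name is now dead weight
        sql = "UPDATE users SET display_name = %s WHERE id = %s"
        params = (new_name, user_id)
    return sql, params
```

The payoff is that Deploy 2 becomes a one-line config change you can revert instantly if anything old is still reading `user_name`.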

Deploy 3 — Contract migration (after verifying everything is stable):

-- Only now, after code is fully deployed and verified
ALTER TABLE users DROP COLUMN user_name;

This feels slower. It is slower. It's also the only way to rename a column in a live system without causing an outage.


The Non-Null Column Trap

Here's another classic that surprises even experienced developers:

# Alembic migration
def upgrade():
    op.add_column('products', 
        sa.Column('sku', sa.String(50), nullable=False)
    )

On a table with a million rows and no default value, this migration will either fail immediately (Postgres) or lock the table for minutes while it backfills (older MySQL). Either way, your application is having a bad time.

The safe version:

def upgrade():
    # Step 1: Add as nullable with a default
    op.add_column('products',
        sa.Column('sku', sa.String(50), nullable=True, server_default='')
    )
    
    # Step 2: Backfill in batches (don't do it all at once)
    op.execute("""
        UPDATE products SET sku = CONCAT('LEGACY-', id::text)
        WHERE sku IS NULL OR sku = ''
    """)
    
    # Step 3: Add the NOT NULL constraint only after data exists
    op.alter_column('products', 'sku', nullable=False)

For truly large tables, even that batch UPDATE can be dangerous. The professional move is to do the backfill in application code, in batches with delays, outside the migration itself:

# Separate backfill script — run this BEFORE adding the NOT NULL constraint
import time

def backfill_sku_in_batches(batch_size=1000):
    total = 0
    while True:
        # Always select from the front of the remaining rows: once a row is
        # updated it no longer matches sku IS NULL, so using OFFSET here
        # would silently skip unprocessed rows.
        updated = db.execute("""
            UPDATE products
            SET sku = CONCAT('LEGACY-', id::text)
            WHERE id IN (
                SELECT id FROM products
                WHERE sku IS NULL
                ORDER BY id
                LIMIT %s
            )
        """, (batch_size,)).rowcount

        if updated == 0:
            break

        total += updated
        time.sleep(0.1)  # Give the database room to breathe
        print(f"Backfilled {total} rows...")

Locking: The Silent Killer

Adding an index to a large table takes a write lock in older MySQL versions. Adding a column with a DEFAULT in PostgreSQL before version 11 rewrote the entire table under an exclusive lock. Even in modern databases, some operations are more disruptive than you'd expect.

For Postgres, always prefer the concurrent index build:

-- This blocks writes for the duration (BAD on large tables)
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- This doesn't (GOOD — takes longer but non-blocking)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

For Alembic, you'll need a custom operation:

from alembic import op

def upgrade():
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
    # Alembic's autocommit_block (1.2+) handles this cleanly,
    # instead of the old op.execute('COMMIT') hack.
    with op.get_context().autocommit_block():
        op.execute(
            'CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id)'
        )

For Django, use AddIndexConcurrently from django.contrib.postgres.operations (available since Django 3.0) in a migration with atomic = False, or use the django-pg-zero-downtime-migrations package, which wraps common operations with safe defaults.
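For illustration, a Django migration using AddIndexConcurrently might look like the sketch below. The app name, model name, and dependency are hypothetical — adapt them to your project. The key details are real: AddIndexConcurrently lives in django.contrib.postgres.operations, and the migration must declare atomic = False because CONCURRENTLY cannot run inside a transaction.

```python
# migrations/0015_orders_user_id_idx.py — a sketch, names are placeholders
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models

class Migration(migrations.Migration):
    # Required: CREATE INDEX CONCURRENTLY cannot run in a transaction
    atomic = False

    dependencies = [("orders", "0014_previous")]

    operations = [
        AddIndexConcurrently(
            model_name="order",
            index=models.Index(fields=["user_id"], name="idx_orders_user_id"),
        ),
    ]
```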


Your Pre-Migration Checklist

Before any migration touches production, run through this:

# 1. Take a snapshot backup right before
pg_dump production_db > backup_$(date +%Y%m%d_%H%M%S).sql

# 2. Test the migration against a production-size dataset
# (staging with a copy of prod data, NOT a small dev dataset)
python manage.py migrate --database=staging_prod_copy

# 3. Time the migration on that dataset
time psql staging_db < migration.sql

# 4. Verify rollback works
python manage.py migrate myapp 0014  # Roll back to previous

# 5. Check for destructive operations
grep -r "DROP TABLE\|DROP COLUMN\|TRUNCATE" migrations/

In CI, add a step that blocks merges with destructive migrations:

# .github/workflows/migration-safety.yml
- name: Check for destructive migrations
  run: |
    if grep -rn "DROP TABLE\|TRUNCATE TABLE" migrations/; then
      echo "::error::Destructive migration detected. Manual review required."
      exit 1
    fi

The Rollback Strategy Nobody Has

According to research from Cloudficient, 23% of organizations experience some data loss during migration. A defining theme across major outages in 2025 was teams discovering they had no viable rollback — either no backup existed, the backup hadn't been tested, or the backup was six hours old. The GitLab database incident that wiped six hours of production data happened because backups were misconfigured and untested.

Before every production migration:

  1. Verify your backup exists and is recent — check the timestamp, don't assume
  2. Test the restore process — on a separate environment, not in theory
  3. Document the rollback SQL — write the down-migration before you need it
  4. Set a rollback decision point — if X errors occur within Y minutes, roll back immediately
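That last item works best when it's encoded up front rather than decided under pressure mid-incident. Here's a minimal sketch of a deployment-window watcher — `get_error_count` is a hypothetical hook into whatever monitoring system you use (Sentry, Datadog, etc.), not a real API:

```python
import time

def watch_deploy(get_error_count, threshold=50, window_seconds=300, poll=10):
    """Poll error counts during the deployment window.

    Returns True the moment the rollback threshold is crossed
    ("X errors within Y minutes"), False if the window passes cleanly.
    get_error_count is a caller-supplied hook into your monitoring system.
    """
    deadline = time.monotonic() + window_seconds
    total = 0
    while time.monotonic() < deadline:
        total += get_error_count()
        if total >= threshold:
            return True  # roll back now, investigate later
        time.sleep(poll)
    return False  # window passed cleanly
```

Wiring this into your deploy script turns "someone should watch the dashboards" into an automatic, pre-agreed decision.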

A rollback plan you've never tested is not a rollback plan. It's a hope.


Production Migration Checklist

  • Backup taken and timestamp verified within the last 30 minutes
  • Migration tested against production-scale data in staging
  • Migration timing measured (know how long it will take)
  • No destructive operations (DROP, TRUNCATE, RENAME) without explicit expand/contract plan
  • New NOT NULL columns have defaults or are added nullable first
  • Indexes created with CONCURRENTLY (Postgres) or equivalent
  • Application code is compatible with both old AND new schema
  • Rollback SQL written and tested
  • Monitoring/alerting active for the deployment window
  • Team member on standby with rollback authority
  • CI pipeline blocks merges with unreviewed destructive migrations

Ask The Guild

What's the scariest database migration story from your production experience? Did you ever drop a column that was still in use? Run a migration without a backup? Discover a rollback plan that didn't work? Share your war story in the Guild — your mistake might save someone else's data. And if you've developed a migration safety workflow that you'd swear by, we want to hear that too.


Tom Hundley is a software architect with 25 years of experience and the author of the Production Ready series. He has personally caused, witnessed, and survived more database incidents than he cares to count.
