Zero-Downtime Database Migrations

Strategies for running database migrations without service interruptions.

August 28, 2025 Backend , Engineering 2 min read

Database migrations are the most dangerous part of any deployment. A bad migration can lock tables for hours, corrupt data, or take down your entire application. At scale, you can’t afford downtime — which means you can’t afford traditional migrations.

We’ve deployed 200+ schema changes to production with zero downtime using the expand-contract pattern. Here’s how it works.

A traditional migration looks like this:

-- DANGEROUS: This locks the orders table during migration
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';

On a table with 50M rows, this ALTER TABLE can take 30+ minutes. During that time:

Writes are blocked
Reads may be blocked (depending on the database)
Your application is degraded or down

Instead of changing the schema in one step, we do it in three phases across multiple deployments:

-- SAFE: Adding a nullable column is fast (metadata-only in PostgreSQL 11+)
ALTER TABLE orders ADD COLUMN status VARCHAR(50);

This is fast because the column is nullable with no default. Existing rows have NULL for the new column.

# Application code: write to BOTH old and new columns
def create_order(data: dict):
    db.execute("""
        INSERT INTO orders (customer_id, total, old_status, status)
        VALUES ($1, $2, $3, $4)
    """, data['customer_id'], data['total'],
               data['old_status'], map_status(data['old_status']))

Deploy this code change first. Now both columns are being written.

# Backfill script — runs in batches to avoid locking
def backfill_status(batch_size: int = 1000):
    offset = 0
    while True:
        rows = db.execute("""
            SELECT id, old_status FROM orders
            WHERE status IS NULL
            LIMIT $1 OFFSET $2
        """, batch_size, offset)

        if not rows:
            break

        for row in rows:
            db.execute("""
                UPDATE orders SET status = $1 WHERE id = $2
            """, map_status(row['old_status']), row['id'])

        offset += batch_size
        time.sleep(0.1)  # Don't overwhelm the database

Run this as a background job. It may take hours for large tables, but it doesn’t block normal operations.

Once the backfill is complete and all application code reads from the new column:

-- SAFE: The old column is no longer used
ALTER TABLE orders DROP COLUMN old_status;

Deploy this in a separate release, after confirming no code references the old column.

Adding an index on a large table can lock writes. The solution:

-- PostgreSQL: CREATE INDEX CONCURRENTLY doesn't block writes
CREATE INDEX CONCURRENTLY idx_orders_customer_id
    ON orders (customer_id);

# In your migration tool
class Migration:
    def up(self):
        # CONCURRENTLY is essential for zero-downtime
        self.execute("CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id)")

    def down(self):
        self.execute("DROP INDEX CONCURRENTLY idx_orders_customer_id")

Note: CONCURRENTLY takes longer than a regular index creation, but it doesn’t block writes.

Column renames require the full expand-contract cycle:

-- Phase 1: Add new column
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Deploy code that writes to both, reads from new

# Code reads from new, falls back to old
def get_display_name(user: dict) -> str:
    return user.get('display_name') or user.get('name')

-- Phase 2: Backfill
UPDATE users SET display_name = name WHERE display_name IS NULL;

-- Phase 3: Drop old column (in a later deployment)
ALTER TABLE users DROP COLUMN name;

Every migration must be reversible:

class AddOrderStatus(Migration):
    def up(self):
        self.execute("ALTER TABLE orders ADD COLUMN status VARCHAR(50)")

    def down(self):
        # Only safe if no data has been written to the new column
        # Otherwise, you'll lose data
        self.execute("ALTER TABLE orders DROP COLUMN status")

For irreversible migrations (like deleting a column with data), the rollback requires restoring from backup — which is why we avoid destructive operations.

We use a custom migration runner that enforces safety rules:

class SafeMigrationRunner:
    DANGEROUS_OPERATIONS = [
        'DROP TABLE',
        'DROP COLUMN',
        'ALTER TABLE.*SET DEFAULT',  # Rewrites all rows
        'CREATE INDEX ',  # Without CONCURRENTLY
    ]

    def validate(self, migration: Migration):
        sql = migration.sql
        for pattern in self.DANGEROUS_OPERATIONS:
            if re.search(pattern, sql, re.IGNORECASE):
                if not migration.has_safety_flag:
                    raise UnsafeMigrationError(
                        f"Migration contains dangerous operation: {pattern}. "
                        f"Add safety flag or split into expand-contract phases."
                    )

Never block writes in production — if a migration locks tables, it’s the wrong migration
Expand-contract is slower but safer — accept the extra deployments
Backfill in batches — never update millions of rows in a single transaction
Test migrations on production-sized data — a migration that works on 1K rows may fail on 50M
Have a rollback plan — every migration should be reversible without data loss

Questions about database migrations? Find me on GitHub or Twitter.

Database Sharding Strategies

Scaling databases with horizontal partitioning techniques.

backend database sharding

GraphQL vs REST in 2026

Comparing GraphQL and REST APIs in modern application development.

backend graphql rest

Message Queue Comparison

Comparing Kafka, RabbitMQ, and Redis Streams for different use cases.

backend kafka rabbitmq

Related Posts

Database Sharding Strategies

GraphQL vs REST in 2026

Message Queue Comparison