Zero-Downtime Database Migrations
Strategies for running database migrations without service interruptions.
Database migrations are the most dangerous part of any deployment. A bad migration can lock tables for hours, corrupt data, or take down your entire application. At scale, you can’t afford downtime — which means you can’t afford traditional migrations.
We’ve deployed 200+ schema changes to production with zero downtime using the expand-contract pattern. Here’s how it works.
A traditional migration looks like this:
-- DANGEROUS: This locks the orders table during migrationALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';On a table with 50M rows, this ALTER TABLE can take 30+ minutes. During that time:
- Writes are blocked
- Reads may be blocked (depending on the database)
- Your application is degraded or down
Instead of changing the schema in one step, we do it in three phases across multiple deployments:
-- SAFE: Adding a nullable column is fast (metadata-only in PostgreSQL 11+)ALTER TABLE orders ADD COLUMN status VARCHAR(50);This is fast because the column is nullable with no default. Existing rows have NULL for the new column.
# Application code: write to BOTH old and new columnsdef create_order(data: dict): db.execute(""" INSERT INTO orders (customer_id, total, old_status, status) VALUES ($1, $2, $3, $4) """, data['customer_id'], data['total'], data['old_status'], map_status(data['old_status']))Deploy this code change first. Now both columns are being written.
# Backfill script — runs in batches to avoid lockingdef backfill_status(batch_size: int = 1000): offset = 0 while True: rows = db.execute(""" SELECT id, old_status FROM orders WHERE status IS NULL LIMIT $1 OFFSET $2 """, batch_size, offset)
if not rows: break
for row in rows: db.execute(""" UPDATE orders SET status = $1 WHERE id = $2 """, map_status(row['old_status']), row['id'])
offset += batch_size time.sleep(0.1) # Don't overwhelm the databaseRun this as a background job. It may take hours for large tables, but it doesn’t block normal operations.
Once the backfill is complete and all application code reads from the new column:
-- SAFE: The old column is no longer usedALTER TABLE orders DROP COLUMN old_status;Deploy this in a separate release, after confirming no code references the old column.
Adding an index on a large table can lock writes. The solution:
-- PostgreSQL: CREATE INDEX CONCURRENTLY doesn't block writesCREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);# In your migration toolclass Migration: def up(self): # CONCURRENTLY is essential for zero-downtime self.execute("CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id)")
def down(self): self.execute("DROP INDEX CONCURRENTLY idx_orders_customer_id")Note: CONCURRENTLY takes longer than a regular index creation, but it doesn’t block writes.
Column renames require the full expand-contract cycle:
-- Phase 1: Add new columnALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Deploy code that writes to both, reads from new# Code reads from new, falls back to olddef get_display_name(user: dict) -> str: return user.get('display_name') or user.get('name')-- Phase 2: BackfillUPDATE users SET display_name = name WHERE display_name IS NULL;
-- Phase 3: Drop old column (in a later deployment)ALTER TABLE users DROP COLUMN name;Every migration must be reversible:
class AddOrderStatus(Migration): def up(self): self.execute("ALTER TABLE orders ADD COLUMN status VARCHAR(50)")
def down(self): # Only safe if no data has been written to the new column # Otherwise, you'll lose data self.execute("ALTER TABLE orders DROP COLUMN status")For irreversible migrations (like deleting a column with data), the rollback requires restoring from backup — which is why we avoid destructive operations.
We use a custom migration runner that enforces safety rules:
class SafeMigrationRunner: DANGEROUS_OPERATIONS = [ 'DROP TABLE', 'DROP COLUMN', 'ALTER TABLE.*SET DEFAULT', # Rewrites all rows 'CREATE INDEX ', # Without CONCURRENTLY ]
def validate(self, migration: Migration): sql = migration.sql for pattern in self.DANGEROUS_OPERATIONS: if re.search(pattern, sql, re.IGNORECASE): if not migration.has_safety_flag: raise UnsafeMigrationError( f"Migration contains dangerous operation: {pattern}. " f"Add safety flag or split into expand-contract phases." )- Never block writes in production — if a migration locks tables, it’s the wrong migration
- Expand-contract is slower but safer — accept the extra deployments
- Backfill in batches — never update millions of rows in a single transaction
- Test migrations on production-sized data — a migration that works on 1K rows may fail on 50M
- Have a rollback plan — every migration should be reversible without data loss
Questions about database migrations? Find me on GitHub or Twitter.