Database Migrations in Production — Zero-Downtime Schema Changes at Scale
By Sanjeev Sharma (@webcoderspeed1)
Introduction
Schema changes that block production cause outages. ALTER TABLE takes an exclusive lock, and on older PostgreSQL versions adding a column with a default rewrites the entire table, blocking traffic for minutes on a 100GB table. This post covers production-safe migration patterns: the expand/contract pattern, instant columns, online schema change tools, and rolling deployments that never block.
- Why ALTER TABLE Locks Tables
- Expand/Contract Pattern for Zero-Downtime Changes
- PostgreSQL 11+ Instant Column Addition
- Online Schema Changes with gh-ost
- pg_repack for Online Index Rebuilds
- Migration Testing with Production Data Subset
- Rolling Deployments with Backward-Compatible Schemas
- Migration Monitoring
- Database Migrations Checklist
- Conclusion
Why ALTER TABLE Locks Tables
PostgreSQL's ALTER TABLE acquires an ACCESS EXCLUSIVE lock, blocking reads and writes; before PostgreSQL 11, adding a column with a default also rewrote the whole table:
-- Pre-PostgreSQL 11: this blocks everything while the table is rewritten
ALTER TABLE orders ADD COLUMN status TEXT DEFAULT 'pending';
-- Real impact:
-- 1. ACCESS EXCLUSIVE lock acquired (fast, unless queued behind a long transaction)
-- 2. New column added to the table structure
-- 3. Entire table rewritten (minutes for tens of GB)
-- 4. All read and write operations blocked during step 3
-- 5. Lock released
-- Check current locks
SELECT
pid, usename, application_name, state, query,
EXTRACT(EPOCH FROM (NOW() - query_start)) AS duration_sec
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration_sec DESC;
-- Find blocking queries
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query
FROM pg_catalog.pg_locks l1
JOIN pg_catalog.pg_locks l2 ON l1.locktype = l2.locktype
AND l1.database IS NOT DISTINCT FROM l2.database
AND l1.relation IS NOT DISTINCT FROM l2.relation
AND l1.page IS NOT DISTINCT FROM l2.page
AND l1.tuple IS NOT DISTINCT FROM l2.tuple
AND l1.virtualxid IS NOT DISTINCT FROM l2.virtualxid
AND l1.transactionid IS NOT DISTINCT FROM l2.transactionid
AND l1.classid IS NOT DISTINCT FROM l2.classid
AND l1.objid IS NOT DISTINCT FROM l2.objid
AND l1.objsubid IS NOT DISTINCT FROM l2.objsubid
AND l1.pid != l2.pid
JOIN pg_catalog.pg_stat_activity AS blocked ON l1.pid = blocked.pid
JOIN pg_catalog.pg_stat_activity AS blocking ON l2.pid = blocking.pid
WHERE NOT l1.granted AND l2.granted;
Rewrite duration scales with table size: even a 10GB table can stay locked for minutes.
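Before reaching for heavier tools, a common safeguard is to set `lock_timeout` so DDL fails fast instead of queuing behind a long transaction (and blocking every query behind it in the lock queue). A minimal sketch, assuming a node-postgres-style client with an async `query()` method; the retry counts and timeout are illustrative:

```javascript
// Sketch: run DDL with a lock_timeout so it fails fast instead of
// queuing behind long-running transactions and blocking all traffic.
// `db` is assumed to expose a node-postgres-style async query(sql) method.
async function runDdlSafely(db, ddl, { lockTimeoutMs = 3000, retries = 5 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await db.query(`SET lock_timeout = '${lockTimeoutMs}ms'`);
      await db.query(ddl);
      return attempt; // succeeded on this attempt
    } catch (err) {
      if (attempt === retries) throw err;
      // Exponential backoff before retrying the lock acquisition
      await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
    }
  }
}
```

If the lock cannot be acquired within the timeout, the statement errors out and is retried later, rather than stalling production traffic.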
Expand/Contract Pattern for Zero-Downtime Changes
Decouple schema changes from application deployment:
-- PHASE 1: EXPAND (backward compatible)
-- Add new column with default, application still uses old column
ALTER TABLE orders ADD COLUMN status_new TEXT DEFAULT 'pending';
-- Create index on new column (non-blocking with CONCURRENTLY)
CREATE INDEX CONCURRENTLY idx_orders_status_new ON orders(status_new);
-- Backfill old data in batches (PostgreSQL has no UPDATE ... LIMIT,
-- so constrain each batch with a subquery)
UPDATE orders
SET status_new = status
WHERE id IN (
SELECT id FROM orders
WHERE status_new IS NULL
LIMIT 10000
);
-- PHASE 2: DUAL WRITE (application change)
-- Application now writes to BOTH old and new columns
-- Old application version still works (reads from status_new OR status)
-- In application code:
// Before: write to status only
await db.query('UPDATE orders SET status = $1 WHERE id = $2', [status, orderId]);
// Now: write to both columns
await db.query('UPDATE orders SET status = $1, status_new = $1 WHERE id = $2', [status, orderId]);
// During the transition: prefer the new column on reads
const order = await db.query(
`SELECT
id, amount,
COALESCE(status_new, status) AS status
FROM orders
WHERE id = $1`,
[orderId]
);
-- PHASE 3: CONTRACT (remove old column)
-- Only after every application instance reads and writes the new column
-- Verify with query monitoring that nothing still touches the old column
ALTER TABLE orders DROP COLUMN status;
-- DROP COLUMN is metadata-only (fast) and automatically drops any indexes on that column
The expand/contract pattern ensures zero downtime: expand is backward compatible, contract happens after full deployment.
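The Phase 1 backfill is usually driven by a small script that loops until no rows are left. A sketch assuming a node-postgres-style client whose `query()` resolves to an object with `rowCount`; table and column names follow the example above, and the batch size and pause are tuning knobs:

```javascript
// Batched backfill driver for Phase 1. Copies status into status_new
// in fixed-size batches until a batch comes back short, meaning done.
async function backfillStatus(db, batchSize = 10000) {
  let total = 0;
  for (;;) {
    const res = await db.query(
      `UPDATE orders SET status_new = status
       WHERE id IN (
         SELECT id FROM orders
         WHERE status_new IS NULL
         LIMIT $1
       )`,
      [batchSize]
    );
    total += res.rowCount;
    if (res.rowCount < batchSize) return total; // nothing left to copy
    // Small pause keeps WAL volume and replication lag in check
    await new Promise((r) => setTimeout(r, 100));
  }
}
```

Short batches hold row locks only briefly, so concurrent writes to `orders` are never blocked for long.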
PostgreSQL 11+ Instant Column Addition
PostgreSQL 11+ adds columns instantly (no table rewrite) with defaults:
-- INSTANT: PostgreSQL 11+
-- Metadata-only: a brief ACCESS EXCLUSIVE lock, but no table rewrite
ALTER TABLE orders ADD COLUMN created_at TIMESTAMP DEFAULT NOW();
-- Verify the fast path was used: atthasmissing = true means the default
-- was stored as a "missing value" instead of rewriting every row
SELECT attname, atthasdef, atthasmissing
FROM pg_attribute
WHERE attrelid = 'orders'::regclass AND attname = 'created_at';
-- NOT instant (table rewrite or full-table scan required):
-- - Column with a volatile DEFAULT (e.g. random(), clock_timestamp())
-- - Column with GENERATED ALWAYS AS ... STORED (full rewrite)
-- - Column with a CHECK constraint (full scan to validate)
-- Note: ADD COLUMN ... NOT NULL without a DEFAULT fails outright on a non-empty table
-- Example of a non-instant addition (rewrites the table)
ALTER TABLE orders ADD COLUMN total_with_tax DECIMAL GENERATED ALWAYS AS (total * 1.08) STORED;
-- This WILL lock the table for the duration of the rewrite
-- Smart alternatives for non-instant columns
-- Option 1: Use computed column on SELECT (no storage cost)
-- SELECT id, amount, amount * 1.08 AS total_with_tax FROM orders;
-- Option 2: add the column plain (instant, nullable), then keep it
-- populated with a trigger for new rows plus a batched backfill
ALTER TABLE orders ADD COLUMN total_with_tax DECIMAL;
CREATE FUNCTION populate_total_with_tax() RETURNS TRIGGER AS $$
BEGIN
NEW.total_with_tax := NEW.total * 1.08;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tg_populate_total_with_tax
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
EXECUTE FUNCTION populate_total_with_tax();
-- Option 3: Computed column on application side (if small computation)
const order = { id: 1, total: 100 };
order.total_with_tax = order.total * 1.08;
Instant columns are PostgreSQL 11+ magic. Use them whenever possible.
Online Schema Changes with gh-ost
gh-ost, GitHub's online schema change tool for MySQL, performs ALTER TABLE without blocking writes (it is the MySQL counterpart to the PostgreSQL techniques above):
#!/bin/bash
# Install gh-ost
# https://github.com/github/gh-ost/releases
# Basic usage: ALTER a table with no locks
gh-ost \
--user=app \
--password='secret' \
--host=primary.db.internal \
--allow-on-master \
--database=myapp \
--table=orders \
--alter="ADD COLUMN status TEXT DEFAULT 'pending'" \
--execute
# How gh-ost works (triggerless, unlike pt-online-schema-change):
# 1. Creates an empty copy of the table (_orders_gho)
# 2. Applies the ALTER to the copy
# 3. Tails the binary log to capture ongoing changes to the original
# 4. Copies existing rows to the new table in chunks
# 5. Swaps tables (atomic rename)
# 6. Cleans up the old table
# Complex alter example (add NOT NULL column)
gh-ost \
--user=app \
--password='secret' \
--host=primary.db.internal \
--database=myapp \
--table=orders \
--alter="ADD COLUMN status TEXT NOT NULL DEFAULT 'pending'" \
--approve-renamed-table \
--execute
# Test without executing (dry-run)
gh-ost \
--user=app \
--password='secret' \
--host=primary.db.internal \
--database=myapp \
--table=orders \
--alter="ADD COLUMN status TEXT" \
--test-on-replica # Run on replica first
# Monitor progress
# gh-ost outputs:
# 2026-03-15 10:30:45 Migrating myapp.orders
# 2026-03-15 10:30:45 Copy progress: 10%, ETA: 2m15s
# 2026-03-15 10:32:45 Copy progress: 100%
# 2026-03-15 10:32:45 Swapping tables...
# 2026-03-15 10:32:46 Migration complete
# Production best practices
gh-ost \
--user=app \
--password='secret' \
--host=primary.db.internal \
--database=myapp \
--table=orders \
--alter="ADD COLUMN status TEXT DEFAULT 'pending'" \
--throttle-control-replicas=replica-1.db,replica-2.db \
--max-load="Threads_running=25,Threads_connected=50" \
--critical-load="Threads_running=100,Threads_connected=200" \
--critical-load-hibernate-seconds=60 \
--approve-renamed-table \
--execute
# Throttling options:
# --max-load: pause the row copy when server load exceeds the threshold
# --critical-load: bail out if load gets critical (or hibernate for
#   --critical-load-hibernate-seconds instead of bailing)
# --max-lag-millis: pause when replica lag exceeds this threshold
# --throttle-control-replicas: which replicas to measure lag on
gh-ost handles complex schema changes on large tables with zero downtime.
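The throttling flags above all reduce to the same decision loop: measure load and lag, and pause the copy when either crosses a threshold. A hedged sketch of that logic (threshold values are illustrative, not gh-ost's defaults):

```javascript
// Throttle decision in the style of gh-ost: pause row copying whenever
// replica lag or server load exceeds the configured limits.
function shouldThrottle(metrics, limits = { maxLagMs: 1500, maxThreadsRunning: 25 }) {
  return (
    metrics.lagMs > limits.maxLagMs ||
    metrics.threadsRunning > limits.maxThreadsRunning
  );
}
```

The migration itself never fails when throttled; it simply stops copying rows until the next measurement comes back healthy.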
pg_repack for Online Index Rebuilds
Rebuild bloated indexes without locking:
#!/bin/bash
# Install pg_repack (PostgreSQL extension)
# psql -d myapp -c "CREATE EXTENSION pg_repack;"
# Rebuild table (remove bloat)
pg_repack \
--host=primary.db.internal \
--username=app \
--password \
--dbname=myapp \
--table=orders \
--jobs=4 # Parallel jobs
# Rebuild index (remove bloat)
pg_repack \
--host=primary.db.internal \
--username=app \
--dbname=myapp \
--index=idx_orders_user_id
# Check bloat before repack (pg_stat_user_tables uses relname, not tablename)
psql -d myapp -c "
SELECT
schemaname, relname,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||relname)) AS size,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows
FROM pg_stat_user_tables
WHERE relname = 'orders';
"
# After repack (size reduced, performance improved)
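Whether a repack is worth scheduling can be decided from those same `pg_stat_user_tables` counters. A small helper; the 20% dead-tuple threshold is an illustrative starting point, not a pg_repack default:

```javascript
// Decide whether a table is bloated enough to schedule a repack,
// based on live vs dead tuple counts from pg_stat_user_tables.
function needsRepack({ liveTuples, deadTuples }, maxDeadRatio = 0.2) {
  const total = liveTuples + deadTuples;
  if (total === 0) return false;
  return deadTuples / total > maxDeadRatio;
}
```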
Migration Testing with Production Data Subset
Test migrations on production-like data before deploying:
#!/bin/bash
# 1. Clone production database to staging
pg_basebackup \
--host=primary.db.internal \
--username=replication_user \
--checkpoint=fast \
--progress \
--wal-method=stream \
--format=tar \
--gzip \
--pgdata=/mnt/backup/staging
# 2. Restore to staging (tar format writes base.tar.gz into the target directory)
tar -xzf /mnt/backup/staging/base.tar.gz -C /var/lib/postgresql/staging/
# 3. Trim large tables down to a recent subset
psql -d myapp_staging -c "
DELETE FROM events WHERE created_at < NOW() - INTERVAL '30 days';
DELETE FROM logs WHERE created_at < NOW() - INTERVAL '7 days';
"
# 4. Test migration
psql -d myapp_staging -f migrations/001_add_status.sql
# 5. Run integration tests
npm test -- --database=staging
# 6. Measure time and resource impact
psql -d myapp_staging -c "EXPLAIN ANALYZE SELECT COUNT(*) FROM orders;"
# 7. Verify backward compatibility
# Run old application version against new schema
# Verify all queries still work
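Step 7 can be automated: replay the previous app version's critical queries against the migrated schema and collect anything that breaks. A sketch assuming a node-postgres-style client; the query list is whatever your old version actually runs:

```javascript
// Replay the old app version's queries against the migrated schema
// and return the ones that now fail (empty array means compatible).
async function verifyBackwardCompatibility(db, queries) {
  const failures = [];
  for (const sql of queries) {
    try {
      await db.query(sql);
    } catch (err) {
      failures.push({ sql, error: err.message });
    }
  }
  return failures;
}
```

Run this in CI against the staging clone before every contract-phase migration.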
Rolling Deployments with Backward-Compatible Schemas
Deploy application and schema changes independently:
# 1. Deploy schema changes first (backward compatible)
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-expand
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: flyway/flyway:8.3
          command: ["flyway", "migrate"]
          env:
            - name: FLYWAY_URL
              value: jdbc:postgresql://primary.db/myapp
            - name: FLYWAY_LOCATIONS
              value: filesystem:/migrations
      restartPolicy: Never
---
# 2. Roll out the new application gradually
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2        # Add 2 pods at a time
      maxUnavailable: 1  # Remove 1 pod at a time
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v2.0.0-with-new-schema
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
# During the rolling update:
# - Some v1.0 pods still running (write the old column)
# - New v2.0 pods coming up (dual-write old and new columns)
# - Traffic gradually shifts to v2.0
# - v1.0 pods terminated one at a time until all pods run v2.0
# 3. Contract the old schema (drop the old column)
#    only after v2.0 has been stable in production (e.g. 24 hours)
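During the rolling window both versions run at once, so reads must tolerate either column being authoritative. The `COALESCE(status_new, status)` shown in the expand/contract section has a simple application-side equivalent:

```javascript
// Application-side equivalent of COALESCE(status_new, status):
// prefer the new column, fall back to the old one while both exist.
function effectiveStatus(row) {
  return row.status_new ?? row.status;
}
```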
Migration Monitoring
Track migration health and catch issues early:
-- Monitor active migrations
SELECT
pid, usename, application_name,
query, query_start,
EXTRACT(EPOCH FROM (NOW() - query_start)) AS duration_sec,
wait_event_type
FROM pg_stat_activity
WHERE query LIKE '%ALTER TABLE%'
OR query LIKE '%CREATE INDEX%'
OR application_name LIKE '%gh-ost%';
-- Monitor replication lag (run on a replica)
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Alert if lag exceeds ~10 seconds:
-- - Pause batched backfills until the replica catches up
-- - Investigate slow queries on the replica
-- Spot in-progress or leftover shadow tables from online schema change
-- tools (note the escaped underscore: a bare _ is a LIKE wildcard)
SELECT
schemaname, tablename,
CASE
WHEN tablename LIKE '%\_gho' THEN 'In-progress/leftover online migration'
ELSE 'Normal table'
END AS status
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY tablename;
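The `pg_stat_activity` query above feeds naturally into a watchdog that flags long-running DDL before lock queues build up. A sketch operating on already-fetched rows; the field names match the query's aliases and the 30-second threshold is illustrative:

```javascript
// Flag any migration-related statement running longer than a threshold
// so an operator can decide to cancel it before traffic queues behind it.
function longRunningDdl(activityRows, maxSeconds = 30) {
  return activityRows.filter(
    (row) =>
      /ALTER TABLE|CREATE INDEX/i.test(row.query) &&
      row.duration_sec > maxSeconds
  );
}
```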
Database Migrations Checklist
- All ALTER TABLEs tested on staging with production-like data
- Instant column addition (PostgreSQL 11+) used for columns with constant, non-volatile defaults
- Expand/contract pattern applied for safe, reversible changes
- Large table alters use gh-ost or pg_repack (no locks)
- Indexes created CONCURRENTLY (non-blocking)
- Schema changes deployed 24 hours before application change
- Application updates dual-write old/new columns during transition
- Rolling deployments used (no big-bang switches)
- Replication lag monitored during migrations
- Rollback procedure documented and tested
Conclusion
Zero-downtime schema changes require planning and the right tools. Use instant columns for PostgreSQL 11+. Apply the expand/contract pattern for complex changes. For large tables, use gh-ost or pg_repack to avoid locks entirely. Test migrations on production data subsets before deploying. Deploy schema changes first (backwards compatible), then roll out application updates gradually. Never do big-bang migrations during business hours. With these patterns, schema changes become just another part of your deployment pipeline.