Published on

Schema Change Breaking Older Services — When Your Database Migration Breaks Half the Fleet

Authors

Introduction

In a rolling deployment, old and new versions of your service run at the same time — sometimes for minutes, sometimes for hours if you have many instances or a blue/green setup. Any schema change that isn't backward-compatible will break the old version the moment you run it.

The Problem

Deployment timeline:

T=0:00  Old code: SELECT user_name FROM users  (column: user_name)
T=0:01  Migration runs: ALTER TABLE users RENAME COLUMN user_name TO full_name
T=0:02  New code: SELECT full_name FROM users  ✅
T=0:02  Old code (still running): SELECT user_name FROM users  ❌ ERROR
         → column "user_name" does not exist
T=0:08  Rolling deploy finishes — old code replaced
T=0:02-T=0:08: 6 minutes of errors

The root problem: the migration ran before all old instances were replaced.

The Expand/Contract Pattern for Schema Changes

Safe schema evolution happens in multiple phases across multiple deployments:

Renaming a Column

Wrong approach (single deployment):
  Phase 1: Run migration: RENAME COLUMN user_name → full_name
           Deploy new code that uses full_name
  Problem: Old code breaks during rolling deploy

Right approach (three deployments):

Phase 1: ExpandAdd new column, keep old column
  Migration: ALTER TABLE users ADD COLUMN full_name VARCHAR;
  Code: Read from user_name, write to BOTH user_name AND full_name
  Deploy this code. Old code still works (user_name exists).
  Old code: reads user_name ✅ | New code: reads full_name ✅

Phase 2: BackfillCopy old data to new column
  UPDATE users SET full_name = user_name WHERE full_name IS NULL;

Phase 3: ContractDrop old column (after all old code is gone)
  Migration: ALTER TABLE users DROP COLUMN user_name;
  Code: Read and write only full_name
  Deploy. Old code is gone. Safe to drop.

Removing a Column

Wrong:
  Remove column in migration + remove from code in same deploy

Right:
  Deploy 1: Remove column from code (stop reading/writing it)
            (Existing column still in DB — old code still works, new code ignores it)
  Deploy 2: Remove column from DB
            (No code touches it anymore — safe to drop)

Adding a NOT NULL Column

-- ❌ Wrong: single migration breaks old code that doesn't set this column
ALTER TABLE orders ADD COLUMN priority VARCHAR NOT NULL DEFAULT 'normal';
-- Old code doesn't know about priority — inserts fail!

-- ✅ Right: three phases
-- Phase 1: Add with default (nullable)
ALTER TABLE orders ADD COLUMN priority VARCHAR DEFAULT 'normal';

-- Phase 2: Update code to write priority, backfill existing rows
UPDATE orders SET priority = 'normal' WHERE priority IS NULL;

-- Phase 3: Add NOT NULL constraint (after all old code is gone)
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;

Backward-Compatible Column Changes

Some changes are safe in a single deployment:

-- ✅ Safe: adding a nullable column (old code ignores it)
ALTER TABLE users ADD COLUMN avatar_url VARCHAR;

-- ✅ Safe: widening a VARCHAR (old code values still fit)
ALTER TABLE users ALTER COLUMN username TYPE VARCHAR(255);

-- ✅ Safe: adding a new table (old code doesn't know about it, that's fine)
CREATE TABLE user_preferences (...);

-- ❌ Unsafe: narrowing a VARCHAR (old code might write values that are too long)
ALTER TABLE users ALTER COLUMN username TYPE VARCHAR(20);

-- ❌ Unsafe: renaming (old code uses old name)
ALTER TABLE users RENAME COLUMN email TO email_address;

-- ❌ Unsafe: dropping a column (old code still reads it)
ALTER TABLE users DROP COLUMN legacy_field;

-- ❌ Unsafe: adding NOT NULL without default (old code doesn't set the column)
ALTER TABLE users ADD COLUMN required_field VARCHAR NOT NULL;

-- ❌ Unsafe: changing column type incompatibly
ALTER TABLE users ALTER COLUMN age TYPE VARCHAR;

Version-Gating Migrations

For teams that can't always do multi-phase deploys, code-gate the schema change:

// Feature flag: new code uses new column, old code uses old column
// Both exist simultaneously until rollout is complete

async function getUserName(userId: string) {
  const user = await db.query('SELECT user_name, full_name FROM users WHERE id = $1', [userId])

  // Use full_name if available (new column), fall back to user_name (old column)
  return user.full_name ?? user.name
}

async function updateUserName(userId: string, name: string) {
  // Write to both columns during transition
  await db.query(
    'UPDATE users SET user_name = $1, full_name = $1 WHERE id = $2',
    [name, userId]
  )
}

Contract Testing for Database Schema

Catch schema compatibility issues in CI before they hit production:

// schema-contract.test.ts
// Verifies old code still works after new migration runs

describe('Schema backward compatibility', () => {
  it('old queries still work after migration', async () => {
    // Run the migration
    await runMigrations()

    // Simulate old code queries
    const oldCodeQuery = 'SELECT user_name, email FROM users WHERE id = $1'
    const result = await db.query(oldCodeQuery, [testUserId])

    // Old query should still work
    expect(result.rows[0]).toHaveProperty('user_name')
  })
})

Schema Change Checklist

  • ✅ Never rename or drop columns in the same deploy as the code change
  • ✅ Use expand/contract: add new column → backfill → drop old column (3 deploys)
  • ✅ New columns must have defaults or be nullable (old code can't set unknown columns)
  • ✅ Test migrations against old code version in CI
  • ✅ Review migrations with backward-compatibility in mind before merging
  • ✅ Document multi-phase migrations — note which deploys are prerequisites

Conclusion

Rolling deployments and schema changes are fundamentally at odds unless you follow the expand/contract pattern. The rule is: the database must be compatible with both the current and previous version of the code at all times. This means adding columns before the code that uses them, and removing columns only after all code that references them is gone. It's more work — three deploys instead of one — but it's the only way to change schema without breaking running instances.