Split Brain Scenario — When Your Cluster Can't Agree on Who's in Charge

Introduction

Your Redis cluster has 3 nodes. A network switch fails. Nodes 1 and 2 can still communicate, but neither can reach Node 3. Node 3, unable to reach the others, promotes itself to primary. Meanwhile, Nodes 1 and 2 still have a primary of their own.

Two primaries accepting writes. Network heals. Which writes win?

Why Split Brain Happens

Normal state:
  [Primary] ←sync→ [Replica 1] ←sync→ [Replica 2]

Network partition:
  [Primary]  ✕  [Replica 1] ←sync→ [Replica 2]

  Replica 1 + 2 detect primary unreachable
  They elect Replica 1 as new primary
  Now: [OLD Primary (accepts writes)] AND [NEW Primary (accepts writes)]

Network heals:
  Both primaries have conflicting data
  One must be demoted — but which writes are lost?
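The conflict is easy to reproduce in a few lines. An illustrative in-memory sketch of two nodes that each believe they are primary during a partition (all names hypothetical):

```typescript
// Illustrative only: two "primaries" accepting writes during a partition.
// Their writes diverge; after the network heals, the same key holds
// different values on each side.
class NaivePrimary {
  private data = new Map<string, string>()
  set(key: string, value: string): void { this.data.set(key, value) }
  get(key: string): string | undefined { return this.data.get(key) }
}

const oldPrimary = new NaivePrimary() // isolated, still accepting writes
const newPrimary = new NaivePrimary() // promoted by the other partition

oldPrimary.set('user:42:email', 'old@example.com')
newPrimary.set('user:42:email', 'new@example.com')
// Network heals: the two sides disagree. Which write wins?
```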

Fix 1: Quorum — Require Majority Agreement

// The core principle: any operation requires agreement from a majority (quorum)
// With 3 nodes, quorum = 2 (majority)
// Neither partition of 1 node can form quorum alone
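The arithmetic behind the rule, as a minimal sketch (not tied to any particular system):

```typescript
// Strict majority: floor(n/2) + 1 nodes must agree.
function quorumSize(totalNodes: number): number {
  return Math.floor(totalNodes / 2) + 1
}

// A partition can only elect a leader if it holds a majority.
function canFormQuorum(partitionSize: number, totalNodes: number): boolean {
  return partitionSize >= quorumSize(totalNodes)
}

// 3 nodes: quorum is 2, so a lone node cannot elect itself.
// 5 nodes: quorum is 3, so even a 2-node partition stays read-only.
```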

// PostgreSQL with Patroni: configure synchronous replication quorum
// patroni.yml:
// bootstrap:
//   dcs:
//     synchronous_mode: true
//     synchronous_mode_strict: false
// postgresql:
//   parameters:
//     synchronous_standby_names: 'ANY 1 (replica1, replica2)'
//     # Write must be acknowledged by at least 1 replica
//     # If no replicas reachable → primary refuses writes → no split brain
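On the application side, Patroni exposes a REST health endpoint on each node; GET /primary is expected to return 200 only on the current leader (an assumption to verify against your Patroni version's docs). A hedged sketch, with the fetch function injectable so the check can be exercised without a live cluster:

```typescript
// Sketch: ask a Patroni node whether it is the current leader.
// GET /primary returning 200 on the leader is an assumption; confirm
// the endpoint name for your Patroni version.
async function isPatroniPrimary(
  baseUrl: string,
  fetchFn: (url: string) => Promise<{ status: number }> = fetch,
): Promise<boolean> {
  const res = await fetchFn(`${baseUrl}/primary`)
  return res.status === 200
}
```

A load balancer can use the same endpoint as a health check so writes are only routed to the node answering 200.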

// Redis Sentinel: configure quorum
// sentinel.conf:
// sentinel monitor mymaster 127.0.0.1 6379 2  # Quorum = 2
// sentinel down-after-milliseconds mymaster 5000
// sentinel failover-timeout mymaster 10000
// # Failover only happens if 2/3 sentinels agree primary is down

// Application level: verify you're talking to the actual primary
// Use Redis SENTINEL to discover the current primary (not a hardcoded host)

import { Redis } from 'ioredis'

const redis = new Redis({
  sentinels: [
    { host: 'sentinel-1', port: 26379 },
    { host: 'sentinel-2', port: 26379 },
    { host: 'sentinel-3', port: 26379 },
  ],
  name: 'mymaster',  // Sentinel discovers current primary
  sentinelRetryStrategy: (times) => Math.min(times * 100, 3000),
})

// Redis client automatically follows failover — connects to correct primary

Fix 2: Fencing Tokens — Prevent Stale Primaries from Writing

// Even after failover, old primary might still think it's primary
// Fencing tokens ensure old primary's writes are rejected

// Note: `etcd` and `storage` below stand in for real clients, and the etcd
// call is pseudocode — the point is the monotonically increasing token,
// not any particular client API.

class StaleLeaderError extends Error {}

class FencedStorage {
  private currentToken = 0

  // New primary must acquire a token (monotonically increasing)
  async acquireLeadership(): Promise<number> {
    // Use etcd/ZooKeeper for the distributed token
    // The token (epoch) increases with each election
    const token = await etcd.put('/lock/primary', {
      value: Date.now().toString(),
      lease: 10,  // 10 second lease — must renew
    })
    this.currentToken = token.epoch
    return this.currentToken
  }

  async write(key: string, value: any, fenceToken: number): Promise<void> {
    if (fenceToken < this.currentToken) {
      // This write came from an old (now demoted) primary
      throw new StaleLeaderError(
        `Write rejected: token ${fenceToken} < current ${this.currentToken}`
      )
    }
    this.currentToken = fenceToken  // remember the highest token seen
    await storage.set(key, value)
  }
}

// Old primary tries to write after being demoted:
// fenceToken=1 < currentToken=2 → write rejected
// New primary writes with fenceToken=2 → accepted
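To see the mechanism in isolation, here is a minimal in-memory sketch (no coordination service; names are illustrative). The store remembers the highest token it has seen and rejects writes carrying a lower one:

```typescript
// In-memory fencing check: writes from a demoted primary (lower token)
// are rejected; writes from the current leader go through.
class InMemoryFencedStore {
  private highestToken = 0
  private data = new Map<string, string>()

  write(key: string, value: string, token: number): boolean {
    if (token < this.highestToken) return false // stale primary, rejected
    this.highestToken = token
    this.data.set(key, value)
    return true
  }

  read(key: string): string | undefined {
    return this.data.get(key)
  }
}

const store = new InMemoryFencedStore()
store.write('balance', '100', 2)              // new primary, token 2
const stale = store.write('balance', '50', 1) // demoted primary, token 1
```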

Fix 3: PostgreSQL Replication Configuration

# postgresql.conf — prevent isolated primary from accepting writes
# When primary loses connection to ALL replicas, pause accepting writes

synchronous_standby_names = 'ANY 1 (replica1, replica2)'
# Write only succeeds when at least 1 replica confirms it

# If no replica is reachable, commits block waiting for acknowledgment,
# so the primary effectively stops accepting writes
# This prevents split brain at the cost of availability during a partition

# wal_level = replica (minimum for streaming replication)
# max_wal_senders = 10
# wal_keep_size = 1GB  # Keep enough WAL for replica to catch up

// Detect whether PostgreSQL is the primary or a replica before writing
import * as pg from 'pg'

async function ensurePrimary(client: pg.Client): Promise<void> {
  const result = await client.query('SELECT pg_is_in_recovery()')
  const isReplica = result.rows[0].pg_is_in_recovery

  if (isReplica) {
    throw new Error('Cannot write to replica — this is not the primary')
  }
}

// Write path always verifies it's talking to primary
async function writeWithPrimaryCheck(data: any) {
  await ensurePrimary(primaryClient)
  await primaryClient.query('INSERT INTO events...', data)
}

Fix 4: Detecting Split Brain in Redis

// Redis doesn't prevent split brain by default
// Use MIN-REPLICAS to prevent isolated primary from accepting writes

// redis.conf:
// min-replicas-to-write 1       # Primary requires >= 1 replica connected
// min-replicas-max-lag 10       # Replica must be within 10s of primary

// If primary loses all replicas, it refuses writes
// This means: during partition, isolated primary becomes read-only
// Prevents split brain at cost of write unavailability

// Also: use WAIT command to verify replication before responding
await redis.set('critical:data', value)
await redis.wait(1, 1000)  // Wait for 1 replica to confirm, timeout 1000ms
// If wait returns 0, your write might not be replicated
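That pattern can be wrapped in a helper that treats "0 replicas acknowledged" as a failure for critical keys. A sketch, with the client typed structurally so any ioredis-compatible client (or a stub) can be passed in:

```typescript
// Write a critical key, then require at least one replica to acknowledge
// it within 1 second; otherwise surface the risk to the caller.
interface ReplicatingClient {
  set(key: string, value: string): Promise<unknown>
  wait(numReplicas: number, timeoutMs: number): Promise<number>
}

async function setCritical(
  client: ReplicatingClient,
  key: string,
  value: string,
): Promise<void> {
  await client.set(key, value)
  const acked = await client.wait(1, 1000) // WAIT returns replicas that acked
  if (acked < 1) {
    throw new Error(`Write to ${key} was not confirmed by any replica`)
  }
}
```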

Fix 5: Application-Level Optimistic Concurrency

// Even if your infrastructure prevents split brain,
// design your application to detect and handle conflicts

interface DocumentWithVersion {
  id: string
  version: number  // e.g. a timestamp or monotonic counter
  data: any
}

class ConflictError extends Error {
  currentVersion!: number
  yourVersion!: number
  currentData: any
  yourData: any
  constructor(info: {
    message: string
    currentVersion: number
    yourVersion: number
    currentData: any
    yourData: any
  }) {
    super(info.message)
    Object.assign(this, info)
  }
}

class ConflictAwareStore {
  async save(doc: DocumentWithVersion): Promise<DocumentWithVersion> {
    const existing = await this.db.findById(doc.id)

    if (existing && existing.version > doc.version) {
      // Conflict: existing document is newer than what we're trying to save
      throw new ConflictError({
        message: 'Document has been modified since you last read it',
        currentVersion: existing.version,
        yourVersion: doc.version,
        currentData: existing.data,
        yourData: doc.data,
      })
    }

    return this.db.save({ ...doc, version: Date.now() })  // stamp a fresh version on every successful save
  }
}

// Let caller decide how to resolve conflict
async function updateWithConflictResolution(id: string, update: any) {
  try {
    // `update` must carry the version the caller originally read
    return await store.save({ id, ...update })
  } catch (err) {
    if (err instanceof ConflictError) {
      // Shallow merge where our fields win (simple, last-write-wins per field)
      const merged = { ...err.currentData, ...err.yourData }
      return store.save({ id, data: merged, version: err.currentVersion + 1 })

      // Or: reject and ask user to resolve (safest for critical data)
    }
    throw err
  }
}
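The same check-then-write can be collapsed into a single compare-and-set step. A minimal in-memory sketch (names illustrative); relational databases express the same idea as a conditional UPDATE guarded by the expected version:

```typescript
// Optimistic concurrency as compare-and-set: the write applies only if
// the caller's version matches the current one, and each successful
// write bumps the version.
class VersionedCell<T> {
  constructor(private value: T, private version = 0) {}

  read(): { value: T; version: number } {
    return { value: this.value, version: this.version }
  }

  compareAndSet(expectedVersion: number, next: T): boolean {
    if (expectedVersion !== this.version) return false // stale read, retry
    this.value = next
    this.version += 1
    return true
  }
}
```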

Split Brain Prevention Checklist

  • ✅ Odd number of nodes (3, 5) — enables majority quorum
  • ✅ Sentinel/Patroni/etcd for leader election with quorum
  • ✅ synchronous_standby_names to require replica acknowledgment
  • ✅ Fencing tokens to reject writes from demoted primaries
  • ✅ min-replicas-to-write in Redis for write safety
  • ✅ Application-level version columns for conflict detection
  • ✅ Monitor replication lag — alert if lag > threshold
  • ✅ Chaos testing: simulate network partitions at least annually

Conclusion

Split brain is rare but catastrophic — two primaries accepting conflicting writes is one of the hardest failure modes to recover from. Prevent it with quorum-based consensus: any operation requires acknowledgment from a majority of nodes, so neither partition can independently make progress. Use synchronous replication to require replica confirmation on critical writes. Implement fencing tokens to reject stale primary writes after failover. And design your application with version columns that detect conflicts before they corrupt your data. The right choice is almost always to sacrifice availability during a partition rather than risk data corruption.