Leader Election Gone Wrong — When Two Nodes Both Think They're in Charge

Author: Sanjeev Sharma (@webcoderspeed1)

Introduction
Leader election sounds simple: pick one node to be in charge, have others stand by. The hard part is handling the moment when the leader goes quiet — is it dead, or just slow? If you declare a new leader too early, you have two leaders. If you wait too long, work stops. Most homegrown election implementations get this wrong in production.
- The Problem
- Fix 1: Redis Lease with Strict Expiry
- Fix 2: Fencing Tokens
- Fix 3: Kubernetes Leader Election (Production-Ready)
- Fix 4: etcd for Strong Consistency
- The Lease Duration Trade-off
- Leader Election Checklist
- Conclusion
The Problem
Consider a 3-node cluster where Node 1 has been elected leader:
T=0:00 Node 1 is leader, processing job queue
T=0:10 Network partition: Node 1 can't reach Nodes 2 and 3
T=0:15 Nodes 2 and 3 declare Node 1 dead — elect Node 2 as leader
T=0:15 Node 1 still thinks it's leader (it can reach the database!)
T=0:20 Both Node 1 AND Node 2 pull jobs from queue
T=0:20 Same invoice processed twice → customer charged twice
T=0:25 Network heals. Now what?
The core problem is that "is the leader healthy?" is a distributed question with no perfect answer. Every election algorithm is a trade-off between availability and consistency.
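To make the failure concrete, here is a hypothetical in-memory sketch (no real network, queue, or billing system; all names are invented for illustration): two nodes that both believe they are leader drain the same queue, and the same invoice gets processed twice.

```typescript
type Job = { id: string }

// Shared state standing in for the job queue and the processing ledger
const queue: Job[] = [{ id: 'invoice-1' }, { id: 'invoice-2' }]
const processed: string[] = []

// A naive poller: snapshot the pending jobs, then process everything seen.
// There is no lease check and no fencing between snapshot and processing
function naiveLeaderTick(nodeName: string): void {
  const seen = [...queue] // both nodes observe the same pending jobs
  for (const job of seen) {
    processed.push(`${nodeName}:${job.id}`)
  }
}

// During the partition (T=0:20 in the timeline above), both nodes tick
naiveLeaderTick('node-1')
naiveLeaderTick('node-2')

const invoice1Charges = processed.filter((p) => p.endsWith(':invoice-1')).length
console.log(`invoice-1 processed ${invoice1Charges} times`) // prints "invoice-1 processed 2 times"
```

Every fix below attacks one of the two gaps this sketch exposes: the missing lease check, or the missing per-write fence.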
Fix 1: Redis Lease with Strict Expiry
The simplest correct approach: a leader holds a short-lived Redis key. If it can't renew the key, it stops working — even if it "feels" healthy.
import { Redis } from 'ioredis'

class LeaderLease {
  private isLeader = false
  private leaseKey: string
  private leaseTTL: number // milliseconds
  private renewTimer: NodeJS.Timeout | null = null

  constructor(
    private redis: Redis,
    private instanceId: string,
    options: { name: string; ttlMs: number } = { name: 'leader', ttlMs: 15_000 }
  ) {
    this.leaseKey = `leader:${options.name}`
    this.leaseTTL = options.ttlMs
  }

  async start(): Promise<void> {
    await this.tryAcquire()
    // Renew at 1/3 of TTL to give two renewal attempts before expiry
    this.renewTimer = setInterval(() => {
      // If Redis is unreachable, the eval below rejects. Treat that as lost
      // leadership: a rejected promise must never leave isLeader stuck at true
      this.renew().catch(async () => {
        if (this.isLeader) {
          this.isLeader = false
          await this.onLostLeadership()
        }
      })
    }, this.leaseTTL / 3)
  }

  private async tryAcquire(): Promise<void> {
    const acquired = await this.redis.set(
      this.leaseKey,
      this.instanceId,
      'PX', // TTL in milliseconds
      this.leaseTTL,
      'NX' // only set if key doesn't exist
    )
    if (acquired) {
      this.isLeader = true
      console.log(`[Leader] ${this.instanceId} acquired lease`)
      await this.onBecameLeader()
    }
  }

  private async renew(): Promise<void> {
    if (!this.isLeader) {
      // Not leader — try to acquire
      await this.tryAcquire()
      return
    }
    // Renew only if we still own the lease (Lua script for atomicity)
    const renewed = await this.redis.eval(
      `
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
      else
        return 0
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )
    if (renewed === 0) {
      // Lost the lease (Redis restart? Eviction? Another node won?)
      console.warn(`[Leader] ${this.instanceId} lost lease — stepping down`)
      this.isLeader = false
      await this.onLostLeadership()
    }
  }

  getIsLeader(): boolean {
    return this.isLeader
  }

  // Override these in subclasses or pass as options
  protected async onBecameLeader(): Promise<void> {}
  protected async onLostLeadership(): Promise<void> {}

  stop(): void {
    if (this.renewTimer) clearInterval(this.renewTimer)
    this.isLeader = false
  }
}
The critical safety rule: a node must stop doing leader work the moment it can't renew its lease, even if the local process is healthy. Redis being unreachable = no longer leader.
// Note: db and processJob are stand-ins for your application's job store
// and worker function
class SafeLeader extends LeaderLease {
  private jobTimer: NodeJS.Timeout | null = null

  protected async onBecameLeader() {
    // Start doing leader-only work
    this.jobTimer = setInterval(() => this.processJobQueue(), 10_000)
  }

  protected async onLostLeadership() {
    // STOP doing leader-only work immediately
    if (this.jobTimer) clearInterval(this.jobTimer)
    this.jobTimer = null
  }

  private async processJobQueue() {
    // Double-check at the start of every critical operation
    if (!this.getIsLeader()) return
    const jobs = await db.job.findPending()
    for (const job of jobs) {
      // Check again before each job — lease could expire mid-loop
      if (!this.getIsLeader()) break
      await processJob(job)
    }
  }
}
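The mid-loop check matters because leadership can vanish between jobs. A minimal, dependency-free sketch of that pattern, with a plain boolean standing in for the lease state that getIsLeader() would consult:

```typescript
let leader = true // stands in for getIsLeader()
const done: number[] = []

function processJobs(jobs: number[]): void {
  for (const job of jobs) {
    if (!leader) break // the per-job check from processJobQueue()
    done.push(job)
    if (job === 2) leader = false // simulate losing the lease mid-loop
  }
}

processJobs([1, 2, 3, 4])
console.log(done) // prints [ 1, 2 ]; jobs 3 and 4 are never touched by the stale leader
```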
Fix 2: Fencing Tokens
Even with lease renewal, there's a window where an old leader might be mid-operation when it loses the lease. A fencing token prevents it from completing that operation.
// Fencing: every lease acquisition returns a monotonically increasing token
// The token must be presented when writing — database rejects stale tokens
class FencedLeaderLease {
  private fencingToken = 0

  constructor(
    private redis: Redis,
    private instanceId: string,
    private leaseKey: string,
    private leaseTTL: number // milliseconds
  ) {}

  async acquire(): Promise<number | null> {
    // Lua: increment counter and set key atomically
    const result = await this.redis.eval(
      `
      local token = redis.call('incr', KEYS[1] .. ':token')
      local set = redis.call('set', KEYS[1], ARGV[1], 'PX', ARGV[2], 'NX')
      if set then
        redis.call('set', KEYS[1] .. ':token:' .. ARGV[1], token)
        return token
      else
        return nil
      end
      `,
      1,
      this.leaseKey,
      this.instanceId,
      String(this.leaseTTL)
    )
    if (result) {
      this.fencingToken = result as number
      return this.fencingToken
    }
    return null
  }

  getToken(): number {
    return this.fencingToken
  }
}
// In the job processor — include fencing token in every write
async function processJobWithFencing(jobId: string, token: number) {
  // Database enforces: only accept writes if token > last seen token
  await db.query(
    `
    UPDATE jobs
    SET status = 'complete', processed_by_token = $1
    WHERE id = $2
      AND (processed_by_token IS NULL OR processed_by_token < $1)
    `,
    [token, jobId]
  )
  // If 0 rows affected → stale leader, another leader already processed this
}
Fencing tokens turn "maybe processed twice" into "definitely processed once." Even if the old leader wins a race to the database, the write is rejected because its token is lower than the new leader's.
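The check the database performs can be sketched with an in-memory store (a hypothetical class, assuming the same rule as the SQL above: accept a write only when its token is strictly greater than the last one seen):

```typescript
class FencedStore {
  private lastToken = new Map<string, number>()

  // Accept the write only if the token beats the highest token seen so far
  write(jobId: string, token: number): boolean {
    const seen = this.lastToken.get(jobId) ?? 0
    if (token <= seen) return false // stale leader: reject
    this.lastToken.set(jobId, token)
    return true
  }
}

const store = new FencedStore()
const newLeaderWrite = store.write('invoice-1', 2) // new leader wins the race
const oldLeaderWrite = store.write('invoice-1', 1) // stale leader arrives late

console.log(newLeaderWrite, oldLeaderWrite) // prints "true false"
```

Note that the order of arrival does not matter: even if the stale leader's write landed first, the new leader's higher token would still win, and the stale retry would be rejected.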
Fix 3: Kubernetes Leader Election (Production-Ready)
For Kubernetes deployments, use the built-in leader election via Lease objects — no Redis required:
import * as k8s from '@kubernetes/client-node'

async function runWithKubernetesLeaderElection(onLeader: () => void) {
  const kc = new k8s.KubeConfig()
  kc.loadFromDefault()
  const coordinationClient = kc.makeApiClient(k8s.CoordinationV1Api)

  const leaseName = 'my-service-leader'
  const namespace = process.env.POD_NAMESPACE ?? 'default'
  const identity = process.env.HOSTNAME ?? 'unknown'

  const le = new k8s.LeaderElection(coordinationClient)
  await le.run({
    leaseName,
    namespace,
    identity,
    leaseDuration: 15, // seconds
    renewDeadline: 10, // must renew within 10s or step down
    retryPeriod: 2, // retry every 2s
    onStartedLeading: () => {
      console.log(`${identity} became leader`)
      onLeader()
    },
    onStoppedLeading: () => {
      console.log(`${identity} lost leadership — exiting`)
      process.exit(1) // Let Kubernetes restart the pod
    },
    onNewLeader: (newLeader: string) => {
      console.log(`New leader elected: ${newLeader}`)
    },
  })
}
The Kubernetes Lease resource replaces Redis. The client library drives the renewal protocol against the API server, and because renewDeadline is shorter than leaseDuration, a leader that cannot renew steps down before its lease can be claimed by another pod.
Fix 4: etcd for Strong Consistency
If you need linearizable leader election (a stronger guarantee than a Redis lease, which can be lost or handed out inconsistently during failover, restarts, or partitions), use etcd:
import { Etcd3 } from 'etcd3'

const etcd = new Etcd3({ hosts: 'etcd:2379' })

function runAsLeader() {
  const election = etcd.election('my-service-leader')

  // campaign() enters the election; the 'elected' event fires once this
  // instance becomes leader. Uses etcd's Raft consensus — linearizable
  const campaign = election.campaign(process.env.HOSTNAME ?? 'unknown')

  campaign.on('elected', () => {
    console.log('Became leader!')
    // When stepping down is needed: await campaign.resign()
  })

  campaign.on('error', (err) => {
    console.error('Lost election or error:', err)
    process.exit(1)
  })
}

// Any node (leader or not) can watch for leadership changes
async function watchLeader() {
  const observer = await etcd.election('my-service-leader').observe()
  observer.on('change', (leader) => {
    console.log('Current leader:', leader)
  })
}

runAsLeader()
etcd elections go through the Raft consensus algorithm, so the election state itself is linearizable: two nodes cannot both hold the election key at the same time. A deposed leader can still act briefly on stale knowledge of its own role, which is why fencing remains worthwhile, but there is no Redis-style window where the lock itself is ambiguous during a partition.
The Lease Duration Trade-off
Short lease (5 seconds):
✅ Fast failover — new leader elected in ~5 seconds
❌ More renewal network traffic
❌ Noisy — brief Redis hiccup causes unnecessary leader change
❌ Old leader gets 5 seconds where it might do stale work
Long lease (60 seconds):
✅ Robust to transient network issues
✅ Less renewal traffic
❌ Slow failover — 60 seconds of downtime if leader dies
❌ Old leader has 60 seconds to do stale work (unless using fencing)
A good default for most services:
- leaseTTL: 15 seconds
- renewInterval: 5 seconds (renew at 1/3 TTL)
- failover time: ~15 seconds (one full TTL expiry)
- Add fencing tokens if even 15 seconds of split-brain is unacceptable
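Those defaults can be encoded as a small helper (a sketch; the 1/3 ratio and the one-second floor are conventions from this article, not requirements of any library):

```typescript
interface LeaseTiming {
  ttlMs: number
  renewIntervalMs: number
  maxFailoverMs: number
}

function leaseTiming(ttlMs: number): LeaseTiming {
  const renewIntervalMs = Math.floor(ttlMs / 3) // renew at 1/3 of TTL
  if (renewIntervalMs < 1_000) {
    throw new Error('TTL too short: renewals would hammer the lock store')
  }
  return {
    ttlMs,
    renewIntervalMs,
    maxFailoverMs: ttlMs, // worst case: wait for the old lease to fully expire
  }
}

console.log(leaseTiming(15_000)) // prints { ttlMs: 15000, renewIntervalMs: 5000, maxFailoverMs: 15000 }
```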
Leader Election Checklist
| Risk | Solution |
|---|---|
| Two leaders simultaneously | Fencing tokens + lease per-operation check |
| Leader dies, no failover | TTL on lease — auto-expires and enables new election |
| Redis restart → all leases lost | Handle onLostLeadership, graceful shutdown |
| Leader can't renew but is healthy | Stop all leader work when renewal fails |
| Split-brain during partition | etcd or K8s Lease (Raft-based, linearizable) |
| Stale leader finishes long operation | Fencing token rejected by database |
Conclusion
Leader election bugs are silent in development (single node, no partitions) and catastrophic in production (duplicate billing, corrupted state, split-brain). The minimum viable safe implementation has three properties: a short-lived lease that auto-expires, strict step-down when renewal fails (even if the node is otherwise healthy), and fencing tokens on any writes so stale leaders can't corrupt state. For Kubernetes deployments, use the built-in Lease resource rather than rolling your own. For stronger guarantees, use etcd's Raft-based election. Whatever approach you choose, test the network-partition scenario — that's where elections go wrong.