Cache Stampede — When Your Cache Fix Breaks Everything

Introduction

You added Redis caching to protect your database. Traffic is smooth, response times are great. Then at exactly 3:00 AM — when the TTL on your most popular cache key expires — your database CPU spikes to 100%, latency shoots through the roof, and your on-call phone rings.

Welcome to cache stampede.

What is Cache Stampede?

Cache stampede (also called dog-piling) happens when:

  1. A heavily-used cache key expires (TTL hit)
  2. Many requests arrive simultaneously before anyone has repopulated it
  3. All of them find the cache empty — a cache miss
  4. All of them query the database at the same time
  5. The database gets overwhelmed, latency spikes, and everything breaks
TIME 03:00:00.000 — cache key "homepage:feed" expires

03:00:00.001  Request #1   hits cache → MISS → queries DB
03:00:00.002  Request #2   hits cache → MISS → queries DB
03:00:00.003  Request #3   hits cache → MISS → queries DB
...
03:00:00.100  Request #500 hits cache → MISS → queries DB

500 simultaneous DB queries for the same data 💥

Why It's So Dangerous

At 10 req/s, a 500ms DB query is fine. But with 500 simultaneous identical queries, connection pools exhaust, the query queue backs up, latency cascades to every other endpoint, and your system falls over — all from one expired cache key.
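To see the race concretely, here is a minimal in-memory sketch of the naive cache-aside pattern (a `Map` stands in for Redis, and `slowDbFetch` is a made-up stand-in for the real query). Every request that arrives before the first fetch completes sees an empty cache and hits the database:

```typescript
// Naive cache-aside with an in-memory Map — the race is identical to the
// Redis version: check cache, miss, query DB, populate.
const cache = new Map<string, unknown>()
let dbQueries = 0 // count how often the "database" is hit

async function slowDbFetch(): Promise<string> {
  dbQueries++
  await new Promise(r => setTimeout(r, 50)) // simulate a 50ms query
  return 'feed-data'
}

async function naiveGet(key: string): Promise<unknown> {
  if (cache.has(key)) return cache.get(key) // HIT
  const data = await slowDbFetch()          // MISS → query DB
  cache.set(key, data)
  return data
}

// 500 concurrent requests arrive while the key is empty: every single one
// misses (the cache isn't populated until the first fetch resolves) and
// queries the DB.
async function demo(): Promise<number> {
  await Promise.all(
    Array.from({ length: 500 }, () => naiveGet('homepage:feed'))
  )
  return dbQueries
}
```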

Fix 1: Mutex / Distributed Lock

Only one request is allowed to rebuild the cache. All others wait:

import Redis from 'ioredis'

const redis = new Redis()

async function getWithMutex<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  // 1. Try cache first
  const cached = await redis.get(key)
  if (cached) return JSON.parse(cached)

  const lockKey = `lock:${key}`
  const lockValue = `${Date.now()}-${Math.random()}`

  // 2. Try to acquire lock (SET NX EX = atomic)
  const acquired = await redis.set(lockKey, lockValue, 'EX', 10, 'NX')

  if (acquired) {
    try {
      // 3. Lock acquired — we fetch and populate
      const data = await fetchFn()
      await redis.set(key, JSON.stringify(data), 'EX', ttl)
      return data
    } finally {
      // 4. Release lock (only if we own it). Note: this GET-then-DEL check
      // is not atomic — a production version would use a Lua script or
      // Redis's compare-and-delete pattern
      const current = await redis.get(lockKey)
      if (current === lockValue) await redis.del(lockKey)
    }
  } else {
    // 5. Lock taken — wait briefly and retry (unbounded here; cap the
    // number of retries in production)
    await new Promise(r => setTimeout(r, 50))
    return getWithMutex(key, ttl, fetchFn)
  }
}

// Usage
const feed = await getWithMutex(
  'homepage:feed',
  300,
  () => db.posts.findLatest(20)
)

Downside: all waiting requests block until the lock holder finishes — and if the holder crashes, everyone stalls until the lock's 10-second TTL expires. Fine for most cases.
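The pattern can be exercised without a Redis server. In this sketch an in-memory `Map` stands in for Redis — JavaScript's single-threaded event loop makes the synchronous check-and-set atomic here, which is what `SET NX` provides across processes. It is an illustration of the technique, not the production code above:

```typescript
// In-process sketch of the mutex pattern: only the lock holder ever
// reaches the "database"; everyone else waits and retries.
const store = new Map<string, string>()
let dbCalls = 0

async function fetchFromDb(): Promise<string> {
  dbCalls++
  await new Promise(r => setTimeout(r, 20)) // simulate a 20ms query
  return 'data'
}

async function getWithMutexLocal(key: string): Promise<string> {
  const hit = store.get(key)
  if (hit !== undefined) return hit // cache HIT

  const lockKey = `lock:${key}`
  if (!store.has(lockKey)) {
    // "SET NX": acquire the lock (atomic here because this runs
    // synchronously on a single thread)
    store.set(lockKey, '1')
    try {
      const data = await fetchFromDb() // only the lock holder hits the DB
      store.set(key, data)
      return data
    } finally {
      store.delete(lockKey) // release the lock
    }
  }
  // Lock taken — wait briefly and retry
  await new Promise(r => setTimeout(r, 5))
  return getWithMutexLocal(key)
}
```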

Fix 2: Probabilistic Early Recomputation (XFetch)

Instead of waiting for expiry, proactively recompute the cache before it expires. Requests near the TTL deadline probabilistically decide to refresh early:

interface CacheEntry<T> {
  data: T
  delta: number   // Time it took to compute (ms)
  expiry: number  // Unix timestamp in ms
}

async function xfetch<T>(
  key: string,
  ttl: number,
  fetchFn: () => Promise<T>,
  beta: number = 1    // Higher = more aggressive early refresh
): Promise<T> {
  const raw = await redis.get(key)

  if (raw) {
    const entry: CacheEntry<T> = JSON.parse(raw)
    const now = Date.now()
    const timeToExpiry = entry.expiry - now

    // XFetch formula: Math.log(Math.random()) is negative, so the left
    // side only dips below zero when timeToExpiry is small (or the entry
    // has already expired) — the closer to expiry, the likelier a refresh
    const shouldRecompute =
      timeToExpiry + beta * entry.delta * Math.log(Math.random()) < 0

    if (!shouldRecompute) return entry.data
  }

  // Recompute
  const start = Date.now()
  const data = await fetchFn()
  const delta = Date.now() - start

  const entry: CacheEntry<T> = {
    data,
    delta,
    expiry: Date.now() + ttl * 1000,
  }

  await redis.set(key, JSON.stringify(entry), 'EX', ttl)
  return data
}

Advantage: No blocking, no lock contention. Requests naturally spread out the recomputation.
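A useful consequence of the XFetch rule: the condition triggers exactly when `Math.random() < exp(-timeToExpiry / (beta * delta))`, so a single request recomputes with that probability. This helper (illustrative, not part of the cache code — `recomputeProbability` is a name introduced here) makes the refresh schedule visible:

```typescript
// Probability that one request triggers an early recompute under XFetch.
// timeToExpiry and delta share units (ms here); beta as in xfetch above.
function recomputeProbability(
  timeToExpiry: number,
  delta: number,
  beta: number = 1
): number {
  if (timeToExpiry <= 0) return 1 // already expired — always recompute
  return Math.exp(-timeToExpiry / (beta * delta))
}

// With a 500ms query (delta), the chance of an early refresh climbs as
// expiry approaches:
recomputeProbability(60_000, 500) // a minute out → essentially 0
recomputeProbability(1_000, 500)  // 1s out → ≈ 0.135
recomputeProbability(250, 500)    // 250ms out → ≈ 0.607
```

Raising `beta` shifts the whole curve earlier, trading a few extra recomputes for a lower chance of ever serving through an expiry.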

Fix 3: Stale-While-Revalidate

Serve stale data immediately while refreshing in the background:

interface SWREntry<T> {
  data: T
  cachedAt: number
  staleAfter: number   // Serve fresh up to this age (ms)
  deleteAfter: number  // Hard-delete after this age (ms)
}

async function getStaleWhileRevalidate<T>(
  key: string,
  freshTTL: number,   // e.g. 60s
  staleTTL: number,   // e.g. 300s
  fetchFn: () => Promise<T>
): Promise<T> {
  const raw = await redis.get(key)
  const now = Date.now()

  if (raw) {
    const entry: SWREntry<T> = JSON.parse(raw)
    const age = now - entry.cachedAt

    if (age < entry.staleAfter) {
      // Fresh — return immediately
      return entry.data
    }

    if (age < entry.deleteAfter) {
      // Stale — return immediately AND refresh in background
      refreshInBackground(key, freshTTL, staleTTL, fetchFn)
      return entry.data  // No wait!
    }
  }

  // Expired — must fetch synchronously
  return await refresh(key, freshTTL, staleTTL, fetchFn)
}

async function refreshInBackground<T>(
  key: string,
  freshTTL: number,
  staleTTL: number,
  fetchFn: () => Promise<T>
) {
  // Don't await — fire and forget. (Several stale hits can each trigger a
  // refresh; pair this with the mutex from Fix 1 if that matters.)
  refresh(key, freshTTL, staleTTL, fetchFn).catch(console.error)
}

async function refresh<T>(
  key: string,
  freshTTL: number,
  staleTTL: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  const data = await fetchFn()
  const entry: SWREntry<T> = {
    data,
    cachedAt: Date.now(),
    staleAfter: freshTTL * 1000,
    deleteAfter: staleTTL * 1000,
  }
  await redis.set(key, JSON.stringify(entry), 'EX', staleTTL)
  return data
}
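The branching above reduces to three states driven by the entry's age. Pulling that decision into a pure helper (an illustrative sketch with hypothetical names, not part of the cache code) makes the thresholds easy to unit-test:

```typescript
// The three SWR outcomes, as a pure function of the entry's age
type SWRAction = 'serve-fresh' | 'serve-stale-and-refresh' | 'fetch-sync'

function classify(
  ageMs: number,
  staleAfterMs: number,
  deleteAfterMs: number
): SWRAction {
  if (ageMs < staleAfterMs) return 'serve-fresh'              // young enough: no work
  if (ageMs < deleteAfterMs) return 'serve-stale-and-refresh' // stale: refresh in background
  return 'fetch-sync'                                         // too old: caller must wait
}

// With freshTTL = 60s and staleTTL = 300s:
classify(30_000, 60_000, 300_000)  // 'serve-fresh'
classify(120_000, 60_000, 300_000) // 'serve-stale-and-refresh'
classify(400_000, 60_000, 300_000) // 'fetch-sync'
```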

Fix 4: Jitter on TTL

The simplest fix — add random jitter to TTLs so keys don't all expire simultaneously:

function setWithJitter(key: string, data: any, baseTTL: number) {
  // Add ±10% random jitter
  const jitter = baseTTL * 0.1 * (Math.random() * 2 - 1)
  const ttl = Math.floor(baseTTL + jitter)
  return redis.set(key, JSON.stringify(data), 'EX', ttl)
}

// Keys expire at different times — no synchronized stampede
await setWithJitter('user:1:profile', user1Data, 300)
await setWithJitter('user:2:profile', user2Data, 300)
await setWithJitter('user:3:profile', user3Data, 300)
// Key 1 → expires at e.g. 285s
// Key 2 → expires at e.g. 318s
// Key 3 → expires at e.g. 302s

This is the quickest fix if you're already seeing stampedes.
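Isolating the jitter math makes the spread easy to verify: for a 300s base TTL and ±10% jitter, every computed TTL lands in [270, 330]. A sketch — `jitteredTTL` is a hypothetical helper equivalent to the computation inside `setWithJitter`:

```typescript
// Same jitter computation as setWithJitter, returned instead of applied
function jitteredTTL(baseTTL: number, spread: number = 0.1): number {
  // (Math.random() * 2 - 1) is uniform in [-1, 1), so jitter is ±spread
  const jitter = baseTTL * spread * (Math.random() * 2 - 1)
  return Math.floor(baseTTL + jitter)
}
```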

Combining the Strategies

In production, use multiple layers:

const cacheService = {
  async get<T>(key: string, ttl: number, fetchFn: () => Promise<T>): Promise<T> {
    // Layer 1: SWR for zero-wait serving
    // Layer 2: Mutex so only one process runs the DB query on refresh
    // Layer 3: jitter the TTLs when setting (see Fix 4)
    // A separate inner key keeps the mutex layer's raw-JSON cache entry
    // from colliding with the SWR entry stored under `key`
    return getStaleWhileRevalidate(key, ttl, ttl * 5, () =>
      getWithMutex(`${key}:inner`, ttl, fetchFn)
    )
  }
}

Monitoring for Stampedes

// Track cache miss rate — sudden spikes signal stampedes
const cacheMetrics = {
  hits: 0,
  misses: 0,

  recordHit() { this.hits++ },
  recordMiss() { this.misses++ },

  hitRate() {
    const total = this.hits + this.misses
    return total === 0 ? 1 : this.hits / total
  }
}

// Alert if hit rate drops below 80%
if (cacheMetrics.hitRate() < 0.8) {
  logger.alert('Cache hit rate dropped — possible stampede')
}
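The alert condition can likewise be lifted into a pure, testable predicate (a sketch with a hypothetical name; `logger` above is assumed to be whatever alerting client you use):

```typescript
// True when the observed hit rate falls below the alert threshold
function shouldAlert(
  hits: number,
  misses: number,
  threshold: number = 0.8
): boolean {
  const total = hits + misses
  const hitRate = total === 0 ? 1 : hits / total // no traffic counts as healthy
  return hitRate < threshold
}

shouldAlert(79, 21) // true  — 79% hit rate, below the 80% threshold
shouldAlert(90, 10) // false — 90% hit rate
shouldAlert(0, 0)   // false — no samples yet
```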

Conclusion

Cache stampede is insidious because it happens at your most critical moments — high traffic, popular content, right after a deployment. The fixes range from simple (add TTL jitter) to robust (SWR + mutex). For any cache key hit more than 100 times/second, implement at least one of these strategies. Your database will thank you.