Cache Stampede — When Your Cache Fix Breaks Everything
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
You added Redis caching to protect your database. Traffic is smooth, response times are great. Then at exactly 3:00 AM — when the TTL on your most popular cache key expires — your database CPU spikes to 100%, latency shoots through the roof, and your on-call phone rings.
Welcome to cache stampede.
- What is Cache Stampede?
- Why It's So Dangerous
- Fix 1: Mutex / Distributed Lock
- Fix 2: Probabilistic Early Recomputation (XFetch)
- Fix 3: Stale-While-Revalidate
- Fix 4: Jitter on TTL
- Combining the Strategies
- Monitoring for Stampedes
- Conclusion
What is Cache Stampede?
Cache stampede (also called dog-piling) happens when:
- A heavily-used cache key expires (TTL hit)
- Many requests arrive simultaneously before anyone has repopulated it
- All of them find the cache empty — a cache miss
- All of them query the database at the same time
- The database gets overwhelmed, latency spikes, and everything breaks
TIME 03:00:00.000 — cache key "homepage:feed" expires
03:00:00.001 → Request #1 hits cache → MISS → queries DB
03:00:00.002 → Request #2 hits cache → MISS → queries DB
03:00:00.003 → Request #3 hits cache → MISS → queries DB
...
03:00:00.100 → Request #500 hits cache → MISS → queries DB
500 simultaneous DB queries for the same data 💥
Why It's So Dangerous
At 10 req/s, a 500ms DB query is fine. But at 500 simultaneous identical queries, connection pools exhaust, query queue backs up, latency cascades to every other endpoint, and your system falls over — all from one expired cache key.
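The arithmetic makes the cascade concrete. A quick back-of-the-envelope sketch, assuming a pool of 10 connections and a 500 ms query (illustrative numbers, not measurements):

```typescript
// 500 concurrent identical queries against a pool of 10 connections,
// each query taking 500 ms. Queries drain in waves of `poolSize`.
const concurrentQueries = 500
const poolSize = 10
const queryMs = 500

const waves = Math.ceil(concurrentQueries / poolSize) // 50 waves
const worstCaseLatencyMs = waves * queryMs            // 25,000 ms

console.log(`${waves} waves, worst-case latency ~${worstCaseLatencyMs / 1000}s`)
```

Twenty-five seconds of queueing for a query that normally takes half a second, and every request sharing that pool pays the price.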
Fix 1: Mutex / Distributed Lock
Only one request is allowed to rebuild the cache. All others wait:
import Redis from 'ioredis'
const redis = new Redis()
async function getWithMutex<T>(
key: string,
ttl: number,
fetchFn: () => Promise<T>
): Promise<T> {
// 1. Try cache first
const cached = await redis.get(key)
if (cached) return JSON.parse(cached)
const lockKey = `lock:${key}`
const lockValue = `${Date.now()}-${Math.random()}`
// 2. Try to acquire lock (SET NX EX = atomic)
const acquired = await redis.set(lockKey, lockValue, 'EX', 10, 'NX')
if (acquired) {
try {
// 3. Lock acquired — we fetch and populate
const data = await fetchFn()
await redis.set(key, JSON.stringify(data), 'EX', ttl)
return data
} finally {
// 4. Release lock only if we still own it
// (GET + DEL is check-then-act, not atomic; a Lua script makes this safe)
const current = await redis.get(lockKey)
if (current === lockValue) await redis.del(lockKey)
}
} else {
// 5. Lock taken — wait and retry
await new Promise(r => setTimeout(r, 50))
return getWithMutex(key, ttl, fetchFn) // Retry
}
}
// Usage
const feed = await getWithMutex(
'homepage:feed',
300,
() => db.posts.findLatest(20)
)
Downside: all waiting requests block (here, polling every 50 ms) until the lock holder finishes. Fine for most cases.
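A complementary, in-process variant of the same idea is request coalescing: within one Node process, concurrent callers for the same key share a single in-flight promise instead of each taking the Redis round-trip. A minimal sketch, using a plain in-memory map (no Redis; names are illustrative):

```typescript
// In-process request coalescing: concurrent callers for the same key
// share a single in-flight promise instead of each hitting the source.
const inFlight = new Map<string, Promise<unknown>>()

async function singleFlight<T>(key: string, fetchFn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key)
  if (existing) return existing as Promise<T> // join the fetch already in flight

  const p = fetchFn().finally(() => inFlight.delete(key))
  inFlight.set(key, p)
  return p
}

// All three callers share ONE call to the fetcher
let calls = 0
const fetchOnce = () =>
  new Promise<number>(r => { calls++; setTimeout(() => r(42), 10) })
Promise.all([
  singleFlight('feed', fetchOnce),
  singleFlight('feed', fetchOnce),
  singleFlight('feed', fetchOnce),
]).then(results => console.log(results, `fetcher ran ${calls} time(s)`))
// all three resolve to 42 from a single underlying fetch
```

This only dedupes within one process; across a fleet of servers you still want the distributed lock above.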
Fix 2: Probabilistic Early Recomputation (XFetch)
Instead of waiting for expiry, proactively recompute the cache before it expires. Requests near the TTL deadline probabilistically decide to refresh early:
interface CacheEntry<T> {
data: T
delta: number // Time it took to compute (ms)
expiry: number // Unix timestamp in ms
}
async function xfetch<T>(
key: string,
ttl: number,
fetchFn: () => Promise<T>,
beta: number = 1 // Higher = more aggressive early refresh
): Promise<T> {
const raw = await redis.get(key)
if (raw) {
const entry: CacheEntry<T> = JSON.parse(raw)
const now = Date.now()
const timeToExpiry = entry.expiry - now
// XFetch formula: Math.log(Math.random()) is negative, so this check
// can fire shortly before expiry, with probability rising as expiry nears
const shouldRecompute =
timeToExpiry + beta * entry.delta * Math.log(Math.random()) < 0
if (!shouldRecompute) return entry.data
}
// Recompute
const start = Date.now()
const data = await fetchFn()
const delta = Date.now() - start
const entry: CacheEntry<T> = {
data,
delta,
expiry: Date.now() + ttl * 1000,
}
await redis.set(key, JSON.stringify(entry), 'EX', ttl)
return data
}
Advantage: No blocking, no lock contention. Requests naturally spread out the recomputation.
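To get a feel for the check `timeToExpiry + beta * delta * Math.log(Math.random()) < 0`, a quick simulation of how the early-refresh probability rises as expiry approaches (delta = 100 ms and beta = 1 are illustrative values):

```typescript
// Estimate P(recompute) at various times-to-expiry for the XFetch check.
// Theoretically P = e^(-timeToExpiry / (beta * delta)).
const delta = 100 // ms the recomputation takes
const beta = 1

function recomputeProbability(timeToExpiryMs: number, trials = 100_000): number {
  let hits = 0
  for (let i = 0; i < trials; i++) {
    if (timeToExpiryMs + beta * delta * Math.log(Math.random()) < 0) hits++
  }
  return hits / trials
}

for (const t of [500, 200, 100, 50, 10]) {
  console.log(`${t}ms to expiry -> ~${(recomputeProbability(t) * 100).toFixed(1)}% refresh`)
}
```

Far from expiry almost nobody refreshes; close to expiry almost everybody would, but the first refresher resets the clock, so in practice only one request pays the cost.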
Fix 3: Stale-While-Revalidate
Serve stale data immediately while refreshing in the background:
interface SWREntry<T> {
data: T
cachedAt: number
staleAfter: number // Serve fresh up to this age (ms)
deleteAfter: number // Hard-delete after this age (ms)
}
async function getStaleWhileRevalidate<T>(
key: string,
freshTTL: number, // e.g. 60s
staleTTL: number, // e.g. 300s
fetchFn: () => Promise<T>
): Promise<T> {
const raw = await redis.get(key)
const now = Date.now()
if (raw) {
const entry: SWREntry<T> = JSON.parse(raw)
const age = now - entry.cachedAt
if (age < entry.staleAfter) {
// Fresh — return immediately
return entry.data
}
if (age < entry.deleteAfter) {
// Stale — return immediately AND refresh in background
refreshInBackground(key, freshTTL, staleTTL, fetchFn)
return entry.data // No wait!
}
}
// Expired — must fetch synchronously
return await refresh(key, freshTTL, staleTTL, fetchFn)
}
async function refreshInBackground<T>(
key: string,
freshTTL: number,
staleTTL: number,
fetchFn: () => Promise<T>
) {
// Fire and forget, not awaited. Note: several stale hits can still kick
// off parallel background refreshes; wrap fetchFn in the Fix 1 mutex if
// the backing query is expensive.
refresh(key, freshTTL, staleTTL, fetchFn).catch(console.error)
}
async function refresh<T>(
key: string,
freshTTL: number,
staleTTL: number,
fetchFn: () => Promise<T>
): Promise<T> {
const data = await fetchFn()
const entry: SWREntry<T> = {
data,
cachedAt: Date.now(),
staleAfter: freshTTL * 1000,
deleteAfter: staleTTL * 1000,
}
await redis.set(key, JSON.stringify(entry), 'EX', staleTTL)
return data
}
Fix 4: Jitter on TTL
The simplest fix — add random jitter to TTLs so keys don't all expire simultaneously:
function setWithJitter(key: string, data: any, baseTTL: number) {
// Add ±10% random jitter
const jitter = baseTTL * 0.1 * (Math.random() * 2 - 1)
const ttl = Math.floor(baseTTL + jitter)
return redis.set(key, JSON.stringify(data), 'EX', ttl)
}
// Keys expire at different times — no synchronized stampede
await setWithJitter('user:1:profile', user1Data, 300)
await setWithJitter('user:2:profile', user2Data, 300)
await setWithJitter('user:3:profile', user3Data, 300)
// Key 1 → expires at ~285s
// Key 2 → expires at ~318s
// Key 3 → expires at ~302s
This is the quickest fix if you're already seeing stampedes.
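The ±10% bound is easy to verify: for a 300-second base TTL, every jittered value lands in [270, 330]. A quick self-contained check of the same jitter computation (pure arithmetic, no Redis):

```typescript
// ±10% jitter on a base TTL: the result always stays within 0.9x..1.1x base.
function jitteredTTL(baseTTL: number): number {
  const jitter = baseTTL * 0.1 * (Math.random() * 2 - 1)
  return Math.floor(baseTTL + jitter)
}

const samples = Array.from({ length: 1000 }, () => jitteredTTL(300))
console.log(Math.min(...samples), Math.max(...samples)) // both within 270..330
```

Ten percent is a reasonable default; widen the band if many keys are written in the same burst (e.g. after a deploy that warms the cache).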
Combining the Strategies
In production, use multiple layers:
const cacheService = {
async get<T>(key: string, ttl: number, fetchFn: () => Promise<T>): Promise<T> {
// Layer 1: SWR serves (possibly stale) data with zero wait
// Layer 2: mutex ensures only one refresh reaches the DB at a time
// Layer 3: pass a jittered ttl (see Fix 4) so expiries spread out
// Note: getWithMutex also caches under its own key; the ':lock' suffix
// keeps that entry from clobbering the SWR entry stored under `key`
return getStaleWhileRevalidate(key, ttl, ttl * 5, () =>
getWithMutex(key + ':lock', ttl, fetchFn)
)
}
}
Monitoring for Stampedes
// Track cache miss rate — sudden spikes signal stampedes
const cacheMetrics = {
hits: 0,
misses: 0,
recordHit() { this.hits++ },
recordMiss() { this.misses++ },
hitRate() {
const total = this.hits + this.misses
return total === 0 ? 1 : this.hits / total
}
}
// Alert if hit rate drops below 80%
if (cacheMetrics.hitRate() < 0.8) {
logger.alert('Cache hit rate dropped — possible stampede')
}
Conclusion
Cache stampede is insidious because it happens at your most critical moments — high traffic, popular content, right after a deployment. The fixes range from simple (add TTL jitter) to robust (SWR + mutex). For any cache key hit more than 100 times/second, implement at least one of these strategies. Your database will thank you.