Circuit Breaker Not Triggering — When Your Safety Net Has Holes

Introduction

You implemented a circuit breaker. The downstream payment service is failing 80% of its requests, but the circuit is still CLOSED and your service keeps hammering the failing dependency. The "safety net" is doing nothing.

Circuit breakers are one of the most misconfigured reliability patterns in distributed systems.

Why Circuit Breakers Fail to Open

Problem 1: Threshold Too High

// ❌ Circuit can't open until it has seen 50 requests with at least half of them failing: too slow
const breaker = new CircuitBreaker(fn, {
  errorThresholdPercentage: 50,
  volumeThreshold: 50,  // Need 50 requests to even evaluate
  resetTimeout: 30_000,
})

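// Reality: the breaker can't evaluate until 50 requests have gone through
// At 100 req/s, that's at least 500ms of cascading failures before it can open
// At low traffic it is worse: opossum counts volume per rolling window (10s by default),
// so below 5 req/s the window never holds 50 requests and the breaker never opens

// ✅ Lower thresholds, smaller windows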
const breaker = new CircuitBreaker(fn, {
  errorThresholdPercentage: 25,  // Open after 25% failure rate
  volumeThreshold: 10,            // Evaluate after just 10 requests
  timeout: 2000,                  // Count requests taking >2s as failures
  resetTimeout: 15_000,           // Try half-open sooner
})
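
The 500ms figure above is just the volume threshold divided by the request rate. As a rough sketch of that arithmetic, the helper below (a hypothetical function, not part of opossum) estimates the minimum exposure before a breaker is allowed to evaluate its error rate at all. It assumes a steady request rate and ignores the rolling-window effect.

```typescript
// Hypothetical helper for back-of-the-envelope tuning; not an opossum API.
interface Exposure {
  requestsBeforeEvaluation: number
  millisBeforeEvaluation: number
}

function minimumExposure(volumeThreshold: number, requestsPerSecond: number): Exposure {
  if (requestsPerSecond <= 0) {
    throw new RangeError('requestsPerSecond must be positive')
  }
  return {
    // The breaker ignores the error rate until this many requests were made
    requestsBeforeEvaluation: volumeThreshold,
    // Time needed to accumulate that many requests at a steady rate
    millisBeforeEvaluation: (volumeThreshold * 1000) / requestsPerSecond,
  }
}

// volumeThreshold 50 vs 10, both at 100 req/s
const highThreshold = minimumExposure(50, 100) // 50 requests, 500 ms
const lowThreshold = minimumExposure(10, 100)  // 10 requests, 100 ms
```

Lowering the volume threshold from 50 to 10 cuts the unavoidable exposure from 500ms to 100ms at the same traffic level.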

Problem 2: Not Counting Timeouts as Failures

// ❌ Relying on defaults: only rejected promises count as failures
// opossum's default timeout is 10s, so a response that takes 9s still counts as a success
const breaker = new CircuitBreaker(
  () => fetch('http://payment-service/charge'),
  { errorThresholdPercentage: 50 }
  // No explicit timeout: slow requests aren't "errors" until they pass 10s
)

// The service is slow (responses just under 10s), not throwing errors
// Circuit sees 0% error rate, stays CLOSED, and the queue backs up

// ✅ Configure a tight timeout: slow = failure
const breaker = new CircuitBreaker(
  () => fetch('http://payment-service/charge'),
  {
    timeout: 3000,                   // 3s timeout counts as failure
    errorThresholdPercentage: 25,
    volumeThreshold: 10,
  }
)
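
A related detail worth checking in the snippet above: `fetch` resolves, rather than rejects, when the server answers with a 4xx or 5xx status, so a breaker wrapped directly around `fetch` only sees network errors and timeouts. One possible way to close that gap is a small helper that turns a non-2xx response into a rejection. `HttpError` and `ensureOk` are illustrative names, not part of opossum or the Fetch API.

```typescript
// Error that carries the HTTP status so an error filter can inspect it later.
class HttpError extends Error {
  constructor(public readonly status: number, message: string) {
    super(message)
    this.name = 'HttpError'
  }
}

// The only parts of a fetch Response this helper relies on.
interface ResponseLike {
  ok: boolean
  status: number
}

// Hypothetical helper: pass 2xx responses through, throw on everything else.
function ensureOk<R extends ResponseLike>(res: R): R {
  if (!res.ok) {
    throw new HttpError(res.status, `HTTP ${res.status}`)
  }
  return res
}

// Usage sketch, reusing the URL and options from the example above:
// const breaker = new CircuitBreaker(
//   () => fetch('http://payment-service/charge').then(ensureOk),
//   { timeout: 3000, errorThresholdPercentage: 25, volumeThreshold: 10 }
// )
```

With this in place a 503 from the payment service becomes a rejected promise, which the breaker can count.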

Problem 3: Per-Instance State (Not Shared)

// ❌ Each server process has its OWN circuit breaker state
// Server 1 circuit: OPEN (saw failures)
// Server 2 circuit: CLOSED (hasn't seen enough)
// Server 3 circuit: CLOSED
// Net effect: 2/3 of requests still go to failing service

class PaymentService {
  // bind keeps `this` pointing at the service when the breaker invokes charge
  private breaker = new CircuitBreaker(this.charge.bind(this), options)
  // This breaker lives only in this process's memory
}

// ✅ Option 1: Shared state via Redis
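import Redis from 'ioredis' // assumes the ioredis client

class CircuitOpenError extends Error {}

class RedisCircuitBreaker {
  private readonly FAILURE_KEY: string
  private readonly STATE_KEY: string
  private readonly PROBE_KEY: string
  private readonly SUCCESS_KEY: string

  constructor(
    private redis: Redis,
    private serviceName: string,
    private options = {
      failureThreshold: 10,
      failureWindowMs: 60_000,
      openDurationMs: 30_000,
      successThreshold: 3,
    }
  ) {
    this.FAILURE_KEY = `cb:failures:${serviceName}`
    this.STATE_KEY = `cb:state:${serviceName}`
    this.PROBE_KEY = `cb:probe:${serviceName}`
    this.SUCCESS_KEY = `cb:successes:${serviceName}`
  }

  async getState(): Promise<'CLOSED' | 'OPEN' | 'HALF_OPEN'> {
    // STATE_KEY ('OPEN') expires after openDurationMs. PROBE_KEY outlives it,
    // so an expired OPEN state reads as HALF_OPEN instead of jumping to CLOSED.
    const [state, probing] = await this.redis.mget(this.STATE_KEY, this.PROBE_KEY)
    if (state === 'OPEN') return 'OPEN'
    return probing ? 'HALF_OPEN' : 'CLOSED'
  }

  private async open(reason: string): Promise<void> {
    await this.redis
      .multi()
      .set(this.STATE_KEY, 'OPEN', 'PX', this.options.openDurationMs)
      .set(this.PROBE_KEY, '1')
      .del(this.FAILURE_KEY, this.SUCCESS_KEY)
      .exec()
    console.error(`Circuit OPENED for ${this.serviceName}: ${reason}`)
  }

  async recordFailure(): Promise<void> {
    // A failed probe while HALF_OPEN reopens the circuit immediately
    if ((await this.getState()) === 'HALF_OPEN') {
      return this.open('probe failed')
    }

    const pipe = this.redis.pipeline()
    pipe.incr(this.FAILURE_KEY)
    // The TTL is refreshed on every failure, so the window slides
    pipe.pexpire(this.FAILURE_KEY, this.options.failureWindowMs)
    const [[, count]] = (await pipe.exec()) as any

    if (count >= this.options.failureThreshold) {
      await this.open(`${count} failures`)
    }
  }

  async recordSuccess(): Promise<void> {
    const state = await this.getState()

    if (state === 'HALF_OPEN') {
      const successes = await this.redis.incr(this.SUCCESS_KEY)
      if (successes >= this.options.successThreshold) {
        await this.redis.del(this.PROBE_KEY, this.SUCCESS_KEY, this.FAILURE_KEY)
        console.log(`Circuit CLOSED for ${this.serviceName} (recovered)`)
      }
      return
    }

    // While CLOSED, any success resets the counter, so failureThreshold
    // effectively counts failures since the last success
    await this.redis.del(this.FAILURE_KEY)
  }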

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const state = await this.getState()

    if (state === 'OPEN') {
      throw new CircuitOpenError(`${this.serviceName} circuit is OPEN`)
    }

    try {
      const result = await fn()
      await this.recordSuccess()
      return result
    } catch (err) {
      await this.recordFailure()
      throw err
    }
  }
}
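
The state transitions such a breaker is meant to follow (CLOSED to OPEN on repeated failures, OPEN to HALF_OPEN after a cool-down, HALF_OPEN back to CLOSED after enough successful probes) are easy to get subtly wrong. The sketch below keeps the same state machine in memory with an injectable clock so the transitions can be checked without Redis. It is not shared across processes and is only meant to illustrate the transitions; the class name and constructor parameters are assumptions for this example.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN'

// In-memory illustration of the breaker state machine (single process only).
class InMemoryBreakerState {
  private failures = 0
  private successes = 0
  private openedAt: number | null = null
  private probing = false

  constructor(
    private readonly failureThreshold: number,
    private readonly openDurationMs: number,
    private readonly successThreshold: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  getState(): BreakerState {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.openDurationMs) return 'OPEN'
      // Cool-down elapsed: allow probe traffic instead of closing outright
      this.openedAt = null
      this.probing = true
    }
    return this.probing ? 'HALF_OPEN' : 'CLOSED'
  }

  recordFailure(): void {
    const state = this.getState()
    if (state === 'OPEN') return
    this.failures += 1
    // A single failed probe reopens; otherwise wait for the threshold
    if (state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.trip()
    }
  }

  recordSuccess(): void {
    const state = this.getState()
    if (state === 'HALF_OPEN') {
      this.successes += 1
      if (this.successes >= this.successThreshold) {
        this.probing = false
        this.successes = 0
        this.failures = 0
      }
    } else if (state === 'CLOSED') {
      this.failures = 0
    }
  }

  private trip(): void {
    this.openedAt = this.now()
    this.probing = false
    this.failures = 0
    this.successes = 0
  }
}
```

Passing a fake `now` function makes the cool-down deterministic in tests, which is harder to arrange when the expiry lives in Redis.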

Problem 4: Different Error Types Not Classified

// ❌ All errors treated equally
// 404 Not Found counts as a failure — but it's a client bug, not a service outage
// 400 Bad Request counted as failure — wrong!

// ✅ Only count errors that indicate service degradation
class SmartCircuitBreaker {
  private shouldCount(error: any): boolean {
    // Count: 5xx errors, timeouts, connection refused
    if (error.status >= 500) return true
    if (error.code === 'ECONNREFUSED') return true
    if (error.code === 'ETIMEDOUT') return true
    if (error.name === 'TimeoutError') return true

    // Don't count: client errors (4xx), validation failures
    if (error.status >= 400 && error.status < 500) return false

    return true
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    try {
      return await fn()
    } catch (err) {
      if (this.shouldCount(err)) {
        await this.recordFailure()
      }
      throw err
    }
  }
}

Fix: Production-Ready Circuit Breaker

import CircuitBreaker from 'opossum'
// metrics, alerting, and ServiceUnavailableError are assumed to come from your own codebase

function createBreaker<T>(
  name: string,
  fn: (...args: any[]) => Promise<T>
) {
  const breaker = new CircuitBreaker(fn, {
    name,
    timeout: 3000,                  // Fail requests > 3s
    errorThresholdPercentage: 25,   // Open at 25% error rate
    volumeThreshold: 10,            // Need 10 requests to evaluate
    resetTimeout: 15_000,           // Try half-open after 15s
    errorFilter: (err) => {
      // Returning true tells opossum NOT to count this error as a failure,
      // so 4xx client errors never push the circuit toward OPEN
      return err.status >= 400 && err.status < 500
    },
  })

  // Monitoring
  breaker.on('open', () => {
    console.error(`[CircuitBreaker] ${name} OPENED`)
    metrics.increment(`circuit_breaker.open`, { service: name })
    alerting.trigger(`Circuit breaker opened: ${name}`)
  })

  breaker.on('halfOpen', () => {
    console.warn(`[CircuitBreaker] ${name} HALF-OPEN — testing`)
    metrics.increment(`circuit_breaker.half_open`, { service: name })
  })

  breaker.on('close', () => {
    console.log(`[CircuitBreaker] ${name} CLOSED — recovered`)
    metrics.increment(`circuit_breaker.closed`, { service: name })
  })

  breaker.on('fallback', (result) => {
    metrics.increment(`circuit_breaker.fallback`, { service: name })
  })

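  // Fallback: fail fast with a typed error instead of waiting on the dependency
  // opossum also runs the fallback when an individual call fails or times out,
  // so callers see this error type in those cases too
  breaker.fallback(() => {
    throw new ServiceUnavailableError(`${name} is currently unavailable`)
  })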

  return breaker
}

// Usage
const paymentBreaker = createBreaker('payment-service', chargePayment)
const inventoryBreaker = createBreaker('inventory-service', checkStock)

// Health endpoint exposes circuit state
app.get('/health/circuits', (req, res) => {
  res.json({
    payment: paymentBreaker.status.stats,
    inventory: inventoryBreaker.status.stats,
  })
})
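
On the calling side, the error thrown by the fallback still has to be turned into something useful for the client. The sketch below shows one way a handler might do that, assuming the breaker exposes opossum's `fire(...args)` method and that the fallback throws a `ServiceUnavailableError` as configured above. The `Fireable` interface, `callThroughBreaker`, and the `retryAfterSeconds` field are illustrative names, and the 15-second hint simply mirrors the `resetTimeout` used in the factory.

```typescript
// Stand-in for the application's own error class (assumed, as in the factory above).
class ServiceUnavailableError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'ServiceUnavailableError'
  }
}

// Minimal shape of a breaker as used here: only fire() is needed.
interface Fireable<A extends unknown[], T> {
  fire(...args: A): Promise<T>
}

interface HttpResult {
  status: number
  body: unknown
}

// Hypothetical helper: map breaker outcomes to an HTTP-style result.
async function callThroughBreaker<A extends unknown[], T>(
  breaker: Fireable<A, T>,
  ...args: A
): Promise<HttpResult> {
  try {
    return { status: 200, body: await breaker.fire(...args) }
  } catch (err) {
    if (err instanceof Error && err.name === 'ServiceUnavailableError') {
      // Dependency is unavailable or the circuit is open: answer quickly
      // and tell the client when a retry is likely to be worthwhile
      return { status: 503, body: { error: err.message, retryAfterSeconds: 15 } }
    }
    // Anything else is unexpected here and should surface to the caller
    throw err
  }
}
```

Returning a fast 503 keeps request threads free while the dependency recovers, which is the point of opening the circuit in the first place.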

Monitoring Circuit Breaker State

// Track how often circuits open — leading indicator of service health
app.get('/metrics', (req, res) => {
  const stats = paymentBreaker.status.stats

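  res.json({
    // Report HALF_OPEN too, otherwise a probing circuit looks healthy
    state: paymentBreaker.opened
      ? 'OPEN'
      : paymentBreaker.halfOpen
        ? 'HALF_OPEN'
        : 'CLOSED',
    failures: stats.failures,
    successes: stats.successes,
    timeouts: stats.timeouts,
    fallbacks: stats.fallbacks,
    rejects: stats.rejects,  // Requests rejected while circuit open
    latency: {
      mean: stats.latencyMean,  // The mean is not the median; report it separately
      // opossum keys percentiles by fraction ('0.5', '0.99'), not by '50' or '99'
      p50: stats.percentiles['0.5'],
      p99: stats.percentiles['0.99'],
    },
  })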
})
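
Raw counters are easier to alert on once they are reduced to a rate. The sketch below derives an error percentage from a stats snapshot and compares it with the configured threshold. It assumes the snapshot carries `fires` and `failures` counters, as opossum's stats object does to my understanding; the interface and function names are illustrative.

```typescript
// Subset of a breaker stats snapshot used by this sketch (field names assumed).
interface BreakerStatsSnapshot {
  fires: number     // calls attempted in the current window
  failures: number  // calls counted as failed in the current window
}

// Percentage of attempted calls that failed; 0 when there was no traffic.
function errorRatePercent(stats: BreakerStatsSnapshot): number {
  if (stats.fires === 0) return 0
  return (stats.failures / stats.fires) * 100
}

// True when the window is close enough to the threshold to warrant a warning.
function isNearThreshold(
  stats: BreakerStatsSnapshot,
  errorThresholdPercentage: number,
  marginPercent = 5
): boolean {
  return errorRatePercent(stats) >= errorThresholdPercentage - marginPercent
}
```

Warning a few points below the threshold gives on-call engineers a signal before the circuit actually opens.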

Circuit Breaker Configuration Guide

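| Scenario | errorThresholdPercentage | volumeThreshold | timeout | resetTimeout |
| --- | --- | --- | --- | --- |
| Critical payment service | 10% | 5 | 2s | 30s |
| Non-critical recommendations | 50% | 20 | 5s | 10s |
| External third-party API | 25% | 10 | 3s | 60s |
| Internal microservice | 20% | 10 | 1s | 15s |
| Database connection | 30% | 5 | 5s | 30s |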
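
The guide above can be kept in code as a set of presets, so each breaker starts from a reviewed baseline rather than ad hoc numbers. The sketch below transcribes the values into option objects, with durations converted to milliseconds. The scenario keys and the `presetFor` helper are names chosen for this example.

```typescript
// Option fields match the names used throughout this article.
interface BreakerPreset {
  errorThresholdPercentage: number
  volumeThreshold: number
  timeout: number       // milliseconds
  resetTimeout: number  // milliseconds
}

type Scenario =
  | 'critical-payment'
  | 'non-critical-recommendations'
  | 'external-third-party-api'
  | 'internal-microservice'
  | 'database-connection'

// Values transcribed from the configuration guide.
const PRESETS: Record<Scenario, BreakerPreset> = {
  'critical-payment': { errorThresholdPercentage: 10, volumeThreshold: 5, timeout: 2_000, resetTimeout: 30_000 },
  'non-critical-recommendations': { errorThresholdPercentage: 50, volumeThreshold: 20, timeout: 5_000, resetTimeout: 10_000 },
  'external-third-party-api': { errorThresholdPercentage: 25, volumeThreshold: 10, timeout: 3_000, resetTimeout: 60_000 },
  'internal-microservice': { errorThresholdPercentage: 20, volumeThreshold: 10, timeout: 1_000, resetTimeout: 15_000 },
  'database-connection': { errorThresholdPercentage: 30, volumeThreshold: 5, timeout: 5_000, resetTimeout: 30_000 },
}

// Start from a preset and override only what a specific service needs.
function presetFor(scenario: Scenario, overrides: Partial<BreakerPreset> = {}): BreakerPreset {
  return { ...PRESETS[scenario], ...overrides }
}

const paymentOptions = presetFor('critical-payment')
const slowPartnerOptions = presetFor('external-third-party-api', { timeout: 4_000 })
```

These numbers are starting points; the right values depend on each dependency's normal latency and traffic volume.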

Conclusion

Circuit breakers fail when thresholds are too high, timeouts aren't counted, state isn't shared across instances, or client errors are wrongly counted as service failures. A properly tuned circuit breaker uses a low volume threshold (10 requests) to evaluate quickly, counts timeouts as failures, filters out 4xx client errors, shares state via Redis across all instances, and fires alerts the moment it opens. Without these, your circuit breaker is theater — it looks safe but provides no protection.