Health Check Patterns — Liveness, Readiness, and Deep Dependency Checks

Introduction

Health checks drive container restarts, traffic routing, and alerts. Shallow health checks (just "are you running?") miss real problems. Deep health checks (test every dependency) can cause cascading failures: when a shared dependency slows down, every instance fails its check at the same time. The sweet spot is testing what this instance needs in order to serve traffic, such as its database connection and message queue, and failing gracefully when dependencies are slow or down. We'll explore liveness vs readiness probes, dependency aggregation, and caching to prevent a thundering herd.

Kubernetes Liveness vs Readiness vs Startup Probes

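Liveness: Is the service still alive? If the probe fails repeatedly, Kubernetes restarts the container. Readiness: Can the service handle traffic right now? If the probe fails, Kubernetes stops routing Service traffic to the pod until it passes again. Startup: Has the service finished initializing? Kubernetes holds off liveness and readiness checks until the startup probe succeeds, so a slow-starting service is not restarted prematurely.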

import express from 'express';

class HealthCheckServer {
  private app = express();
  private isReady = false;
  private isDead = false;

  constructor() {
    this.setupRoutes();
  }

  private setupRoutes(): void {
    // Liveness probe: returns 200 if still alive, 500 if dead
    // Kubernetes restarts if this fails 3 times in a row
    this.app.get('/healthz', async (req, res) => {
      if (this.isDead) {
        return res.status(500).json({ error: 'Service is dead' });
      }

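      // Simple check: is the heap close to exhausted?
      // Note: heapTotal is the heap V8 has allocated so far, not the configured
      // limit, so treat this ratio as a rough signal rather than a hard ceiling.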
      try {
        const memUsage = process.memoryUsage();
        if (memUsage.heapUsed > memUsage.heapTotal * 0.95) {
          this.isDead = true;
          return res.status(500).json({ error: 'Out of memory' });
        }
        return res.status(200).json({ status: 'alive' });
      } catch (error) {
        return res.status(500).json({ error: 'Unknown error' });
      }
    });

    // Readiness probe: can we serve traffic?
    // Kubernetes removes from load balancer if this fails
    this.app.get('/ready', async (req, res) => {
      if (!this.isReady) {
        return res.status(503).json({ error: 'Not ready' });
      }

      // Check critical dependencies
      const health = await this.checkDependencies();
      if (!health.database.ok || !health.cache.ok) {
        return res.status(503).json({ error: 'Dependencies down', health });
      }

      return res.status(200).json({ status: 'ready', health });
    });

    // Startup probe: has initialization completed?
    // Kubernetes waits for this before using liveness/readiness probes
    this.app.get('/startup', async (req, res) => {
      if (!this.isReady) {
        return res.status(503).json({ error: 'Still initializing' });
      }
      return res.status(200).json({ status: 'started' });
    });
  }

  async initialize(): Promise<void> {
    try {
      // Connect to database
      await this.db.connect();

      // Load configuration
      await this.loadConfig();

      // Warm up caches
      await this.warmCache();

      this.isReady = true;
      console.log('Service ready');
    } catch (error) {
      console.error('Initialization failed:', error);
      throw error;
    }
  }

  private async checkDependencies(): Promise<{
    database: { ok: boolean; latency: number };
    cache: { ok: boolean; latency: number };
  }> {
    // Time each dependency on its own so a cache failure is not reported
    // as a database failure (and vice versa).
    const timed = async (check: () => Promise<unknown>) => {
      const start = Date.now();
      try {
        await check();
        return { ok: true, latency: Date.now() - start };
      } catch {
        return { ok: false, latency: Date.now() - start };
      }
    };

    // Run both quick checks in parallel
    const [database, cache] = await Promise.all([
      timed(() => this.db.query('SELECT 1')),
      timed(() => this.cache.ping()),
    ]);

    return { database, cache };
  }

  private async loadConfig(): Promise<void> {
    // Implementation
  }

  private async warmCache(): Promise<void> {
    // Implementation
  }

  private db: any;
  private cache: any;
}
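
The comment on the liveness route mentions three failures in a row. The bookkeeping behind that rule can be sketched as a small state machine. This is a simplified model for intuition, not Kubernetes code: `ProbeTracker` and `ProbeVerdict` are hypothetical names, and the model assumes the verdict starts out as passing. The `failureThreshold` and `successThreshold` parameters mirror the probe settings of the same names.

```typescript
// Simplified model of how a prober turns individual results into a verdict.
// ProbeTracker is a hypothetical illustration, not part of Kubernetes.
type ProbeVerdict = 'passing' | 'failing';

class ProbeTracker {
  private verdict: ProbeVerdict = 'passing'; // assumption: start out passing
  private consecutiveFailures = 0;
  private consecutiveSuccesses = 0;
  private failureThreshold: number; // failures in a row before "failing"
  private successThreshold: number; // successes in a row before "passing" again

  constructor(failureThreshold = 3, successThreshold = 1) {
    this.failureThreshold = failureThreshold;
    this.successThreshold = successThreshold;
  }

  // Record one probe result and return the current verdict.
  record(success: boolean): ProbeVerdict {
    if (success) {
      this.consecutiveSuccesses += 1;
      this.consecutiveFailures = 0;
      if (this.consecutiveSuccesses >= this.successThreshold) {
        this.verdict = 'passing';
      }
    } else {
      this.consecutiveFailures += 1;
      this.consecutiveSuccesses = 0;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.verdict = 'failing';
      }
    }
    return this.verdict;
  }
}

// Usage: a success in between resets the failure streak,
// so only the final run of three failures flips the verdict.
const tracker = new ProbeTracker(3, 1);
const verdicts = [false, false, true, false, false, false].map(ok =>
  tracker.record(ok)
);
console.log(verdicts);
```

The point of the model is that a single slow response does not trigger a restart; only a sustained run of failures does, which is why the liveness handler should stay cheap and deterministic.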

Shallow vs Deep Health Checks

Shallow: Just check if the service process is running (CPU, memory, can bind to port). Deep: Query all dependencies (database, cache, queue, external APIs).

class HealthCheckStrategy {
  // BAD: Deep health check causes cascading failures
  async deepHealthCheck(): Promise<HealthStatus> {
    const results = await Promise.all([
      this.testDatabase(),
      this.testRedisCache(),
      this.testKafka(),
      this.testExternalAPI(),
      this.testS3Storage(),
    ]);

    return {
      ok: results.every(r => r.ok),
      dependencies: results,
    };
  }

  // GOOD: Shallow + critical dependencies only
  async smartHealthCheck(): Promise<HealthStatus> {
    // Always check: critical to our operation
    const critical = await Promise.all([
      this.testDatabase(),
      this.testLocalCache(),
    ]);

    // Check optionally: don't fail readiness if down
    const optional = await Promise.allSettled([
      this.testExternalAPI(),
      this.testS3Storage(),
    ]);

    return {
      ok: critical.every(r => r.ok),
      critical: critical,
      optional: optional
        .filter(r => r.status === 'fulfilled')
        .map((r: any) => r.value),
    };
  }

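  // BEST: Cache health check result; avoid thundering herd
  private cachedHealth: HealthStatus | null = null;
  private lastCheck = 0;
  private cacheInterval = 5000; // 5 seconds
  private refreshing = false; // true while a background check is in flight

  async cachedHealthCheck(): Promise<HealthStatus> {
    const now = Date.now();

    // Return cached result if recent
    if (this.cachedHealth && now - this.lastCheck < this.cacheInterval) {
      return this.cachedHealth;
    }

    // Refresh in the background, but never start a second check while one
    // is still running; otherwise every probe hit on a stale cache would
    // launch its own dependency check.
    if (!this.refreshing) {
      this.refreshing = true;
      this.performHealthCheckAsync()
        .catch(err => {
          console.error('Background health check failed:', err);
          // Keep serving with cached result even if fresh check fails
        })
        .finally(() => {
          this.refreshing = false;
        });
    }

    // Return the stale result, or an optimistic default until the first
    // check has completed
    return (
      this.cachedHealth || {
        ok: true,
        cached: true,
        lastCheck: this.lastCheck,
      }
    );
  }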

  private async performHealthCheckAsync(): Promise<void> {
    const [db, cache] = await Promise.all([
      this.testDatabase(),
      this.testLocalCache(),
    ]);

    this.cachedHealth = {
      ok: db.ok && cache.ok,
      database: db,
      cache: cache,
    };
    this.lastCheck = Date.now();
  }

  private async testDatabase(): Promise<{ ok: boolean; latency: number }> {
    const start = Date.now();
    try {
      await this.db.query('SELECT 1');
      return { ok: true, latency: Date.now() - start };
    } catch {
      return { ok: false, latency: Date.now() - start };
    }
  }

  private async testLocalCache(): Promise<{ ok: boolean; latency: number }> {
    const start = Date.now();
    try {
      await this.cache.ping();
      return { ok: true, latency: Date.now() - start };
    } catch {
      return { ok: false, latency: Date.now() - start };
    }
  }

  private async testKafka(): Promise<{ ok: boolean; latency: number }> {
    const start = Date.now();
    try {
      // Reuse one admin client so the connection we open is the one we close
      const admin = this.kafka.admin();
      await admin.connect();
      await admin.disconnect();
      return { ok: true, latency: Date.now() - start };
    } catch {
      return { ok: false, latency: Date.now() - start };
    }
  }

  private async testExternalAPI(): Promise<{ ok: boolean; latency: number }> {
    const start = Date.now();
    try {
      // Standard fetch has no `timeout` option; abort via a signal instead
      const response = await fetch('https://api.example.com/health', {
        signal: AbortSignal.timeout(2000),
      });
      return { ok: response.ok, latency: Date.now() - start };
    } catch {
      return { ok: false, latency: Date.now() - start };
    }
  }

  private async testS3Storage(): Promise<{ ok: boolean; latency: number }> {
    const start = Date.now();
    try {
      await this.s3.headBucket({ Bucket: 'my-bucket' });
      return { ok: true, latency: Date.now() - start };
    } catch {
      return { ok: false, latency: Date.now() - start };
    }
  }

  private db: any;
  private cache: any;
  private kafka: any;
  private s3: any;
}

interface HealthStatus {
  ok: boolean;
  database?: { ok: boolean; latency: number };
  cache?: { ok: boolean; latency: number };
  critical?: Array<{ ok: boolean; latency: number }>;
  optional?: Array<{ ok: boolean; latency: number }>;
  cached?: boolean;
  lastCheck?: number;
}
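
The rule that only critical dependencies should fail readiness can be pulled out into a small pure function, which makes it easy to unit test without any real dependencies. The sketch below is illustrative only: `CheckResult`, `ReadinessSummary`, and `summarizeChecks` are hypothetical names that do not appear in the classes above.

```typescript
// Hypothetical shape for one dependency check result.
interface CheckResult {
  label: string;
  ok: boolean;
  critical: boolean;
}

// Hypothetical summary that a readiness handler could send back.
interface ReadinessSummary {
  httpStatus: 200 | 503;
  state: 'ok' | 'degraded' | 'unavailable';
  failing: string[];
}

// Only a failed critical dependency makes the instance unavailable;
// failed optional dependencies are reported as "degraded" with a 200.
function summarizeChecks(results: CheckResult[]): ReadinessSummary {
  const failing = results.filter(r => !r.ok);
  const criticalDown = failing.some(r => r.critical);
  return {
    httpStatus: criticalDown ? 503 : 200,
    state: criticalDown ? 'unavailable' : failing.length > 0 ? 'degraded' : 'ok',
    failing: failing.map(r => r.label),
  };
}

// Usage: S3 is down but optional, so the instance keeps receiving traffic.
const summary = summarizeChecks([
  { label: 'database', ok: true, critical: true },
  { label: 's3', ok: false, critical: false },
]);
console.log(summary.httpStatus, summary.state, summary.failing);
```

Keeping the decision separate from the I/O also makes it obvious which dependencies have been declared critical, which is one of the things the checklist asks you to document.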

Dependency Health Aggregation

Aggregate health of all dependencies and report overall status.

class DependencyHealthAggregator {
  private dependencies: Map<
    string,
    { checker: () => Promise<boolean>; critical: boolean }
  > = new Map();

  registerDependency(
    name: string,
    checker: () => Promise<boolean>,
    critical = false
  ): void {
    this.dependencies.set(name, { checker, critical });
  }

  async getHealthStatus(): Promise<AggregatedHealth> {
    const checks = Array.from(this.dependencies.entries()).map(
      async ([name, { checker, critical }]) => {
        const start = Date.now();
        try {
          const ok = await Promise.race([
            checker(),
            this.createTimeout(2000),
          ]);
          return {
            name,
            ok,
            latency: Date.now() - start,
            critical,
            timedOut: false,
          };
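        } catch (error) {
          const message = (error as Error).message;
          return {
            name,
            ok: false,
            latency: Date.now() - start,
            critical,
            // Only flag a timeout when the timer won the race, not on every error
            timedOut: message === 'Timeout',
            error: message,
          };
        }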
      }
    );

    const results = await Promise.all(checks);

    // Aggregate
    const criticalFailed = results.filter(r => r.critical && !r.ok);
    const overallOk = criticalFailed.length === 0;

    return {
      ok: overallOk,
      timestamp: new Date().toISOString(),
      dependencies: results,
      critical: criticalFailed.length === 0,
      issues: criticalFailed.map(r => `${r.name} is down`),
    };
  }

  private createTimeout(ms: number): Promise<boolean> {
    return new Promise((_, reject) => {
      setTimeout(() => reject(new Error('Timeout')), ms);
    });
  }
}

interface AggregatedHealth {
  ok: boolean;
  timestamp: string;
  dependencies: Array<{
    name: string;
    ok: boolean;
    latency: number;
    critical: boolean;
    timedOut: boolean;
    error?: string;
  }>;
  critical: boolean;
  issues: string[];
}

// Usage
const aggregator = new DependencyHealthAggregator();
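// Each checker must resolve to a boolean; a rejected promise counts as a failure
aggregator.registerDependency(
  'database',
  () => db.query('SELECT 1').then(() => true),
  true
);
aggregator.registerDependency(
  'redis',
  () => redis.ping().then(() => true),
  true
);
aggregator.registerDependency(
  's3',
  () => s3.headBucket({ Bucket: 'my-bucket' }).then(() => true),
  false
);
aggregator.registerDependency(
  'external-api',
  // fetch resolves even on HTTP 5xx, so report the response status explicitly
  () => fetch('https://api.example.com/health').then(res => res.ok),
  false
);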

app.get('/health', async (req, res) => {
  const health = await aggregator.getHealthStatus();
  res.status(health.ok ? 200 : 503).json(health);
});
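
The aggregator races each checker against a timer. One variation worth considering clears the timer as soon as the race settles and gives timeouts their own error name, so a slow dependency and a broken one can be told apart. The sketch below is a minimal illustration under those assumptions; `TimeoutError`, `withTimeout`, and `runCheck` are hypothetical names, not part of the aggregator above.

```typescript
// Hypothetical error type so callers can recognise a timeout by name.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms} ms`);
    this.name = 'TimeoutError';
  }
}

// Reject with TimeoutError if `work` has not settled within `ms`.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer whichever side wins, so no stray timers pile up.
    clearTimeout(timer);
  }
}

// Run one check and report whether a failure was a timeout or a real error.
async function runCheck(
  checker: () => Promise<boolean>,
  ms: number
): Promise<{ ok: boolean; timedOut: boolean }> {
  try {
    return { ok: await withTimeout(checker(), ms), timedOut: false };
  } catch (error) {
    const timedOut = error instanceof Error && error.name === 'TimeoutError';
    return { ok: false, timedOut };
  }
}
```

Note that a timeout only stops waiting; the underlying call keeps running unless the client library offers its own cancellation.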

Health Check as Operational Signal

Health checks should signal operational state, not just service availability.

class OperationalHealthCheck {
  private state: 'HEALTHY' | 'DEGRADED' | 'FAILING' = 'HEALTHY';
  private degradationReason = '';

  async checkAndReport(): Promise<HealthReport> {
    const cpuUsage = process.cpuUsage();
    const memUsage = process.memoryUsage();
    const uptime = process.uptime();

    // Detect issues
    const heapUsagePercent = (memUsage.heapUsed / memUsage.heapTotal) * 100;
    const isMemoryHighUsage = heapUsagePercent > 80;
    const isMemoryCritical = heapUsagePercent > 95;

    // Check response latency
    const avgLatency = await this.getAverageLatency();
    const isLatencyHigh = avgLatency > 500;

    // Update state
    if (isMemoryCritical) {
      this.state = 'FAILING';
      this.degradationReason = 'Critical memory usage';
    } else if (isMemoryHighUsage || isLatencyHigh) {
      this.state = 'DEGRADED';
      this.degradationReason = [
        isMemoryHighUsage ? 'High memory usage' : '',
        isLatencyHigh ? 'High latency' : '',
      ]
        .filter(Boolean)
        .join(', ');
    } else {
      this.state = 'HEALTHY';
      this.degradationReason = '';
    }

    return {
      status: this.state,
      reason: this.degradationReason,
      metrics: {
        heapUsagePercent,
        uptime,
        averageLatency: avgLatency,
      },
      recommendation:
        this.state === 'FAILING'
          ? 'Schedule restart immediately'
          : this.state === 'DEGRADED'
            ? 'Monitor and prepare for restart'
            : 'All systems normal',
    };
  }

  private async getAverageLatency(): Promise<number> {
    // Implementation: return average request latency
    return 0;
  }
}

interface HealthReport {
  status: 'HEALTHY' | 'DEGRADED' | 'FAILING';
  reason: string;
  metrics: {
    heapUsagePercent: number;
    uptime: number;
    averageLatency: number;
  };
  recommendation: string;
}
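
The `getAverageLatency` method above is left as a stub. One possible way to back it, shown here as a sketch rather than a prescribed implementation, is a fixed-size window over the most recent request latencies. `LatencyWindow` is a hypothetical name, and the window size is an assumption you would tune for your own traffic.

```typescript
// Hypothetical fixed-size window of recent request latencies (milliseconds).
class LatencyWindow {
  private samples: number[] = [];
  private next = 0; // index of the slot to overwrite once the window is full
  private capacity: number;

  constructor(capacity = 100) {
    this.capacity = capacity;
  }

  // Add one latency sample, evicting the oldest when the window is full.
  record(latencyMs: number): void {
    if (this.samples.length < this.capacity) {
      this.samples.push(latencyMs);
    } else {
      this.samples[this.next] = latencyMs;
      this.next = (this.next + 1) % this.capacity;
    }
  }

  // Mean of the samples currently in the window; 0 when there are none.
  average(): number {
    if (this.samples.length === 0) return 0;
    const sum = this.samples.reduce((total, value) => total + value, 0);
    return sum / this.samples.length;
  }
}

// Usage: with a window of three, the fourth sample evicts the first.
const recentLatencies = new LatencyWindow(3);
[100, 200, 300].forEach(ms => recentLatencies.record(ms));
const beforeSpike = recentLatencies.average(); // (100 + 200 + 300) / 3
recentLatencies.record(900);
const afterSpike = recentLatencies.average(); // (900 + 200 + 300) / 3
console.log(beforeSpike, afterSpike);
```

A request middleware could call `record()` when each response finishes, and `getAverageLatency` could then return `average()`. A bounded window keeps the signal focused on recent behavior instead of the whole process lifetime.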

Health Check Caching to Avoid Thundering Herd

Many instances checking dependencies simultaneously can overwhelm them.

class CachedHealthChecker {
  private cache = new Map<string, { result: boolean; timestamp: number }>();
  private cacheInterval = 5000; // 5 seconds
  private checking = new Map<string, Promise<boolean>>();

  async checkHealth(key: string, checker: () => Promise<boolean>): Promise<boolean> {
    const cached = this.cache.get(key);
    const now = Date.now();

    // Return cached if fresh
    if (cached && now - cached.timestamp < this.cacheInterval) {
      return cached.result;
    }

    // If already checking, return that promise
    if (this.checking.has(key)) {
      return this.checking.get(key)!;
    }

    // Perform check
    const promise = checker()
      .then(result => {
        // Stamp the completion time, not the time the check was requested
        this.cache.set(key, { result, timestamp: Date.now() });
        this.checking.delete(key);
        return result;
      })
      .catch(error => {
        // Keep last known state on error
        const lastKnown = cached?.result ?? true;
        this.checking.delete(key);
        return lastKnown;
      });

    this.checking.set(key, promise);
    return promise;
  }
}

// Usage with jitter to prevent thundering herd
class HealthCheckWithJitter {
  private baseInterval = 10000; // 10 seconds
  private jitterRange = 2000; // +/- 1 second

  startHealthChecks(checker: () => Promise<void>): void {
    // Randomize start time
    const startDelay = Math.random() * this.jitterRange;

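    // Re-arm a one-shot timer after each run so every interval gets fresh
    // jitter and a slow check can never overlap with the next one
    const run = async (): Promise<void> => {
      try {
        await checker();
      } catch (err) {
        console.error('Health check failed:', err);
      }
      const jitter = (Math.random() - 0.5) * this.jitterRange;
      setTimeout(run, this.baseInterval + jitter);
    };

    setTimeout(run, startDelay);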
  }
}
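
The jitter arithmetic is easier to reason about, and to test, when it lives in a pure function with an injectable random source. The sketch below also returns a stop function so the loop can be shut down cleanly, which the class above does not offer. `jitteredDelay` and `startJitteredLoop` are hypothetical names used only for illustration.

```typescript
// Hypothetical helper: delay before the next check, spread evenly across
// [baseMs - rangeMs / 2, baseMs + rangeMs / 2). `random` is injectable for tests.
function jitteredDelay(
  baseMs: number,
  rangeMs: number,
  random: () => number = Math.random
): number {
  return baseMs + (random() - 0.5) * rangeMs;
}

// Run `task` repeatedly with a jittered pause between runs.
// Returns a function that stops the loop.
function startJitteredLoop(
  task: () => Promise<void>,
  baseMs: number,
  rangeMs: number
): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = async (): Promise<void> => {
    if (stopped) return;
    try {
      await task();
    } catch (err) {
      console.error('Scheduled health check failed:', err);
    }
    // Schedule the next run only after this one has finished.
    if (!stopped) {
      timer = setTimeout(tick, jitteredDelay(baseMs, rangeMs));
    }
  };

  // Randomize the first run so instances started together drift apart.
  timer = setTimeout(tick, Math.random() * rangeMs);

  return () => {
    stopped = true;
    clearTimeout(timer);
  };
}

// With a 10 second base and a 2 second range, delays span 9 to 11 seconds.
const shortestDelay = jitteredDelay(10000, 2000, () => 0); // 9000
const midpointDelay = jitteredDelay(10000, 2000, () => 0.5); // 10000
console.log(shortestDelay, midpointDelay);
```

Calling the returned stop function during shutdown prevents a pending timer from keeping the process alive.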

Graceful Degradation

Don't fail completely when a dependency is slow; degrade gracefully.

class GracefulDegradation {
  async handleWithFallback<T>(
    primary: () => Promise<T>,
    fallback: () => Promise<T>,
    timeout = 2000
  ): Promise<T> {
    try {
      return await Promise.race([
        primary(),
        this.createTimeout(timeout),
      ]);
    } catch (error) {
      console.warn('Primary failed, using fallback:', error);
      return fallback();
    }
  }

  async handleWithDegradation<T>(
    operation: () => Promise<T>,
    degradedVersion: () => Promise<Partial<T>>,
    timeout = 2000
  ): Promise<T | Partial<T>> {
    try {
      return await Promise.race([
        operation(),
        this.createTimeout(timeout),
      ]);
    } catch (error) {
      // The catch covers both timeouts and ordinary failures, so log the cause
      console.warn('Operation failed or timed out, returning degraded response:', error);
      return degradedVersion();
    }
  }

  private createTimeout(ms: number): Promise<never> {
    return new Promise((_, reject) => {
      setTimeout(() => reject(new Error('Timeout')), ms);
    });
  }
}

// Example: Product catalog with graceful degradation
class ProductCatalog {
  // The degraded path returns only some fields, so the result may be partial
  async getProduct(productId: string): Promise<Product | Partial<Product>> {
    return this.degradation.handleWithDegradation(
      () => this.getFullProduct(productId), // Full product with recommendations
      () => this.getBasicProduct(productId) // Just ID, name, price
    );
  }

  private async getFullProduct(productId: string): Promise<Product> {
    const [product, reviews, recommendations] = await Promise.all([
      this.db.query('SELECT * FROM products WHERE id = $1', [productId]),
      this.reviewService.getReviews(productId),
      this.recommendationEngine.getRelated(productId),
    ]);

    return { ...product, reviews, recommendations };
  }

  private async getBasicProduct(productId: string): Promise<Partial<Product>> {
    const product = await this.db.query('SELECT id, name, price FROM products WHERE id = $1', [
      productId,
    ]);
    return product;
  }

  private degradation = new GracefulDegradation();
  private db: any;
  private reviewService: any;
  private recommendationEngine: any;
}

interface Product {
  id: string;
  name: string;
  price: number;
  reviews?: any[];
  recommendations?: any[];
}
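
When a response may come from either path, callers often need to know which one they got, for example to skip caching a degraded result. One way to express that, sketched here under the assumption that a simple boolean flag is enough, is to wrap the value in an outcome object. `Outcome` and `withFallback` are hypothetical names and are not part of the classes above.

```typescript
// Hypothetical wrapper that records which path produced the value.
interface Outcome<T> {
  value: T;
  degraded: boolean; // true when the fallback produced the value
}

// Try `primary` within `timeoutMs`; on failure or timeout use `fallback`.
async function withFallback<Full, Basic>(
  primary: () => Promise<Full>,
  fallback: () => Promise<Basic>,
  timeoutMs = 2000
): Promise<Outcome<Full | Basic>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    const value = await Promise.race([primary(), timeout]);
    return { value, degraded: false };
  } catch {
    // Primary failed or was too slow: serve the reduced version instead.
    return { value: await fallback(), degraded: true };
  } finally {
    clearTimeout(timer); // avoid leaving the timer pending after we return
  }
}
```

A handler could then set a response header or skip cache writes whenever `degraded` is true, so reduced responses do not get mistaken for complete ones.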

Checklist

  • Implement liveness, readiness, and startup probes for Kubernetes
  • Check only critical dependencies in readiness probes
  • Cache health check results to prevent thundering herd
  • Set timeouts on dependency checks (2-3 seconds max)
  • Aggregate health across dependencies intelligently
  • Use graceful degradation for non-critical features
  • Signal operational issues (memory, CPU) in health responses
  • Monitor health check response times
  • Test health checks under load
  • Document what each health check actually validates

Conclusion

Health checks are your system's heartbeat. They drive orchestration decisions, alerting, and failover. Keep them simple and fast: test only what's necessary to know you can serve traffic. Cache results to avoid overwhelming dependencies. When a dependency is slow, degrade gracefully rather than failing completely. And always remember: a health check is an operational signal, not just a yes/no answer.