Health Check Patterns — Liveness, Readiness, and Deep Dependency Checks
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Health checks drive container restarts, traffic routing, and alerts. Shallow checks that only ask "are you running?" miss real problems, while deep checks that test every dependency can turn one outage into a cascading failure. The sweet spot is to test what this instance needs in order to serve traffic, such as its database connections and message queue, and to fail gracefully when other dependencies are slow or down. We'll explore liveness vs readiness probes, dependency aggregation, and caching to prevent a thundering herd.
- Kubernetes Liveness vs Readiness vs Startup Probes
- Shallow vs Deep Health Checks
- Dependency Health Aggregation
- Health Check as Operational Signal
- Health Check Caching to Avoid Thundering Herd
- Graceful Degradation
- Checklist
- Conclusion
Kubernetes Liveness vs Readiness vs Startup Probes
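Liveness: Is the service still alive? If it fails repeatedly, Kubernetes restarts the container. Readiness: Can the service handle traffic? If it fails, Kubernetes removes the pod from the Service endpoints, so the load balancer stops sending it traffic. Startup: Has the service finished initializing? Kubernetes holds off the liveness and readiness probes until it succeeds, and restarts the container if it never does.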
import express from 'express';
class HealthCheckServer {
private app = express();
private isReady = false;
private isDead = false;
constructor() {
this.setupRoutes();
}
private setupRoutes(): void {
// Liveness probe: returns 200 if still alive, 500 if dead
// Kubernetes restarts the container after failureThreshold consecutive failures (default: 3)
this.app.get('/healthz', async (req, res) => {
if (this.isDead) {
return res.status(500).json({ error: 'Service is dead' });
}
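// Simple check: is the heap nearly exhausted?
// Caveat: heapTotal is the heap V8 has allocated so far and grows on demand,
// so treat this ratio as a rough heuristic rather than a hard out-of-memory signal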
try {
const memUsage = process.memoryUsage();
if (memUsage.heapUsed > memUsage.heapTotal * 0.95) {
this.isDead = true;
return res.status(500).json({ error: 'Out of memory' });
}
return res.status(200).json({ status: 'alive' });
} catch (error) {
return res.status(500).json({ error: 'Unknown error' });
}
});
// Readiness probe: can we serve traffic?
// Kubernetes removes from load balancer if this fails
this.app.get('/ready', async (req, res) => {
if (!this.isReady) {
return res.status(503).json({ error: 'Not ready' });
}
// Check critical dependencies
const health = await this.checkDependencies();
if (!health.database.ok || !health.cache.ok) {
return res.status(503).json({ error: 'Dependencies down', health });
}
return res.status(200).json({ status: 'ready', health });
});
// Startup probe: has initialization completed?
// Kubernetes waits for this before using liveness/readiness probes
this.app.get('/startup', async (req, res) => {
if (!this.isReady) {
return res.status(503).json({ error: 'Still initializing' });
}
return res.status(200).json({ status: 'started' });
});
}
async initialize(): Promise<void> {
try {
// Connect to database
await this.db.connect();
// Load configuration
await this.loadConfig();
// Warm up caches
await this.warmCache();
this.isReady = true;
console.log('Service ready');
} catch (error) {
console.error('Initialization failed:', error);
throw error;
}
}
private async checkDependencies(): Promise<{
database: { ok: boolean; latency: number };
cache: { ok: boolean; latency: number };
}> {
// Time each check on its own, so a cache failure is never reported
// as a database failure (or the other way around)
const timed = async (check: () => Promise<unknown>) => {
const start = Date.now();
try {
await check();
return { ok: true, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
};
// Run both quick checks in parallel
const [database, cache] = await Promise.all([
timed(() => this.db.query('SELECT 1')),
timed(() => this.cache.ping()),
]);
return { database, cache };
}
private async loadConfig(): Promise<void> {
// Implementation
}
private async warmCache(): Promise<void> {
// Implementation
}
private db: any;
private cache: any;
}
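The probe descriptions above translate into simple timing arithmetic. The sketch below is illustrative only: `ProbeTiming`, `startupBudgetSeconds`, and `reactionTimeSeconds` are hypothetical names, and the field names merely mirror the Kubernetes probe settings. It treats the reaction time as roughly `failureThreshold * periodSeconds`, which ignores per-probe timeouts and scheduling slack.

```typescript
// Hypothetical type for illustration; field names mirror Kubernetes probe
// settings but this is not an official client type.
interface ProbeTiming {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Approximate time the app has to finish starting before the startup probe
// gives up and the container is restarted.
function startupBudgetSeconds(probe: ProbeTiming): number {
  return probe.initialDelaySeconds + probe.failureThreshold * probe.periodSeconds;
}

// Approximate time a liveness or readiness probe needs to react to a
// persistent failure (consecutive failures times the probe period).
function reactionTimeSeconds(probe: ProbeTiming): number {
  return probe.failureThreshold * probe.periodSeconds;
}

const startupProbe: ProbeTiming = { initialDelaySeconds: 0, periodSeconds: 10, failureThreshold: 30 };
const livenessProbe: ProbeTiming = { initialDelaySeconds: 0, periodSeconds: 10, failureThreshold: 3 };

console.log(startupBudgetSeconds(startupProbe)); // 300
console.log(reactionTimeSeconds(livenessProbe)); // 30
```

With these example values, a slow-starting service gets about five minutes to initialize, while a hung service is restarted after about thirty seconds of failed liveness checks.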
Shallow vs Deep Health Checks
Shallow: Just check if the service process is running (CPU, memory, can bind to port). Deep: Query all dependencies (database, cache, queue, external APIs).
class HealthCheckStrategy {
// BAD: Deep health check causes cascading failures
async deepHealthCheck(): Promise<HealthStatus> {
const results = await Promise.all([
this.testDatabase(),
this.testRedisCache(),
this.testKafka(),
this.testExternalAPI(),
this.testS3Storage(),
]);
return {
ok: results.every(r => r.ok),
dependencies: results,
};
}
// GOOD: Shallow + critical dependencies only
async smartHealthCheck(): Promise<HealthStatus> {
// Always check: critical to our operation
const critical = await Promise.all([
this.testDatabase(),
this.testLocalCache(),
]);
// Check optionally: don't fail readiness if down
const optional = await Promise.allSettled([
this.testExternalAPI(),
this.testS3Storage(),
]);
return {
ok: critical.every(r => r.ok),
critical: critical,
optional: optional
.filter(r => r.status === 'fulfilled')
.map((r: any) => r.value),
};
}
// BEST: Cache health check result; avoid thundering herd
private cachedHealth: HealthStatus | null = null;
private lastCheck = 0;
private cacheInterval = 5000; // 5 seconds
private refreshInFlight: Promise<void> | null = null;
async cachedHealthCheck(): Promise<HealthStatus> {
const now = Date.now();
// Return cached result if recent
if (this.cachedHealth && now - this.lastCheck < this.cacheInterval) {
return this.cachedHealth;
}
// Refresh in the background, but start at most one refresh at a time
if (!this.refreshInFlight) {
this.refreshInFlight = this.performHealthCheckAsync()
.catch(err => {
console.error('Background health check failed:', err);
// Keep serving with cached result even if fresh check fails
})
.then(() => {
this.refreshInFlight = null;
});
}
// Return the stale result, or an optimistic default before the first check completes
return (
this.cachedHealth || {
ok: true,
cached: true,
lastCheck: this.lastCheck,
}
);
}
private async performHealthCheckAsync(): Promise<void> {
const [db, cache] = await Promise.all([
this.testDatabase(),
this.testLocalCache(),
]);
this.cachedHealth = {
ok: db.ok && cache.ok,
database: db,
cache: cache,
};
this.lastCheck = Date.now();
}
private async testDatabase(): Promise<{ ok: boolean; latency: number }> {
const start = Date.now();
try {
await this.db.query('SELECT 1');
return { ok: true, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
}
private async testLocalCache(): Promise<{ ok: boolean; latency: number }> {
const start = Date.now();
try {
await this.cache.ping();
return { ok: true, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
}
private async testKafka(): Promise<{ ok: boolean; latency: number }> {
const start = Date.now();
try {
// Connect and disconnect the same admin client so no connection is left open
const admin = this.kafka.admin();
await admin.connect();
await admin.disconnect();
return { ok: true, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
}
private async testExternalAPI(): Promise<{ ok: boolean; latency: number }> {
const start = Date.now();
try {
const response = await fetch('https://api.example.com/health', {
// Standard fetch has no `timeout` option; abort through a signal instead
signal: AbortSignal.timeout(2000),
});
return { ok: response.ok, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
}
private async testS3Storage(): Promise<{ ok: boolean; latency: number }> {
const start = Date.now();
try {
await this.s3.headBucket({ Bucket: 'my-bucket' });
return { ok: true, latency: Date.now() - start };
} catch {
return { ok: false, latency: Date.now() - start };
}
}
private db: any;
private cache: any;
private kafka: any;
private s3: any;
}
interface HealthStatus {
ok: boolean;
database?: { ok: boolean; latency: number };
cache?: { ok: boolean; latency: number };
critical?: Array<{ ok: boolean; latency: number }>;
optional?: Array<{ ok: boolean; latency: number }>;
cached?: boolean;
lastCheck?: number;
}
Dependency Health Aggregation
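Aggregate the health of every registered dependency into one status: run each check with a timeout, and report the service as unhealthy only when a critical dependency fails.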
class DependencyHealthAggregator {
private dependencies: Map<
string,
{ checker: () => Promise<boolean>; critical: boolean }
> = new Map();
registerDependency(
name: string,
checker: () => Promise<boolean>,
critical = false
): void {
this.dependencies.set(name, { checker, critical });
}
async getHealthStatus(): Promise<AggregatedHealth> {
const checks = Array.from(this.dependencies.entries()).map(
async ([name, { checker, critical }]) => {
const start = Date.now();
try {
const ok = await Promise.race([
checker(),
this.createTimeout(2000),
]);
return {
name,
ok,
latency: Date.now() - start,
critical,
timedOut: false,
};
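} catch (error) {
// The rejection comes from either the checker or the timeout; tell them apart
const message = (error as Error).message;
return {
name,
ok: false,
latency: Date.now() - start,
critical,
timedOut: message === 'Timeout',
error: message,
};
}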
}
);
const results = await Promise.all(checks);
// Aggregate
const criticalFailed = results.filter(r => r.critical && !r.ok);
const overallOk = criticalFailed.length === 0;
return {
ok: overallOk,
timestamp: new Date().toISOString(),
dependencies: results,
critical: criticalFailed.length === 0,
issues: criticalFailed.map(r => `${r.name} is down`),
};
}
private createTimeout(ms: number): Promise<boolean> {
return new Promise((_, reject) => {
setTimeout(() => reject(new Error('Timeout')), ms);
});
}
}
interface AggregatedHealth {
ok: boolean;
timestamp: string;
dependencies: Array<{
name: string;
ok: boolean;
latency: number;
critical: boolean;
timedOut: boolean;
error?: string;
}>;
critical: boolean;
issues: string[];
}
// Usage
const aggregator = new DependencyHealthAggregator();
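// Checkers must resolve to a boolean, so map each client's result explicitly
aggregator.registerDependency('database', () => db.query('SELECT 1').then(() => true), true);
aggregator.registerDependency(
'redis',
() => redis.ping().then(() => true),
true
);
aggregator.registerDependency('s3', () => s3.headBucket({ Bucket: 'my-bucket' }).then(() => true), false);
aggregator.registerDependency(
'external-api',
() => fetch('https://api.example.com/health').then(response => response.ok),
false
);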
app.get('/health', async (req, res) => {
const health = await aggregator.getHealthStatus();
res.status(health.ok ? 200 : 503).json(health);
});
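The aggregator above races each checker against a timer that keeps running after the check has finished. A minimal sketch of a variant that clears the timer and reports timeouts separately from checker errors follows. `CheckOutcome` and `checkWithTimeout` are hypothetical names introduced for illustration, and the 2000 ms value simply reuses the timeout from the code above.

```typescript
// Hypothetical result shape for a single dependency check.
interface CheckOutcome {
  ok: boolean;
  timedOut: boolean;
  error?: string;
}

// Runs one check with a deadline. The timer resolves to a sentinel value
// instead of rejecting, so a timeout cannot be confused with a checker error.
async function checkWithTimeout(
  check: () => Promise<boolean>,
  timeoutMs: number,
): Promise<CheckOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'timeout'>(resolve => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([check(), deadline]);
    if (result === 'timeout') {
      return { ok: false, timedOut: true };
    }
    return { ok: result, timedOut: false };
  } catch (error) {
    // The checker itself failed before the deadline.
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, timedOut: false, error: message };
  } finally {
    // Always clear the timer so it does not linger after a fast check.
    clearTimeout(timer);
  }
}

checkWithTimeout(async () => true, 2000).then(outcome => console.log(outcome));
// prints { ok: true, timedOut: false }
```

Separating the two failure modes makes it easier to tell a slow dependency from a broken one when reading the health response.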
Health Check as Operational Signal
Health checks should signal operational state, not just service availability.
class OperationalHealthCheck {
private state: 'HEALTHY' | 'DEGRADED' | 'FAILING' = 'HEALTHY';
private degradationReason = '';
async checkAndReport(): Promise<HealthReport> {
const cpuUsage = process.cpuUsage();
const memUsage = process.memoryUsage();
const uptime = process.uptime();
// Detect issues
const heapUsagePercent = (memUsage.heapUsed / memUsage.heapTotal) * 100;
const isMemoryHighUsage = heapUsagePercent > 80;
const isMemoryCritical = heapUsagePercent > 95;
// Check response latency
const avgLatency = await this.getAverageLatency();
const isLatencyHigh = avgLatency > 500;
// Update state
if (isMemoryCritical) {
this.state = 'FAILING';
this.degradationReason = 'Critical memory usage';
} else if (isMemoryHighUsage || isLatencyHigh) {
this.state = 'DEGRADED';
this.degradationReason = [
isMemoryHighUsage ? 'High memory usage' : '',
isLatencyHigh ? 'High latency' : '',
]
.filter(Boolean)
.join(', ');
} else {
this.state = 'HEALTHY';
this.degradationReason = '';
}
return {
status: this.state,
reason: this.degradationReason,
metrics: {
heapUsagePercent,
uptime,
averageLatency: avgLatency,
},
recommendation:
this.state === 'FAILING'
? 'Schedule restart immediately'
: this.state === 'DEGRADED'
? 'Monitor and prepare for restart'
: 'All systems normal',
};
}
private async getAverageLatency(): Promise<number> {
// Implementation: return average request latency
return 0;
}
}
interface HealthReport {
status: 'HEALTHY' | 'DEGRADED' | 'FAILING';
reason: string;
metrics: {
heapUsagePercent: number;
uptime: number;
averageLatency: number;
};
recommendation: string;
}
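One way to expose the three states over HTTP is sketched below. The mapping is an assumption rather than a rule from this article: it keeps a DEGRADED instance in rotation (200) and relies on the response body and alerting to surface the problem, while a FAILING instance answers 503. `OperationalState` and `httpStatusFor` are hypothetical names.

```typescript
type OperationalState = 'HEALTHY' | 'DEGRADED' | 'FAILING';

// Assumed policy: only FAILING takes the instance out of rotation.
function httpStatusFor(state: OperationalState): number {
  switch (state) {
    case 'HEALTHY':
      return 200;
    case 'DEGRADED':
      // Still serving; the body carries the reason for dashboards and alerts.
      return 200;
    case 'FAILING':
      return 503;
  }
}

console.log(httpStatusFor('DEGRADED')); // 200
console.log(httpStatusFor('FAILING')); // 503
```

Returning 503 for DEGRADED as well is a stricter alternative, at the cost of removing capacity at the moment the service is already under pressure.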
Health Check Caching to Avoid Thundering Herd
Many instances checking dependencies simultaneously can overwhelm them.
class CachedHealthChecker {
private cache = new Map<string, { result: boolean; timestamp: number }>();
private cacheInterval = 5000; // 5 seconds
private checking = new Map<string, Promise<boolean>>();
async checkHealth(key: string, checker: () => Promise<boolean>): Promise<boolean> {
const cached = this.cache.get(key);
const now = Date.now();
// Return cached if fresh
if (cached && now - cached.timestamp < this.cacheInterval) {
return cached.result;
}
// If already checking, return that promise
if (this.checking.has(key)) {
return this.checking.get(key)!;
}
// Perform check
const promise = checker()
.then(result => {
// Stamp the entry when the check finishes, so a slow check is not stored as already stale
this.cache.set(key, { result, timestamp: Date.now() });
this.checking.delete(key);
return result;
})
.catch(error => {
// Keep last known state on error
const lastKnown = cached?.result ?? true;
this.checking.delete(key);
return lastKnown;
});
this.checking.set(key, promise);
return promise;
}
}
// Usage with jitter to prevent thundering herd
class HealthCheckWithJitter {
private baseInterval = 10000; // 10 seconds
private jitterRange = 2000; // +/- 1 second
startHealthChecks(checker: () => Promise<void>): void {
// Schedule each run with a fresh timeout, so the jitter applies to every
// interval and runs never overlap
const scheduleNext = (): void => {
const jitter = (Math.random() - 0.5) * this.jitterRange;
setTimeout(run, this.baseInterval + jitter);
};
const run = (): void => {
checker()
.catch(err => console.error('Health check failed:', err))
.then(scheduleNext);
};
// Randomize start time so instances do not all check at once
const startDelay = Math.random() * this.jitterRange;
setTimeout(run, startDelay);
}
}
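The "if already checking, return that promise" step above can be isolated as a small single-flight wrapper. This is a sketch under stated assumptions: `singleFlight` and `demo` are hypothetical names, and the 10 ms delay stands in for a real dependency call.

```typescript
// Wraps an async function so that concurrent callers share one in-flight
// call instead of each starting their own.
function singleFlight<T>(work: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) {
      return inFlight;
    }
    const started = work();
    inFlight = started;
    // Clear the slot once the call settles, whether it succeeded or failed.
    const clear = (): void => {
      inFlight = null;
    };
    started.then(clear, clear);
    return started;
  };
}

// Demo: three concurrent health requests trigger a single dependency call.
async function demo(): Promise<number> {
  let calls = 0;
  const pingDatabase = async (): Promise<boolean> => {
    calls += 1;
    await new Promise(resolve => setTimeout(resolve, 10)); // simulated latency
    return true;
  };
  const sharedPing = singleFlight(pingDatabase);
  await Promise.all([sharedPing(), sharedPing(), sharedPing()]);
  return calls;
}

demo().then(calls => console.log(`dependency calls: ${calls}`));
// prints "dependency calls: 1"
```

Single-flight only deduplicates requests inside one process. Spreading load across many instances still needs the caching and jitter shown above.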
Graceful Degradation
Don't fail completely when a dependency is slow; degrade gracefully.
class GracefulDegradation {
async handleWithFallback<T>(
primary: () => Promise<T>,
fallback: () => Promise<T>,
timeout = 2000
): Promise<T> {
try {
return await Promise.race([
primary(),
this.createTimeout(timeout),
]);
} catch (error) {
console.warn('Primary failed, using fallback:', error);
return fallback();
}
}
async handleWithDegradation<T>(
operation: () => Promise<T>,
degradedVersion: () => Promise<Partial<T>>,
timeout = 2000
): Promise<T | Partial<T>> {
try {
return await Promise.race([
operation(),
this.createTimeout(timeout),
]);
} catch (error) {
console.warn('Operation failed or timed out, returning degraded response:', error);
return degradedVersion();
}
}
private createTimeout(ms: number): Promise<never> {
return new Promise((_, reject) => {
setTimeout(() => reject(new Error('Timeout')), ms);
});
}
}
// Example: Product catalog with graceful degradation
class ProductCatalog {
async getProduct(productId: string): Promise<Product | Partial<Product>> {
return this.degradation.handleWithDegradation(
() => this.getFullProduct(productId), // Full product with recommendations
() => this.getBasicProduct(productId) // Just ID, name, price
);
}
private async getFullProduct(productId: string): Promise<Product> {
const [product, reviews, recommendations] = await Promise.all([
this.db.query('SELECT * FROM products WHERE id = $1', [productId]),
this.reviewService.getReviews(productId),
this.recommendationEngine.getRelated(productId),
]);
return { ...product, reviews, recommendations };
}
private async getBasicProduct(productId: string): Promise<Partial<Product>> {
const product = await this.db.query('SELECT id, name, price FROM products WHERE id = $1', [
productId,
]);
return product;
}
private degradation = new GracefulDegradation();
private db: any;
private reviewService: any;
private recommendationEngine: any;
}
interface Product {
id: string;
name: string;
price: number;
reviews?: any[];
recommendations?: any[];
}
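Callers often need to know whether they received the full or the degraded response. A minimal sketch that tags the result is shown below. `fetchOrDegrade` and the `degraded` flag are hypothetical additions, not part of the classes above, and the 2000 ms default mirrors the timeout used earlier.

```typescript
// Hypothetical wrapper: returns the data plus a flag saying which path ran.
async function fetchOrDegrade<T>(
  full: () => Promise<T>,
  basic: () => Promise<Partial<T>>,
  timeoutMs = 2000,
): Promise<{ data: T | Partial<T>; degraded: boolean }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  try {
    const data = await Promise.race([full(), deadline]);
    return { data, degraded: false };
  } catch {
    // Full path failed or was too slow: fall back to the cheaper version.
    return { data: await basic(), degraded: true };
  } finally {
    clearTimeout(timer);
  }
}

fetchOrDegrade(
  async () => ({ id: 'p1', name: 'Lamp', price: 20, reviews: ['Great'] }),
  async () => ({ id: 'p1', name: 'Lamp', price: 20 }),
).then(result => console.log(result.degraded)); // false
```

The flag lets an API handler add a response header or a metric for degraded responses, which keeps silent degradation visible to operators.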
Checklist
- Implement liveness, readiness, and startup probes for Kubernetes
- Check only critical dependencies in readiness probes
- Cache health check results to prevent thundering herd
- Set timeouts on dependency checks (2-3 seconds max)
- Aggregate health across dependencies intelligently
- Use graceful degradation for non-critical features
- Signal operational issues (memory, CPU) in health responses
- Monitor health check response times
- Test health checks under load
- Document what each health check actually validates
Conclusion
Health checks are your system's heartbeat. They drive orchestration decisions, alerting, and failover. Keep them simple and fast—test only what's necessary to know you can serve traffic. Cache results to avoid overwhelming dependencies. When a dependency is slow, degrade gracefully rather than failing completely. And always remember: a health check is an operational signal, not just a yes/no answer.