Load Balancer Misconfiguration — The Hidden Single Point of Failure
Sanjeev Sharma (@webcoderspeed1)
Introduction
Your load balancer is supposed to distribute traffic evenly across 5 servers. Instead, server #1 is at 90% CPU while the others sit at 5%. Or worse, the LB is routing to a crashed server, causing 30% of requests to fail. Your metrics say everything is fine.
Load balancer misconfigurations are invisible until they cause major incidents.
- Misconfiguration 1: Session Stickiness Without Awareness
- Misconfiguration 2: Wrong Health Check
- Misconfiguration 3: No Connection Draining on Deploy
- Misconfiguration 4: Timeout Mismatch
- Misconfiguration 5: Round Robin on Unequal Servers
- Misconfiguration 6: Forgetting Slow Start After Deploy
- Monitoring Load Balancer Health
- AWS ALB Common Misconfigurations
- Load Balancer Configuration Checklist
- Conclusion
Misconfiguration 1: Session Stickiness Without Awareness
IP-based stickiness sends all users from the same IP to the same server:
```nginx
# ❌ IP hash — all users behind a corporate NAT hit server #1
upstream backend {
    ip_hash;
    server app1:3000;
    server app2:3000;
    server app3:3000;
}
# Corporate office with 10,000 employees → same IP → server #1 is HOT
```

```nginx
# ✅ FIX: cookie-based stickiness (the cookie is per browser, not per IP,
# so distribution stays even). Note: 'sticky' is an nginx Plus directive;
# open-source nginx needs a third-party module for it.
upstream backend {
    server app1:3000;
    server app2:3000;
    server app3:3000;
    sticky cookie srv_id expires=1h path=/;
}
```

```nginx
# ✅ BETTER FIX: make your app stateless and eliminate the need for
# stickiness entirely. Store sessions in Redis, not in-process;
# then any server can handle any request.
upstream backend {
    least_conn;  # Always route to the server with the fewest active connections
    server app1:3000;
    server app2:3000;
    server app3:3000;
}
```
Misconfiguration 2: Wrong Health Check
The LB's health check doesn't match actual application health:
```nginx
# ❌ Health check only tests that the port is open
# (the 'check' directives below require the upstream check module,
# e.g. Tengine's nginx_upstream_check_module; they are not stock nginx)
upstream backend {
    server app1:3000;
    check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
    # TCP check: just tests if the port accepts connections
    # But your app might be up yet broken internally!
}
```

```nginx
# ✅ Check actual application health over HTTP
upstream backend {
    server app1:3000;
    check interval=3000 rise=2 fall=3 timeout=2000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: localhost\r\n\r\n";
    check_http_expect_alive http_2xx;
}
```
```javascript
// Make your health check actually verify what matters
app.get('/health', async (req, res) => {
  try {
    // Check DB connectivity
    await db.query('SELECT 1')
    // Check Redis connectivity
    await redis.ping()
    // Check memory is OK
    const heapMB = process.memoryUsage().heapUsed / 1024 / 1024
    if (heapMB > 800) {
      return res.status(503).json({ status: 'unhealthy', reason: 'high memory' })
    }
    res.status(200).json({ status: 'healthy', heapMB: heapMB.toFixed(0) })
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message })
  }
})
```
Misconfiguration 3: No Connection Draining on Deploy
Without draining, the LB stops routing to a server mid-deploy and in-flight requests to that server are dropped.

```yaml
# ✅ Kubernetes: configure connection draining
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Give 60s to finish requests
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # Wait 15s before SIGTERM
# The preStop sleep lets the load balancer deregister the pod
# before shutdown starts
```

```nginx
# ✅ Nginx: drain a server before stopping it
upstream backend {
    server app1:3000;
    server app2:3000 down;  # Mark 'down' before removing: nginx sends
                            # no new requests, in-flight requests finish
}
```
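On the application side, the same idea can be sketched with an in-process drain tracker (an illustrative sketch with made-up names, not a specific library): once draining starts, reject new requests, and exit only when in-flight work reaches zero.

```javascript
// Minimal in-process drain tracker (illustrative sketch).
class Drainer {
  constructor() {
    this.inFlight = 0
    this.draining = false
  }
  // Call at request start; returns false if the request should be
  // rejected with 503 because shutdown has begun
  begin() {
    if (this.draining) return false
    this.inFlight += 1
    return true
  }
  // Call when a response finishes
  end() {
    this.inFlight = Math.max(0, this.inFlight - 1)
  }
  // Call on SIGTERM; safe to exit once this returns true
  startDraining() {
    this.draining = true
    return this.inFlight === 0
  }
}

// Wiring sketch (commented out because it needs a running server):
// process.on('SIGTERM', () => {
//   server.close()           // stop accepting new connections
//   drainer.startDraining()  // reject new work, let in-flight finish
// })
```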
Misconfiguration 4: Timeout Mismatch
```nginx
# ❌ LB timeout shorter than app timeout
# LB gives up at 30s, but the app was going to respond in 35s
# LB retries → double execution risk
upstream backend {
    server app1:3000;
    keepalive 32;
}
server {
    proxy_connect_timeout 5s;
    proxy_read_timeout 30s;  # LB timeout
    proxy_send_timeout 30s;
}
```

```javascript
// ❌ App sets timeout to 60s — longer than LB's 30s
app.use((req, res, next) => {
  req.setTimeout(60_000) // Never actually reached — LB cuts at 30s
  next()
})
```

```nginx
# ✅ App timeout SHORTER than LB timeout
server {
    proxy_read_timeout 35s;  # LB: 35s
}
```

```javascript
// App: 30s — gives the LB time to receive the error response
app.use((req, res, next) => {
  req.setTimeout(30_000, () => {
    res.status(504).json({ error: 'Request timeout' })
  })
  next()
})
```
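Because a timed-out request may be retried by the LB, guard side effects with an idempotency key. A minimal in-memory sketch (illustrative names; a real deployment would store keys in Redis with a TTL so every server behind the LB shares them):

```javascript
// Idempotency guard: run a side effect at most once per client-supplied key.
// In-memory Map for illustration; production would use Redis SET NX EX
// so all servers see the same keys.
const seen = new Map()

function runOnce(idempotencyKey, fn) {
  if (seen.has(idempotencyKey)) {
    // A retry of a request we already processed: replay the stored result
    return { replayed: true, result: seen.get(idempotencyKey) }
  }
  const result = fn()
  seen.set(idempotencyKey, result)
  return { replayed: false, result }
}

// An LB retry re-sends the same key, so the charge runs only once
let charges = 0
const charge = () => { charges += 1; return 'charge-ok' }
const first = runOnce('order-123', charge)
const retry = runOnce('order-123', charge)
// charges === 1; retry.replayed === true
```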
Misconfiguration 5: Round Robin on Unequal Servers
```nginx
# ❌ Round robin treats all servers equally
# But server #3 has half the RAM and is always overwhelmed
upstream backend {
    server app1:3000;  # 16GB RAM, 8 cores
    server app2:3000;  # 16GB RAM, 8 cores
    server app3:3000;  # 4GB RAM, 2 cores — gets the same load!
}
```

```nginx
# ✅ Weight based on capacity
upstream backend {
    server app1:3000 weight=4;  # Gets 4x the traffic of app3
    server app2:3000 weight=4;
    server app3:3000 weight=1;  # Gets 1/9 of traffic
}
```

```nginx
# ✅ Or least_conn — automatically routes to the least-loaded server
upstream backend {
    least_conn;
    server app1:3000;
    server app2:3000;
    server app3:3000;
}
```
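The 4/4/1 weighting can be sanity-checked with a quick simulation (a deterministic toy, not nginx's smooth weighted round-robin algorithm, though the long-run ratios come out the same):

```javascript
// Simple weighted pick over 9000 requests: app3 should get exactly 1/9.
const pool = [
  { name: 'app1', weight: 4 },
  { name: 'app2', weight: 4 },
  { name: 'app3', weight: 1 },
]
const totalWeight = pool.reduce((s, p) => s + p.weight, 0) // 9

function pick(i) {
  // Deterministic: walk the cumulative weights using i mod totalWeight
  let slot = i % totalWeight
  for (const p of pool) {
    if (slot < p.weight) return p.name
    slot -= p.weight
  }
}

const hits = { app1: 0, app2: 0, app3: 0 }
for (let i = 0; i < 9000; i++) hits[pick(i)] += 1
// app1: 4000, app2: 4000, app3: 1000 — app3 gets 1/9 of traffic
```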
Misconfiguration 6: Forgetting Slow Start After Deploy
A freshly deployed server starts with cold caches and empty connection pools; hitting it with full traffic immediately can overwhelm it.
```nginx
# ✅ Slow start — gradually ramp the new server instead of full traffic
# (note: slow_start is an nginx Plus parameter)
upstream backend {
    server app1:3000 slow_start=30s;  # Ramp to full weight over 30s
    server app2:3000 slow_start=30s;
    server app3:3000;  # Existing server — no slow start
}
```
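The ramp itself is easy to picture. A hypothetical linear version (an illustrative function approximating what a 30s slow start does to a server's effective weight):

```javascript
// Linear slow-start ramp: effective weight grows from 0 to full over rampMs.
// Illustrative approximation of a 30s slow start for a recovering server.
function effectiveWeight(fullWeight, msSinceUp, rampMs = 30_000) {
  if (msSinceUp >= rampMs) return fullWeight
  return fullWeight * (msSinceUp / rampMs)
}

// A weight-4 server just brought up:
// at 0s it takes no traffic, at 15s half its share, at 30s its full share
effectiveWeight(4, 0)      // 0
effectiveWeight(4, 15_000) // 2
effectiveWeight(4, 30_000) // 4
```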
Monitoring Load Balancer Health
```bash
# Check nginx status (requires the stub_status module enabled on this path)
curl http://localhost/nginx_status

# Per-server up/down state and request counts need nginx Plus's status API
# or the check module's status page; the endpoint path varies by setup
curl http://localhost/upstream_status

# Per-server request distribution should be roughly equal;
# if one server is getting 80%, check for stickiness or weight issues
```

```javascript
const os = require('os')

// Log which server handled each request (for debugging)
app.use((req, res, next) => {
  res.setHeader('X-Served-By', os.hostname())
  next()
})
// Check distribution from logs:
// you should see roughly equal counts across hostnames
```
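To check that distribution from your logs, tally the X-Served-By values. A small helper (illustrative; the 50% hot-spot threshold is an arbitrary choice):

```javascript
// Count how many requests each backend served and flag a hot spot
// (any single host handling more than half the traffic).
function distribution(servedByValues) {
  const counts = {}
  for (const host of servedByValues) counts[host] = (counts[host] || 0) + 1
  const total = servedByValues.length
  const hot = Object.entries(counts).find(([, n]) => n / total > 0.5)
  return { counts, hotSpot: hot ? hot[0] : null }
}

// Healthy: roughly even across hosts
distribution(['a', 'b', 'c', 'a', 'b', 'c']).hotSpot // null

// Sticky-session hot spot: one host takes most of the traffic
distribution(['a', 'a', 'a', 'a', 'b', 'c']).hotSpot // 'a'
```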
AWS ALB Common Misconfigurations
```javascript
// ALB health check path must return 200
// Common mistake: returning a 301/302 redirect
app.get('/health', (req, res) => {
  // ❌ Never redirect health checks
  // res.redirect('/api/health')
  // ✅ Always return 200 directly
  res.status(200).json({ status: 'ok' })
})
```

```javascript
// ALB idle timeout vs app keepalive
// ALB default: 60s idle timeout
// Node.js default: 5s keepAliveTimeout
// Fix: set the Node.js keepalive timeout > ALB idle timeout
const server = app.listen(3000)
server.keepAliveTimeout = 65_000 // Slightly more than ALB's 60s
server.headersTimeout = 66_000   // Must be > keepAliveTimeout
```
Load Balancer Configuration Checklist
- ✅ Health checks verify actual app functionality (DB + Redis)
- ✅ Session stickiness only if stateful (prefer stateless + Redis)
- ✅ least_conn algorithm for APIs with variable response times
- ✅ Weights configured for heterogeneous servers
- ✅ App timeout < LB timeout (no duplicate requests)
- ✅ Connection draining on deploy (terminationGracePeriodSeconds)
- ✅ Slow start for new servers after deployment
- ✅ Node.js keepAliveTimeout > ALB/ELB idle timeout
- ✅ Monitor per-server request distribution
Conclusion
Load balancer misconfigurations are uniquely dangerous because they're invisible in small-scale testing and only show up under production conditions. A misconfigured health check routes traffic to dead servers. Wrong stickiness creates hot spots. Mismatched timeouts cause duplicate request processing. Audit your LB config before every major traffic event — it's one of the highest-leverage reliability investments you can make.