Load Balancer Misconfiguration — The Hidden Single Point of Failure

Introduction

Your load balancer is supposed to distribute traffic evenly across 5 servers. Instead, server #1 is at 90% CPU while the others sit at 5%. Or worse, the LB is routing to a crashed server, causing 30% of requests to fail. Your metrics say everything is fine.

Load balancer misconfigurations are invisible until they cause major incidents.

Misconfiguration 1: Session Stickiness Without Awareness

IP-based stickiness sends all users from the same IP to the same server:

# ❌ IP hash — all users behind a corporate NAT hit server #1
upstream backend {
  ip_hash;
  server app1:3000;
  server app2:3000;
  server app3:3000;
}
# Corporate office with 10,000 employees → same IP → server #1 is HOT
# ✅ FIX: Use cookie-based stickiness (more even distribution)
# Note: the sticky directive requires NGINX Plus or a third-party module
upstream backend {
  server app1:3000;
  server app2:3000;
  server app3:3000;
  sticky cookie srv_id expires=1h path=/;
}
# ✅ BETTER FIX: Make your app stateless — eliminate stickiness need
# Store sessions in Redis, not in-process
# Then any server can handle any request
upstream backend {
  least_conn;  # Always route to server with fewest active connections
  server app1:3000;
  server app2:3000;
  server app3:3000;
}

Misconfiguration 2: Wrong Health Check

The LB's health check doesn't match actual application health:

# ❌ Health check only tests the TCP port
# (the check directives here come from the third-party
#  nginx_upstream_check_module; stock open-source nginx lacks active checks)
upstream backend {
  server app1:3000;
  check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
  # TCP check: just tests whether the port is open
  # But your app might be up and still broken internally!
}
# ✅ Check actual application health
upstream backend {
  server app1:3000;
  check interval=3000 rise=2 fall=3 timeout=2000 type=http;
  check_http_send "GET /health HTTP/1.0\r\nHost: localhost\r\n\r\n";
  check_http_expect_alive http_2xx;
}
// Make your health check actually verify what matters
app.get('/health', async (req, res) => {
  try {
    // Check DB connectivity
    await db.query('SELECT 1')

    // Check Redis connectivity
    await redis.ping()

    // Check memory is OK
    const heapMB = process.memoryUsage().heapUsed / 1024 / 1024
    if (heapMB > 800) {
      return res.status(503).json({ status: 'unhealthy', reason: 'high memory' })
    }

    res.status(200).json({ status: 'healthy', heapMB: heapMB.toFixed(0) })
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message })
  }
})
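One subtlety: the LB check above gives up after 2s, so each sub-check inside `/health` should fail faster than that rather than hang on a dead dependency. A minimal sketch — `withTimeout` is a local helper defined here, not a library function:

```javascript
// Race each dependency check against a deadline shorter than the LB's
// check timeout, so a hung DB connection becomes a fast 503 instead of
// a stalled health check.
function withTimeout(promise, ms, label) {
  let timer
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} check timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer))
}

// Inside the handler above (db/redis are your existing clients):
//   await withTimeout(db.query('SELECT 1'), 1000, 'db')
//   await withTimeout(redis.ping(), 500, 'redis')
```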

Misconfiguration 3: No Connection Draining on Deploy

# ❌ Nginx just stops sending to server immediately during deploy
# In-flight requests to that server are dropped!

# ✅ Kubernetes: configure connection draining
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Give 60s to finish requests
      containers:
        - lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # Wait 15s before SIGTERM
                # This drains load balancer before shutdown starts
# Nginx: take a server out of rotation before stopping it
# Marking it `down` stops new requests; in-flight requests finish
upstream backend {
  server app1:3000;
  server app2:3000 down;  # Drain app2, deploy, then remove `down` and reload
}

Misconfiguration 4: Timeout Mismatch

# ❌ LB timeout shorter than app timeout
# LB gives up at 30s, but app was going to respond in 35s
# LB retries → double execution risk

upstream backend {
  server app1:3000;
  keepalive 32;
}

server {
  proxy_connect_timeout 5s;
  proxy_read_timeout 30s;    # LB timeout
  proxy_send_timeout 30s;
}
// ❌ App sets timeout to 60s — longer than LB's 30s
app.use((req, res, next) => {
  req.setTimeout(60_000)  // Never actually reached — LB cuts at 30s
  next()
})
# ✅ App timeout SHORTER than LB timeout
server {
  proxy_read_timeout 35s;  # LB: 35s
}
// App: 30s — gives LB time to get the error response
app.use((req, res, next) => {
  req.setTimeout(30_000, () => {
    res.status(504).json({ error: 'Request timeout' })
  })
  next()
})

Misconfiguration 5: Round Robin on Unequal Servers

# ❌ Round robin treats all servers equally
# But server #3 has half the RAM and is always overwhelmed
upstream backend {
  server app1:3000;   # 16GB RAM, 8 cores
  server app2:3000;   # 16GB RAM, 8 cores
  server app3:3000;   # 4GB RAM, 2 cores — gets same load!
}
# ✅ Weight based on capacity
upstream backend {
  server app1:3000 weight=4;  # Gets 4x the traffic
  server app2:3000 weight=4;
  server app3:3000 weight=1;  # Gets 1/9 of traffic
}

# ✅ Or least_conn — automatically sends to least loaded server
upstream backend {
  least_conn;
  server app1:3000;
  server app2:3000;
  server app3:3000;
}

Misconfiguration 6: Forgetting Slow Start After Deploy

# ✅ Slow start — gradually ramp a returning server instead of full traffic
# Note: slow_start requires NGINX Plus
upstream backend {
  server app1:3000 slow_start=30s;  # Ramp to full weight over 30s
  server app2:3000 slow_start=30s;
  server app3:3000;  # No slow_start — receives full traffic immediately
}

Monitoring Load Balancer Health

# Check nginx status (requires the stub_status module)
curl http://localhost/nginx_status

# Check which servers are up/down (endpoint provided by the check module)
curl http://localhost/upstream_status

# Per-server request distribution (should be roughly equal)
# If one server is getting 80% → check for stickiness or weight issues
// Log which server handled each request (for debugging)
const os = require('node:os')

app.use((req, res, next) => {
  res.setHeader('X-Served-By', os.hostname())
  next()
})

// Check distribution from logs
// Should see roughly equal distribution across hostnames
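The distribution check itself can be scripted. A sketch, assuming an access-log format that records the X-Served-By value as `served_by=<host>` (adapt the regex to your own log format):

```javascript
// Tally per-server request counts from access-log lines.
// The served_by=... field is an assumed log format, not a standard one.
function serverDistribution(logLines) {
  const counts = {}
  for (const line of logLines) {
    const m = line.match(/served_by=(\S+)/)
    if (m) counts[m[1]] = (counts[m[1]] || 0) + 1
  }
  return counts
}

const lines = [
  'GET /api/users 200 served_by=app1',
  'GET /api/users 200 served_by=app2',
  'GET /api/orders 200 served_by=app1',
]
console.log(serverDistribution(lines))  // { app1: 2, app2: 1 }
```

If one hostname dominates the counts, revisit stickiness, weights, or health-check flapping.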

AWS ALB Common Misconfigurations

// ALB health check path must return 200
// Common mistake: returning 301/302 redirect
app.get('/health', (req, res) => {
  // ❌ Never redirect health checks
  // res.redirect('/api/health')

  // ✅ Always return 200 directly
  res.status(200).json({ status: 'ok' })
})

// ALB idle timeout vs app keepalive
// ALB default: 60s idle timeout
// Node.js default: 5s keepalive
// Fix: set Node.js keepalive > ALB idle timeout
const server = app.listen(3000)
server.keepAliveTimeout = 65_000    // Slightly more than ALB's 60s
server.headersTimeout = 66_000      // Must be > keepAliveTimeout

Load Balancer Configuration Checklist

  • ✅ Health checks verify actual app functionality (DB + Redis)
  • ✅ Session stickiness only if stateful (prefer stateless + Redis)
  • ✅ least_conn algorithm for APIs with variable response times
  • ✅ Weights configured for heterogeneous servers
  • ✅ App timeout < LB timeout (no duplicate requests)
  • ✅ Connection draining on deploy (terminationGracePeriodSeconds)
  • ✅ Slow start for new servers after deployment
  • ✅ Node.js keepAliveTimeout > ALB/ELB idle timeout
  • ✅ Monitor per-server request distribution

Conclusion

Load balancer misconfigurations are uniquely dangerous because they're invisible in small-scale testing and only show up under production conditions. A misconfigured health check routes traffic to dead servers. Wrong stickiness creates hot spots. Mismatched timeouts cause duplicate request processing. Audit your LB config before every major traffic event — it's one of the highest-leverage reliability investments you can make.