Auto-Scaling Gone Wrong — When Your Scaler Makes Things Worse

Introduction

Auto-scaling promises to handle any traffic spike automatically. But a misconfigured scaler can make incidents worse: it thrashes — scaling up, then immediately down, then up again every few minutes. Or it scales too slowly, spinning up instances 5 minutes after the spike. Or it scales to 200 instances, exhausting your database connection pool (20 connections × 200 instances = 4,000 connections > PostgreSQL's max of 100).

Auto-scaling gone wrong is often more dangerous than no auto-scaling at all.

Problem 1: Scale Thrashing

The most common problem — scaler oscillates up and down continuously:

09:00  CPU 85% → scale up to 10 pods
09:03  CPU 30% → scale down to 3 pods
09:05  CPU 90% → scale up to 12 pods
09:08  CPU 25% → scale down to 3 pods
...repeats every 3 minutes

During each scale-down, cold starts hit your few remaining pods. Users feel the impact. Scale-up triggers again. Loop.

Fix: Stabilization Windows

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up again
      policies:
        - type: Pods
          value: 4                      # Add max 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 MINUTES before scaling down
      policies:
        - type: Percent
          value: 10                     # Remove max 10% of pods at a time
          periodSeconds: 60

Problem 2: Scaling Too Slowly

Default CPU-based HPA reacts slowly because CPU averages lag behind actual load:

Traffic spike at 09:00
HPA evaluates every 15s → sees CPU spike at 09:00:15
Scale-up decision at 09:00:15
New pods start at 09:03 (3 min startup + health check time)
Spike was over by 09:02

Result: scaling happened AFTER the spike. Useless.

Fix: Scale on Request Latency or Queue Depth (not just CPU)

# Scale on request rate or latency using KEDA (Kubernetes Event-Driven Autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    # Scale on queue depth (react instantly when work piles up)
    - type: redis
      metadata:
        address: redis:6379
        listName: request-queue
        listLength: "100"     # Target ~100 queued items per replica

    # Scale on HTTP request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "500"      # Target ~500 RPS per replica
        query: rate(http_requests_total[1m])
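
The scaling math behind a trigger like this is roughly "queue length divided by the per-replica target, clamped to min/max". A hedged sketch of that arithmetic (a simplified model of KEDA-style scaling, not KEDA's actual code):

```javascript
// Simplified event-driven scaling math (assumption: KEDA-style
// desired = ceil(metricValue / targetPerReplica), clamped to bounds)
function desiredReplicas(queueLength, targetPerReplica, min, max) {
  const want = Math.ceil(queueLength / targetPerReplica)
  return Math.min(max, Math.max(min, want))
}

console.log(desiredReplicas(850, 100, 3, 50))  // 9 replicas for 850 queued items
console.log(desiredReplicas(20, 100, 3, 50))   // clamped up to minReplicaCount: 3
```

Because the queue depth jumps the moment work piles up, this reacts at the first evaluation after the spike instead of waiting for CPU averages to catch up.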

Problem 3: Connection Pool Exhaustion at Scale

Your app: max DB connections = 20 per instance
Scaled to: 200 instances (auto-scaled during traffic spike)
Total connections: 200 × 20 = 4,000 connections

PostgreSQL max_connections: 100

Result: 3,900 connection attempts fail → app crashes
Ironic: auto-scaling caused the crash it was supposed to prevent

Fix: Scale-Aware Connection Pooling

// Dynamically adjust pool size based on number of replicas
import { Pool } from 'pg'

const TOTAL_DB_CONNECTIONS = 80  // Leave some for admin tools
const REPLICA_COUNT = parseInt(process.env.REPLICA_COUNT || '1', 10)
const poolSize = Math.max(2, Math.floor(TOTAL_DB_CONNECTIONS / REPLICA_COUNT))

const pool = new Pool({
  max: poolSize,
  min: 1,
})

console.log(`DB pool size: ${poolSize} (${REPLICA_COUNT} replicas)`)
# Set replica count as an environment variable. Caveat: the HPA does not
# stamp this annotation on pods by default, so your deploy tooling must
# maintain it, and env vars are read only once, at pod start.
env:
  - name: REPLICA_COUNT
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['autoscaler.alpha.kubernetes.io/current-replicas']

Better fix: PgBouncer proxy

All instances → PgBouncer (manages real DB connections)
PgBouncer → PostgreSQL (fixed 20 real connections)

200 app instances each "claim" connections from PgBouncer
PgBouncer multiplexes to only 20 real DB connections
DB connection count stays constant regardless of scale
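
A minimal PgBouncer configuration for this setup might look like the following sketch (host names and sizes are illustrative assumptions, not values from a real deployment):

```ini
; Minimal sketch of pgbouncer.ini for scale-aware pooling
[databases]
appdb = host=postgres port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; release the server connection after each transaction
default_pool_size = 20       ; real PostgreSQL connections, fixed
max_client_conn = 4000       ; 200 instances x 20 client connections
```

Transaction pooling is what makes the multiplexing work: each app instance holds client connections to PgBouncer, but a real server connection is only occupied for the duration of a transaction.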

Problem 4: Scaling Metrics That Lag Reality

CPU is a lagging indicator — it rises after your app is already overwhelmed:

// ✅ Expose custom metrics that reflect load BEFORE CPU spikes
import express from 'express'
import promClient from 'prom-client'

const app = express()

const activeRequests = new promClient.Gauge({
  name: 'http_active_requests',
  help: 'Number of requests currently being processed',
})

const requestQueueDepth = new promClient.Gauge({
  name: 'request_queue_depth',
  help: 'Requests waiting to be processed',
})

app.use((req, res, next) => {
  activeRequests.inc()
  res.on('finish', () => activeRequests.dec())
  next()
})

// Expose for Prometheus scraping
app.get('/metrics', (req, res) => {
  res.set('Content-Type', promClient.register.contentType)
  promClient.register.metrics().then(m => res.end(m))
})
# HPA on custom metric: active requests per pod
metrics:
  - type: Pods
    pods:
      metric:
        name: http_active_requests
      target:
        type: AverageValue
        averageValue: "50"  # Scale when average > 50 active requests per pod
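
The HPA's core algorithm, as documented by Kubernetes, is desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue). Applied to the active-requests metric above:

```javascript
// The HPA core formula (from the Kubernetes HPA documentation):
// desired = ceil(currentReplicas * currentMetricValue / targetValue)
function hpaDesired(currentReplicas, avgActiveRequests, targetPerPod) {
  return Math.ceil((currentReplicas * avgActiveRequests) / targetPerPod)
}

// 5 pods each averaging 120 active requests, target 50 per pod:
console.log(hpaDesired(5, 120, 50)) // 12
```

Because active requests rise at the same instant load arrives, the ratio (and thus the replica recommendation) moves immediately, instead of minutes later like a CPU average.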

Problem 5: Slow Node Provisioning

HPA adds pods, but there's no node to place them on, and the cluster autoscaler takes 2-5 minutes to provision one:

# AWS EKS: Use Karpenter for fast node provisioning
# Karpenter can provision nodes in <30 seconds

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.large", "t3.xlarge"]
  limits:
    resources:
      cpu: "1000"
  ttlSecondsAfterEmpty: 30   # Remove empty nodes quickly
# Pre-provision "buffer" capacity
# Standard pattern (not a Karpenter field): run low-priority "pause"
# placeholder pods sized to ~2 spare nodes. Real workloads evict them
# instantly, and Karpenter keeps warm nodes up to reschedule them.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10            # Lower than any real workload, so evicted first

Problem 6: Scale-Up But Not Scale-Down

Opposite problem — pods scale up but never scale down. You pay for 50 pods at 2 AM:

# Ensure scale-down is enabled and configured
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # 5 min cooldown
    policies:
      - type: Percent
        value: 25       # Remove max 25% at a time
        periodSeconds: 120
      - type: Pods
        value: 2        # Or max 2 pods at a time
        periodSeconds: 120
    selectPolicy: Min   # Use the more conservative policy

Auto-Scaling Tuning Checklist

Parameter                   Default   Recommendation
Scale-up stabilization      0s        30-60s
Scale-down stabilization    300s      300-600s
Max scale-up rate           100%      50-100% per minute
Max scale-down rate         100%      10-25% per minute
Metric: CPU target          80%       50-60%
Metric type                 CPU       CPU + request rate + queue
Min replicas                1         3 (for HA)

Conclusion

Auto-scaling is not plug-and-play. It requires careful tuning of stabilization windows (to prevent thrashing), scale-up rate limits, the right metrics (request rate is better than CPU for APIs), and awareness of downstream resource limits like database connections. Get it right and auto-scaling is genuinely magical. Get it wrong and it creates incidents on its own. Tune these parameters based on your application's actual behavior patterns.