Auto-Scaling Gone Wrong: When Your Scaler Makes Things Worse
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Auto-scaling promises to handle any traffic spike automatically. But a misconfigured scaler can make incidents worse. It thrashes: scaling up, then immediately down, then up again every few minutes. Or it scales too slowly, spinning up instances 5 minutes after the spike. Or it scales to 200 instances and exhausts your database connection pool (20 connections × 200 instances = 4,000 connections, far beyond PostgreSQL's default max_connections of 100).
Auto-scaling gone wrong is often more dangerous than no auto-scaling at all.
- Problem 1: Scale Thrashing
- Problem 2: Scaling Too Slowly
- Problem 3: Connection Pool Exhaustion at Scale
- Problem 4: Scaling Metrics That Lag Reality
- Problem 5: Slow Node Provisioning
- Problem 6: Scale-Up But Not Scale-Down
- Auto-Scaling Tuning Checklist
- Conclusion
Problem 1: Scale Thrashing
The most common failure mode: the scaler oscillates up and down continuously:
09:00 — CPU 85% → scale up to 10 pods
09:03 — CPU 30% → scale down to 3 pods
09:05 — CPU 90% → scale up to 12 pods
09:08 — CPU 25% → scale down to 3 pods
...repeats every 3 minutes
During each scale-down, cold starts hit your few remaining pods. Users feel the impact. Scale-up triggers again. Loop.
Fix: Stabilization Windows
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60 # Wait 60s before scaling up again
      policies:
        - type: Pods
          value: 4 # Add max 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 MINUTES before scaling down
      policies:
        - type: Percent
          value: 10 # Remove max 10% of pods at a time
          periodSeconds: 60
```
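To see why the long scale-down window stops thrashing, here is a minimal sketch (illustrative, not the real HPA controller code) of the stabilization rule: the controller only scales down to the highest replica recommendation seen during the window, so a brief dip in load cannot shrink the deployment.

```typescript
// Sketch of HPA scale-down stabilization (illustrative, not the real controller).
// The controller records recent replica recommendations and scales down only
// to the MAXIMUM recommendation seen within the stabilization window.
interface Recommendation {
  t: number;        // seconds since start
  replicas: number; // replicas the raw metric asked for
}

function stabilizedReplicas(
  history: Recommendation[],
  now: number,
  windowSec: number
): number {
  const inWindow = history.filter((h) => now - h.t <= windowSec);
  return Math.max(...inWindow.map((h) => h.replicas));
}

// The thrashing timeline above: raw recommendations swing 10 -> 3 -> 12 -> 3,
// but with a 300s window the deployment never drops below the recent peak.
const history: Recommendation[] = [
  { t: 0, replicas: 10 },
  { t: 180, replicas: 3 },
  { t: 300, replicas: 12 },
  { t: 480, replicas: 3 },
];
console.log(stabilizedReplicas(history, 480, 300)); // stays at 12, no scale-down
```

With the 5-minute window, the 09:03 and 09:08 dips in the timeline above never trigger a scale-down at all.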
Problem 2: Scaling Too Slowly
Default CPU-based HPA reacts slowly because CPU averages lag behind actual load:
Traffic spike at 09:00
HPA evaluates every 15s → sees CPU spike at 09:00:15
Scales up decision at 09:00:15
New pods start at 09:03 (3 min startup + health check time)
Spike was over by 09:02
Result: scaling happened AFTER the spike. Useless.
Fix: Scale on Request Latency or Queue Depth (not just CPU)
```yaml
# Scale on request rate or queue depth using KEDA (Kubernetes Event-Driven Autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    # Scale on queue depth (react instantly when work piles up)
    - type: redis
      metadata:
        address: redis:6379
        listName: request-queue
        listLength: "100" # Scale up when queue > 100
    # Scale on HTTP request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "500" # Scale when RPS > 500
        query: rate(http_requests_total[1m])
```
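Under the hood, KEDA feeds these triggers to the HPA, which sizes the deployment roughly as ceil(metric / target), clamped to the min/max bounds. A quick sketch, with numbers taken from the config above:

```typescript
// Sketch: the replica count the autoscaler derives from a queue-depth trigger.
// desired = ceil(queueLength / targetPerReplica), clamped to [min, max].
function desiredReplicas(
  queueLength: number,
  targetPerReplica: number,
  minReplicas: number,
  maxReplicas: number
): number {
  const raw = Math.ceil(queueLength / targetPerReplica);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

console.log(desiredReplicas(1200, 100, 3, 50)); // 12 pods for 1,200 queued items
console.log(desiredReplicas(50, 100, 3, 50));   // clamped up to minReplicas = 3
```

Because queue depth reflects work that has already piled up, this reacts on the next evaluation cycle instead of waiting for CPU averages to climb.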
Problem 3: Connection Pool Exhaustion at Scale
Your app: max DB connections = 20 per instance
Scaled to: 200 instances (auto-scaled during traffic spike)
Total connections: 200 × 20 = 4,000 connections
PostgreSQL max_connections: 100
Result: 3,900 connection attempts fail → app crashes
Ironic: auto-scaling caused the crash it was supposed to prevent
Fix: Scale-Aware Connection Pooling
```javascript
// Dynamically adjust pool size based on number of replicas
import { Pool } from 'pg'

const TOTAL_DB_CONNECTIONS = 80 // Leave some for admin tools
const REPLICA_COUNT = parseInt(process.env.REPLICA_COUNT || '1', 10)
const poolSize = Math.max(2, Math.floor(TOTAL_DB_CONNECTIONS / REPLICA_COUNT))

const pool = new Pool({
  max: poolSize,
  min: 1,
})

console.log(`DB pool size: ${poolSize} (${REPLICA_COUNT} replicas)`)
```
```yaml
# Set the replica count as an environment variable via the Downward API.
# Note: pods have no built-in annotation with the current replica count,
# so a controller or webhook must keep this annotation up to date.
env:
  - name: REPLICA_COUNT
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['autoscaler.alpha.kubernetes.io/current-replicas']
```
Better fix: PgBouncer proxy
All instances → PgBouncer (manages real DB connections)
PgBouncer → PostgreSQL (fixed 20 real connections)
200 app instances each "claim" connections from PgBouncer
PgBouncer multiplexes to only 20 real DB connections
DB connection count stays constant regardless of scale
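The difference is easy to quantify. A sketch of the connection arithmetic from the scenario above:

```typescript
// Sketch: DB-side connection counts with and without a PgBouncer proxy.
function directConnections(instances: number, perInstancePool: number): number {
  // Every instance opens its own pool straight to Postgres.
  return instances * perInstancePool;
}

function proxiedConnections(backendPool: number): number {
  // PgBouncer multiplexes all client connections onto one fixed
  // backend pool, so the DB-side count ignores the instance count.
  return backendPool;
}

console.log(directConnections(200, 20)); // 4000, far past max_connections = 100
console.log(proxiedConnections(20));     // 20, constant at any scale
```

The trade-off: in transaction pooling mode, PgBouncer does not support session state (for example, prepared statements or session-level SET) across transactions, so check your app's query patterns first.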
Problem 4: Scaling Metrics That Lag Reality
CPU is a lagging indicator; it rises only after your app is already overwhelmed:
```javascript
// ✅ Expose custom metrics that reflect load BEFORE CPU spikes
import express from 'express'
import promClient from 'prom-client'

const app = express()

const activeRequests = new promClient.Gauge({
  name: 'http_active_requests',
  help: 'Number of requests currently being processed',
})

const requestQueueDepth = new promClient.Gauge({
  name: 'request_queue_depth',
  help: 'Requests waiting to be processed',
})

app.use((req, res, next) => {
  activeRequests.inc()
  res.on('finish', () => activeRequests.dec())
  next()
})

// Expose for Prometheus scraping
app.get('/metrics', (req, res) => {
  res.set('Content-Type', promClient.register.contentType)
  promClient.register.metrics().then((m) => res.end(m))
})
```
```yaml
# HPA on custom metric: active requests per pod
metrics:
  - type: Pods
    pods:
      metric:
        name: http_active_requests
      target:
        type: AverageValue
        averageValue: "50" # Scale when average > 50 active requests per pod
```
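Note that the HPA cannot read Prometheus directly; a metrics adapter must publish the gauge on the custom metrics API. With prometheus-adapter, a rule along these lines would expose it (names assumed from the example above; check the syntax for your adapter version):

```yaml
# prometheus-adapter rule exposing http_active_requests as a pod metric
# (assumed names; adjust to your deployment)
rules:
  - seriesQuery: 'http_active_requests{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      matches: "http_active_requests"
      as: "http_active_requests"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```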
Problem 5: Slow Node Provisioning
HPA adds pods, but there's no node to place them on, and the cluster autoscaler takes 2-5 minutes to provision one:
```yaml
# AWS EKS: Use Karpenter for fast node provisioning
# Karpenter can provision nodes in <30 seconds
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.large", "t3.xlarge"]
  limits:
    resources:
      cpu: "1000"
  ttlSecondsAfterEmpty: 30 # Remove empty nodes quickly
```
```yaml
# Pre-provision "buffer" capacity with low-priority placeholder pods
# (the common "overprovisioning" pattern). The scheduler preempts these
# pause pods whenever real workloads need room, so spare nodes stay warm.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that real workloads can preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 2 # Two "spare" pods' worth of headroom, always ready
  selector:
    matchLabels: { app: capacity-buffer }
  template:
    metadata:
      labels: { app: capacity-buffer }
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: { cpu: "1", memory: 2Gi }
```
Problem 6: Scale-Up But Not Scale-Down
The opposite problem: pods scale up but never scale down, and you pay for 50 pods at 2 AM:
```yaml
# Ensure scale-down is enabled and configured
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # 5 min cooldown
    policies:
      - type: Percent
        value: 25 # Remove max 25% at a time
        periodSeconds: 120
      - type: Pods
        value: 2 # Or max 2 pods at a time
        periodSeconds: 120
    selectPolicy: Min # Use the more conservative policy
```
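With selectPolicy: Min, the controller computes the step each policy would allow and takes the smaller one. A sketch with the numbers above:

```typescript
// Sketch: maximum pods removable in one period under selectPolicy: Min,
// given a 25% Percent policy and a 2-pod Pods policy (as configured above).
function maxScaleDownStep(currentReplicas: number): number {
  const byPercent = Math.floor(currentReplicas * 0.25); // Percent policy: 25%
  const byPods = 2;                                     // Pods policy: 2
  return Math.min(byPercent, byPods);                   // Min = more conservative
}

console.log(maxScaleDownStep(50)); // min(12, 2) = 2 pods per 120s period
console.log(maxScaleDownStep(4));  // min(1, 2) = 1 pod
```

So a 50-pod deployment drains at 2 pods per 2 minutes rather than collapsing at once, which keeps enough warm capacity if traffic returns.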
Auto-Scaling Tuning Checklist
| Parameter | Default | Recommendation |
|---|---|---|
| Scale-up stabilization | 0s | 30-60s |
| Scale-down stabilization | 300s | 300-600s |
| Max scale-up rate | 100% | 50-100% per minute |
| Max scale-down rate | 100% | 10-25% per minute |
| Metric: CPU target | 80% | 50-60% |
| Metric type | CPU | CPU + request rate + queue |
| Min replicas | 1 | 3 (for HA) |
Conclusion
Auto-scaling is not plug-and-play. It requires careful tuning of stabilization windows (to prevent thrashing), scale-up rate limits, the right metrics (request rate is better than CPU for APIs), and awareness of downstream resource limits like database connections. Get it right and auto-scaling is genuinely magical. Get it wrong and it creates incidents on its own. Tune these parameters based on your application's actual behavior patterns.