Auto-Scaling Gone Wrong: When Your Scaler Makes Things Worse
Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Auto-scaling promises to handle any traffic spike automatically. But a misconfigured scaler can make incidents worse. It thrashes: scaling up, then immediately down, then up again every few minutes. Or it scales too slowly, spinning up instances 5 minutes after the spike. Or it scales to 200 instances and exhausts your database connection pool (20 connections × 200 instances = 4,000 connections, far beyond PostgreSQL's default max_connections of 100).
Auto-scaling gone wrong is often more dangerous than no auto-scaling at all.
- Problem 1: Scale Thrashing
- Problem 2: Scaling Too Slowly
- Problem 3: Connection Pool Exhaustion at Scale
- Problem 4: Scaling Metrics That Lag Reality
- Problem 5: Slow Node Provisioning
- Problem 6: Scale-Up But Not Scale-Down
- Auto-Scaling Tuning Checklist
- Conclusion
Problem 1: Scale Thrashing
The most common failure mode: the scaler oscillates up and down continuously:
09:00 — CPU 85% → scale up to 10 pods
09:03 — CPU 30% → scale down to 3 pods
09:05 — CPU 90% → scale up to 12 pods
09:08 — CPU 25% → scale down to 3 pods
...repeats every 3 minutes
During each scale-down, cold starts hit your few remaining pods. Users feel the impact. Scale-up triggers again. Loop.
Fix: Stabilization Windows
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60 # Wait 60s before scaling up again
      policies:
        - type: Pods
          value: 4 # Add max 4 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 MINUTES before scaling down
      policies:
        - type: Percent
          value: 10 # Remove max 10% of pods at a time
          periodSeconds: 60
```
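To see why the long scale-down window stops thrashing, here is a minimal sketch (illustrative, not the real HPA controller code) of the stabilization rule: the controller only scales down to the highest replica recommendation seen during the window, so a brief dip in load cannot shrink the deployment.

```typescript
// Sketch of HPA scale-down stabilization (illustrative, not the real controller).
// The controller records recent replica recommendations and scales down only
// to the MAXIMUM recommendation seen within the stabilization window.
interface Recommendation {
  t: number;        // seconds since start
  replicas: number; // replicas the raw metric asked for
}

function stabilizedReplicas(
  history: Recommendation[],
  now: number,
  windowSec: number
): number {
  const inWindow = history.filter((h) => now - h.t <= windowSec);
  return Math.max(...inWindow.map((h) => h.replicas));
}

// The thrashing timeline above: raw recommendations swing 10 -> 3 -> 12 -> 3,
// but with a 300s window the deployment never drops below the recent peak.
const history: Recommendation[] = [
  { t: 0, replicas: 10 },
  { t: 180, replicas: 3 },
  { t: 300, replicas: 12 },
  { t: 480, replicas: 3 },
];
console.log(stabilizedReplicas(history, 480, 300)); // stays at 12, no scale-down
```

With the 5-minute window, the 09:03 and 09:08 dips in the timeline above never trigger a scale-down at all.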
Problem 2: Scaling Too Slowly
Default CPU-based HPA reacts slowly because CPU averages lag behind actual load:
Traffic spike at 09:00
HPA evaluates every 15s → sees CPU spike at 09:00:15
Scales up decision at 09:00:15
New pods start at 09:03 (3 min startup + health check time)
Spike was over by 09:02
Result: scaling happened AFTER the spike. Useless.
Fix: Scale on Request Latency or Queue Depth (not just CPU)
```yaml
# Scale on request rate or queue depth using KEDA (Kubernetes Event-Driven Autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    # Scale on queue depth (react instantly when work piles up)
    - type: redis
      metadata:
        address: redis:6379
        listName: request-queue
        listLength: "100" # Scale up when queue > 100
    # Scale on HTTP request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "500" # Scale when RPS > 500
        query: rate(http_requests_total[1m])
```
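Under the hood, KEDA feeds these triggers to the HPA, which sizes the deployment roughly as ceil(metric / target), clamped to the min/max bounds. A quick sketch, with numbers taken from the config above:

```typescript
// Sketch: the replica count the autoscaler derives from a queue-depth trigger.
// desired = ceil(queueLength / targetPerReplica), clamped to [min, max].
function desiredReplicas(
  queueLength: number,
  targetPerReplica: number,
  minReplicas: number,
  maxReplicas: number
): number {
  const raw = Math.ceil(queueLength / targetPerReplica);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

console.log(desiredReplicas(1200, 100, 3, 50)); // 12 pods for 1,200 queued items
console.log(desiredReplicas(50, 100, 3, 50));   // clamped up to minReplicas = 3
```

Because queue depth reflects work that has already piled up, this reacts on the next evaluation cycle instead of waiting for CPU averages to climb.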
Problem 3: Connection Pool Exhaustion at Scale
Your app: max DB connections = 20 per instance
Scaled to: 200 instances (auto-scaled during traffic spike)
Total connections: 200 × 20 = 4,000 connections
PostgreSQL max_connections: 100
Result: 3,900 connection attempts fail → app crashes
Ironic: auto-scaling caused the crash it was supposed to prevent
Fix: Scale-Aware Connection Pooling
```javascript
// Dynamically adjust pool size based on number of replicas
import { Pool } from 'pg'

const TOTAL_DB_CONNECTIONS = 80 // Leave some for admin tools
const REPLICA_COUNT = parseInt(process.env.REPLICA_COUNT || '1', 10)
const poolSize = Math.max(2, Math.floor(TOTAL_DB_CONNECTIONS / REPLICA_COUNT))

const pool = new Pool({
  max: poolSize,
  min: 1,
})

console.log(`DB pool size: ${poolSize} (${REPLICA_COUNT} replicas)`)
```
```yaml
# Set the replica count as an environment variable via the Downward API.
# Note: pods have no built-in annotation with the current replica count,
# so a controller or webhook must keep this annotation up to date.
env:
  - name: REPLICA_COUNT
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['autoscaler.alpha.kubernetes.io/current-replicas']
```
Better fix: PgBouncer proxy
All instances → PgBouncer (manages real DB connections)
PgBouncer → PostgreSQL (fixed 20 real connections)
200 app instances each "claim" connections from PgBouncer
PgBouncer multiplexes to only 20 real DB connections
DB connection count stays constant regardless of scale
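The difference is easy to quantify. A sketch of the connection arithmetic from the scenario above:

```typescript
// Sketch: DB-side connection counts with and without a PgBouncer proxy.
function directConnections(instances: number, perInstancePool: number): number {
  // Every instance opens its own pool straight to Postgres.
  return instances * perInstancePool;
}

function proxiedConnections(backendPool: number): number {
  // PgBouncer multiplexes all client connections onto one fixed
  // backend pool, so the DB-side count ignores the instance count.
  return backendPool;
}

console.log(directConnections(200, 20)); // 4000, far past max_connections = 100
console.log(proxiedConnections(20));     // 20, constant at any scale
```

The trade-off: in transaction pooling mode, PgBouncer does not support session state (for example, prepared statements or session-level SET) across transactions, so check your app's query patterns first.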
Problem 4: Scaling Metrics That Lag Reality
CPU is a lagging indicator; it rises only after your app is already overwhelmed:
```javascript
// ✅ Expose custom metrics that reflect load BEFORE CPU spikes
import express from 'express'
import promClient from 'prom-client'

const app = express()

const activeRequests = new promClient.Gauge({
  name: 'http_active_requests',
  help: 'Number of requests currently being processed',
})

const requestQueueDepth = new promClient.Gauge({
  name: 'request_queue_depth',
  help: 'Requests waiting to be processed',
})

app.use((req, res, next) => {
  activeRequests.inc()
  res.on('finish', () => activeRequests.dec())
  next()
})

// Expose for Prometheus scraping
app.get('/metrics', (req, res) => {
  res.set('Content-Type', promClient.register.contentType)
  promClient.register.metrics().then((m) => res.end(m))
})
```
```yaml
# HPA on custom metric: active requests per pod
metrics:
  - type: Pods
    pods:
      metric:
        name: http_active_requests
      target:
        type: AverageValue
        averageValue: "50" # Scale when average > 50 active requests per pod
```
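Note that the HPA cannot read Prometheus directly; a metrics adapter must publish the gauge on the custom metrics API. With prometheus-adapter, a rule along these lines would expose it (names assumed from the example above; check the syntax for your adapter version):

```yaml
# prometheus-adapter rule exposing http_active_requests as a pod metric
# (assumed names; adjust to your deployment)
rules:
  - seriesQuery: 'http_active_requests{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: "namespace" }
        pod: { resource: "pod" }
    name:
      matches: "http_active_requests"
      as: "http_active_requests"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```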
Problem 5: Slow Node Provisioning
HPA adds pods, but there's no node to place them on, and the cluster autoscaler takes 2-5 minutes to provision one:
```yaml
# AWS EKS: Use Karpenter for fast node provisioning
# Karpenter can provision nodes in <30 seconds
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.large", "t3.xlarge"]
  limits:
    resources:
      cpu: "1000"
  ttlSecondsAfterEmpty: 30 # Remove empty nodes quickly
```
```yaml
# Pre-provision "buffer" capacity with low-priority placeholder pods
# (the common "overprovisioning" pattern). The scheduler preempts these
# pause pods whenever real workloads need room, so spare nodes stay warm.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that real workloads can preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: 2 # Two "spare" pods' worth of headroom, always ready
  selector:
    matchLabels: { app: capacity-buffer }
  template:
    metadata:
      labels: { app: capacity-buffer }
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: { cpu: "1", memory: 2Gi }
```
Problem 6: Scale-Up But Not Scale-Down
The opposite problem: pods scale up but never scale down, and you pay for 50 pods at 2 AM:
```yaml
# Ensure scale-down is enabled and configured
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # 5 min cooldown
    policies:
      - type: Percent
        value: 25 # Remove max 25% at a time
        periodSeconds: 120
      - type: Pods
        value: 2 # Or max 2 pods at a time
        periodSeconds: 120
    selectPolicy: Min # Use the more conservative policy
```
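With selectPolicy: Min, the controller computes the step each policy would allow and takes the smaller one. A sketch with the numbers above:

```typescript
// Sketch: maximum pods removable in one period under selectPolicy: Min,
// given a 25% Percent policy and a 2-pod Pods policy (as configured above).
function maxScaleDownStep(currentReplicas: number): number {
  const byPercent = Math.floor(currentReplicas * 0.25); // Percent policy: 25%
  const byPods = 2;                                     // Pods policy: 2
  return Math.min(byPercent, byPods);                   // Min = more conservative
}

console.log(maxScaleDownStep(50)); // min(12, 2) = 2 pods per 120s period
console.log(maxScaleDownStep(4));  // min(1, 2) = 1 pod
```

So a 50-pod deployment drains at 2 pods per 2 minutes rather than collapsing at once, which keeps enough warm capacity if traffic returns.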
Auto-Scaling Tuning Checklist
| Parameter | Default | Recommendation |
|---|---|---|
| Scale-up stabilization | 0s | 30-60s |
| Scale-down stabilization | 300s | 300-600s |
| Max scale-up rate | 100% | 50-100% per minute |
| Max scale-down rate | 100% | 10-25% per minute |
| Metric: CPU target | 80% | 50-60% |
| Metric type | CPU | CPU + request rate + queue |
| Min replicas | 1 | 3 (for HA) |
Conclusion
Auto-scaling is not plug-and-play. It requires careful tuning of stabilization windows (to prevent thrashing), scale-up rate limits, the right metrics (request rate is better than CPU for APIs), and awareness of downstream resource limits like database connections. Get it right and auto-scaling is genuinely magical. Get it wrong and it creates incidents on its own. Tune these parameters based on your application's actual behavior patterns.