Underprovisioned Infrastructure Causing Downtime — When "Good Enough" Isn't

Introduction

Underprovisioning is the silent counterpart to overprovisioning. Overprovisioned infrastructure wastes money gradually; underprovisioned infrastructure fails suddenly. The t3.micro RDS that works perfectly in development will OOM under a production JOIN on a 10M-row table. The single-AZ instance that's "never had issues" in two years will have its first issue on the Friday of your product launch. The cost of underprovisioning isn't measured in dollars per month — it's measured in downtime, revenue loss, and customer trust.

How Underprovisioning Kills You

Underprovisioning failure patterns:

1. Memory exhaustion under real load
   → Dev: t3.micro with 1GB RAM, 100 rows per table
   → Prod: same instance, 10M rows, JOINs need 3GB RAM
   → OOM killer terminates PostgreSQL mid-query
   → Connection pool errors → app returns 500s

2. CPU throttling at critical moments
   → t3/t4g instances use burstable CPU (CPU credits)
   → Credits depleted during a traffic spike
   → CPU throttled to baseline (as low as 5%) → requests time out

3. Single-AZ downtime
   → AZ failure: 2-4 hours of RDS unavailability
   → "It's never happened" is not an architecture

4. Disk I/O bottleneck
   → gp2 volume: 100 IOPS baseline (bursts to 3,000)
   → Burst credits run out → sustained 100 IOPS
   → Write-heavy workload: queries queue, latency spikes

5. Connection exhaustion
   → RDS db.t3.micro: max_connections ≈ 85
   → 10 app instances × 10 connections each = 100 → FATAL: too many connections
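The connection-exhaustion arithmetic above can be sketched as a quick pre-launch check. This is a rough model using the article's approximate figure of ~85 max_connections per GB of RDS RAM (the real limit varies by engine and is set by the parameter group), with hypothetical helper names:

```typescript
// Rough estimate: ~85 max_connections per GB of RDS instance RAM.
// This is an approximation, not the exact RDS formula.
function estimateMaxConnections(instanceRamGb: number): number {
  return Math.floor(instanceRamGb * 85)
}

// Total server connections your fleet will open.
function connectionDemand(appInstances: number, poolSizePerInstance: number): number {
  return appInstances * poolSizePerInstance
}

// 10 app instances × 10 connections each, against a 1GB db.t3.micro:
const limit = estimateMaxConnections(1)   // ≈ 85
const demand = connectionDemand(10, 10)   // 100
console.log(demand > limit
  ? `Demand ${demand} exceeds limit ${limit}: expect "FATAL: too many connections"`
  : `Demand ${demand} fits within limit ${limit}`)
```

Run this with your real fleet size before every scale-out, not after the first FATAL error.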

Fix 1: Load Test Before You Go Live

# Don't discover underprovisioning during a real incident
# Discover it in staging with production-scale data

# k6 load test to find resource limits
k6 run --vus 100 --duration 10m loadtest.js

# Watch these during the test:
# RDS CloudWatch:
# - CPUUtilization (should stay < 70%)
# - FreeableMemory (should stay > 20% of instance RAM)
# - DatabaseConnections (should stay under max_connections * 0.8)

# ECS CloudWatch:
# - CPUUtilization (should stay < 70% to leave room for spikes)
# - MemoryUtilization (should stay < 80%)
// Establish resource baselines BEFORE a traffic spike hits.
// getMetric, INSTANCE_RAM, and MAX_CONNECTIONS are app-specific
// helpers/constants you'd define around your CloudWatch client.
async function captureResourceBaseline() {
  const metrics = await Promise.all([
    getMetric('AWS/RDS', 'CPUUtilization', 'myapp-prod'),
    getMetric('AWS/RDS', 'FreeableMemory', 'myapp-prod'),
    getMetric('AWS/RDS', 'DatabaseConnections', 'myapp-prod'),
    getMetric('AWS/ECS', 'CPUUtilization', 'production/myapp-api'),
    getMetric('AWS/ECS', 'MemoryUtilization', 'production/myapp-api'),
  ])

  const [rdsCpu, rdsRam, rdsConns, ecsCpu, ecsMem] = metrics

  console.log('Current resource utilization:')
  console.log(`  RDS CPU:     ${rdsCpu}% (warn at 70%)`)
  console.log(`  RDS RAM:     ${((1 - rdsRam / INSTANCE_RAM) * 100).toFixed(0)}% used`)
  console.log(`  RDS Conns:   ${rdsConns} / ${MAX_CONNECTIONS}`)
  console.log(`  ECS CPU:     ${ecsCpu}%`)
  console.log(`  ECS Memory:  ${ecsMem}%`)

  // If any are above 60%, you have little headroom for traffic spikes
}

Fix 2: Never Use Burstable Instances for Production Databases

# ❌ Burstable instances (t3, t4g) for RDS:
# - CPU credit system: earns credits when idle, spends during load
# - When credits run out: throttled to 5-20% of base CPU
# - Happens at the worst time: sustained production load

# ✅ Use General Purpose (m) or Memory Optimized (r) for production:
# m6g.large:  2 vCPU, 8GB RAM   → $0.156/hr  (no burst limits)
# r6g.large:  2 vCPU, 16GB RAM  → $0.192/hr  (better for DB workloads)
# vs
# t4g.medium: 2 vCPU, 4GB RAM   → $0.068/hr  (looks cheaper, fails under load)

# The $0.124/hr difference ($89/month) is the cost of never having
# a CPU throttling incident on your primary database.

# Check if your instance is running out of CPU credits:
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUCreditBalance \
  --dimensions Name=DBInstanceIdentifier,Value=myapp-prod \
  --start-time "$(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --statistics Minimum
# If minimum ever hits 0: you're getting throttled
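The "minimum ever hits 0" check can be automated once you have the datapoints in hand. A minimal sketch, assuming you've already parsed the CLI output above into objects (the `timestamp`/`minimum` field names here are my simplification, not the raw CloudWatch response shape):

```typescript
// Simplified datapoint shape, parsed from the get-metric-statistics output.
interface CreditDatapoint {
  timestamp: string
  minimum: number
}

// Flag whether the instance was throttled (or close to it) in the window.
function creditBalanceVerdict(datapoints: CreditDatapoint[]): string {
  const lowest = Math.min(...datapoints.map(d => d.minimum))
  if (lowest <= 0) return 'THROTTLED: credit balance hit zero — move off burstable instances'
  if (lowest < 50) return `WARNING: balance dipped to ${lowest} credits — close to throttling`
  return `OK: balance never fell below ${lowest} credits`
}

console.log(creditBalanceVerdict([
  { timestamp: '2024-01-01T00:00:00Z', minimum: 144 },
  { timestamp: '2024-01-01T01:00:00Z', minimum: 12 },
]))
```

The 50-credit warning threshold is arbitrary; tune it to how fast your workload burns credits.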

Fix 3: Minimum Viable Redundancy for Every Stateful Resource

# Every production stateful component needs:
# 1. Multi-AZ or multi-region (not single point of failure)
# 2. Sufficient instance type (not burstable)
# 3. Enough connection headroom (PgBouncer)

# terraform/production.tf — minimum production configuration
resource "aws_db_instance" "postgres" {
  # ✅ NOT t3/t4g for production
  instance_class    = "db.r6g.large"

  # ✅ Multi-AZ for automatic failover
  multi_az          = true

  # ✅ Backup retention
  backup_retention_period = 14

  # ✅ Storage type that doesn't burst-and-throttle
  storage_type      = "gp3"
  iops              = 3000   # Dedicated IOPS, not burst

  # ✅ Storage alarm before you run out
  allocated_storage = 100
  # Set alarm at 80% usage: 80GB
}

resource "aws_cloudwatch_metric_alarm" "rds_storage_high" {
  alarm_name  = "rds-storage-high"
  namespace   = "AWS/RDS"
  metric_name = "FreeStorageSpace"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.postgres.id
  }

  comparison_operator = "LessThanThreshold"
  threshold           = 20 * 1024 * 1024 * 1024  # 20GB free
  evaluation_periods  = 2
  period              = 300
  statistic           = "Average"

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Fix 4: Connection Pool Sizing That Matches Instance Limits

// Database connection pool must be sized for the instance, not just the app

// RDS max_connections formula (rough):
// t3.micro  (1GB RAM):  max_connections ≈ 85
// t3.small  (2GB RAM):  max_connections ≈ 170
// r6g.large (16GB RAM): max_connections ≈ 1330

// With 10 app instances, each needing 20 connections:
// Total connections needed: 200
// t3.micro can handle: 85 → YOU WILL HIT FATAL: too many connections

// ✅ Solution 1: Use PgBouncer
// App connects to PgBouncer (which can accept thousands of client connections)
// PgBouncer maintains a small pool to RDS (e.g., 50 server connections)

// ✅ Solution 2: Size instance for connection count
// If you need 200 connections: use at least r6g.large or m6g.large

// ✅ Solution 3: Reduce pool size per instance
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,  // Not 20 or 50 — calculate: (max_connections * 0.8) / num_instances
  min: 2,
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 10_000,
})

// Alert when pool is near capacity
pool.on('connect', () => {
  if (pool.totalCount > pool.options.max! * 0.8) {
    logger.warn({ poolSize: pool.totalCount, max: pool.options.max },
      'Connection pool near capacity')
  }
})
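The sizing rule in the `max: 10` comment above can be made explicit. A small sketch of that formula — per-instance pool max = (max_connections × headroom) / number of app instances — with the function name being my own:

```typescript
// Per-instance pool size that keeps total fleet connections under
// max_connections × headroom (0.8 leaves room for admin/migration sessions).
function poolMaxPerInstance(
  maxConnections: number,
  numAppInstances: number,
  headroom = 0.8,
): number {
  return Math.max(1, Math.floor((maxConnections * headroom) / numAppInstances))
}

// db.t3.micro (max_connections ≈ 85) shared by 10 app instances:
console.log(poolMaxPerInstance(85, 10))   // 6 — not the default 10 or 20
// r6g.large (max_connections ≈ 1330) shared by 10 app instances:
console.log(poolMaxPerInstance(1330, 10)) // 106
```

Recompute this whenever you change the instance class or the autoscaling max, since both sides of the division move.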

Fix 5: Capacity Planning for Known Growth

// capacity-planner.ts — model resource needs before they become incidents
function calculateMinCapacity(params: {
  peakRequestsPerSecond: number
  avgResponseTimeMs: number
  cpuPerRequestMs: number  // CPU-ms consumed per request
  safetyFactor: number     // 2x = 50% headroom at peak
}) {
  const { peakRequestsPerSecond, avgResponseTimeMs, cpuPerRequestMs, safetyFactor } = params

  // CPU cores needed at peak
  const cpuCoresNeeded =
    (peakRequestsPerSecond * cpuPerRequestMs) / 1000 * safetyFactor

  // Memory rough estimate
  const concurrentRequests = peakRequestsPerSecond * (avgResponseTimeMs / 1000)

  console.log('Capacity requirements at peak load:')
  console.log(`  Peak RPS: ${peakRequestsPerSecond}`)
  console.log(`  CPU cores: ${cpuCoresNeeded.toFixed(1)} (with ${safetyFactor}x safety factor)`)
  console.log(`  Concurrent requests: ${concurrentRequests.toFixed(0)}`)
}

// Example: 500 RPS, 100ms avg response, 10ms CPU per request, 2x safety
calculateMinCapacity({
  peakRequestsPerSecond: 500,
  avgResponseTimeMs: 100,
  cpuPerRequestMs: 10,
  safetyFactor: 2,
})
// Output: 10 CPU cores needed → 5 x 2-vCPU instances minimum
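The final step — turning a core count into an instance count — is worth a helper too. A sketch under the same assumptions as `calculateMinCapacity` (the function name here is my own):

```typescript
// Translate a required core count into an instance count for a given size.
// Always round up: 2.5 instances means you need 3.
function instancesNeeded(cpuCoresNeeded: number, vcpusPerInstance: number): number {
  return Math.ceil(cpuCoresNeeded / vcpusPerInstance)
}

// 500 RPS × 10ms CPU per request × 2x safety factor = 10 cores needed:
console.log(instancesNeeded(10, 2)) // 5 × 2-vCPU instances
console.log(instancesNeeded(10, 4)) // 3 × 4-vCPU instances (ceil of 2.5)
```

Fewer, larger instances reduce per-instance overhead; more, smaller ones fail in smaller blast radii. Either way, provision for the ceiling, not the average.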

Underprovisioning Prevention Checklist

  • ✅ Load tested at 2x expected peak BEFORE launch, with production-scale data
  • ✅ No burstable instances (t3/t4g) for production databases
  • ✅ All stateful resources are multi-AZ with automatic failover
  • ✅ Database connection limits calculated and respected — PgBouncer if needed
  • ✅ gp3 storage with dedicated IOPS (not burst credits) for write-heavy workloads
  • ✅ CloudWatch alarms: CPU > 70%, memory > 80%, storage < 20% free
  • ✅ Capacity planning done before major launches or expected traffic growth

Conclusion

Underprovisioning is a false economy. The t3.micro database saves $150/month until it OOMs during a product launch and costs $50,000 in lost revenue and customer churn. The minimum viable production configuration for any stateful component is: a non-burstable instance type, multi-AZ redundancy, and CloudWatch alarms that fire before resources are exhausted. The cost difference between "will survive a traffic spike" and "might survive a traffic spike" is usually less than $200/month. That's not a cost optimization decision — it's a reliability decision.