Cloud Cost Explosion — The $47,000 AWS Bill That Nobody Saw Coming

Introduction

Cloud cost explosions happen when resource consumption grows faster than anyone is watching. An engineer adds S3 replication for durability. A feature launches that stores large files. Auto-scaling adds instances that never scale back down. Data transfer charges accumulate. Each individual decision was reasonable; together, they compound into a bill that makes the CFO call the CTO at 6 AM. The fix isn't about being cheap — it's about having visibility before the bill arrives.

Common Sources of Unexpected Cloud Bills

Typical surprise cost items:

1. Data transfer out — often the biggest surprise
EC2→Internet: $0.09/GB
CDN-bypassed image serving: 500GB/day = $1,350/month
Multi-region replication: data crosses regions

2. NAT Gateway — invisible until the bill arrives
   → $0.045/GB through NAT Gateway
All your private instances send traffic through it
Lambda functions in a VPC: every outbound call to the internet goes through the NAT Gateway

3. RDS storage and I/O
Storage autoscales up but doesn't scale down
Provisioned IOPS: $0.10/GB-month + $0.065/IOPS-month
Multi-AZ doubles storage costs

4. Orphaned resources
Load balancers with no targets: the hourly base charge accrues even when idle, plus $0.008/LCU-hour
Unattached EBS volumes: $0.10/GB-month
Old snapshots: $0.05/GB-month, accumulate forever

5. S3 request costs
S3 PUT: $0.005/1000 requests
   → 100M PUTs/day on a log bucket = $15,000/month
Batching writes into fewer, larger objects cuts this; lifecycle policies reduce storage costs, not request costs
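
The per-unit prices above turn into monthly totals quickly. As a rough sketch, the helpers below reproduce the two worked figures from the list. The function names and the flat 30-day month are assumptions made for illustration, and the rates are the ones quoted above rather than live pricing.

```typescript
// Rates quoted in the list above (USD); check current pricing for your region.
const EC2_TO_INTERNET_PER_GB = 0.09
const S3_PUT_PER_1000 = 0.005
const DAYS_PER_MONTH = 30 // assumption: flat 30-day month

// Monthly cost of serving `gbPerDay` straight from EC2 to the internet.
function monthlyTransferCost(gbPerDay: number, ratePerGb = EC2_TO_INTERNET_PER_GB): number {
  return gbPerDay * DAYS_PER_MONTH * ratePerGb
}

// Monthly cost of `putsPerDay` S3 PUT requests.
function monthlyPutCost(putsPerDay: number, ratePer1000 = S3_PUT_PER_1000): number {
  return (putsPerDay / 1000) * DAYS_PER_MONTH * ratePer1000
}

console.log(monthlyTransferCost(500).toFixed(2))      // 1350.00
console.log(monthlyPutCost(100_000_000).toFixed(2))   // 15000.00
```

Running the same helpers against your own traffic numbers before a launch is a cheap way to spot a five-figure line item while it is still hypothetical.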

Fix 1: AWS Budget Alerts Before the Damage Is Done

# Set budget alerts — should be the FIRST thing you do in a new AWS account

# Create a monthly budget that alerts at 80% of actual spend and at 100% of forecasted spend
aws budgets create-budget \
  --account-id "$AWS_ACCOUNT_ID" \
  --budget '{
    "BudgetName": "Monthly-Total",
    "BudgetLimit": {
      "Amount": "5000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "engineering@myapp.com"},
        {"SubscriptionType": "SNS", "Address": "arn:aws:sns:us-east-1:123:cost-alerts"}
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "cto@myapp.com"}
      ]
    }
  ]'
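
The checklist at the end of this post recommends alerts at 50%, 80%, and 100%, which means more notification entries than are comfortable to write by hand. One possible approach is to generate the JSON. In this sketch `buildNotifications` and the `BudgetNotification` type are hypothetical helpers, not part of any AWS SDK; only the field names mirror the CLI payload used above.

```typescript
// Shape of one entry in --notifications-with-subscribers, limited to the fields used in this post.
interface BudgetNotification {
  Notification: {
    NotificationType: 'ACTUAL' | 'FORECASTED'
    ComparisonOperator: 'GREATER_THAN'
    Threshold: number
    ThresholdType: 'PERCENTAGE'
  }
  Subscribers: { SubscriptionType: 'EMAIL' | 'SNS'; Address: string }[]
}

// Hypothetical helper: one ACTUAL alert per threshold plus a single FORECASTED alert.
function buildNotifications(
  actualThresholds: number[],
  forecastThreshold: number,
  email: string,
): BudgetNotification[] {
  const entry = (type: 'ACTUAL' | 'FORECASTED', threshold: number): BudgetNotification => ({
    Notification: {
      NotificationType: type,
      ComparisonOperator: 'GREATER_THAN',
      Threshold: threshold,
      ThresholdType: 'PERCENTAGE',
    },
    Subscribers: [{ SubscriptionType: 'EMAIL', Address: email }],
  })
  return [
    ...actualThresholds.map(t => entry('ACTUAL', t)),
    entry('FORECASTED', forecastThreshold),
  ]
}

// Print JSON that can be passed to --notifications-with-subscribers.
console.log(JSON.stringify(buildNotifications([50, 80, 100], 100, 'engineering@myapp.com'), null, 2))
```

The early thresholds are there to give you time: a 50% alert on day 10 of the month says more than a 100% alert on day 29.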

Fix 2: Cost Anomaly Detection

// cost-monitor.ts: daily cost check with anomaly detection
// `alerting` and `logger` are the application's own alerting and logging modules
import { CostExplorerClient, GetCostAndUsageCommand } from '@aws-sdk/client-cost-explorer'

const ce = new CostExplorerClient({ region: 'us-east-1' })
const DAY_MS = 24 * 60 * 60 * 1000

async function checkDailyCostAnomaly() {
  // End is exclusive: an 8-day window ending today covers yesterday plus the 7 days before it
  const today = new Date().toISOString().split('T')[0]
  const windowStart = new Date(Date.now() - 8 * DAY_MS).toISOString().split('T')[0]

  const { ResultsByTime } = await ce.send(new GetCostAndUsageCommand({
    TimePeriod: { Start: windowStart, End: today },
    Granularity: 'DAILY',
    Metrics: ['BlendedCost'],
    GroupBy: [{ Type: 'DIMENSION', Key: 'SERVICE' }],
  }))

  // Total is empty when GroupBy is set, so sum each day's per-service groups instead
  const dailyCosts = (ResultsByTime ?? []).map(day => {
    const byService: Record<string, number> = Object.fromEntries(
      (day.Groups ?? []).map(g => [
        g.Keys![0],
        parseFloat(g.Metrics!.BlendedCost.Amount!),
      ])
    )
    const total = Object.values(byService).reduce((sum, cost) => sum + cost, 0)
    return { date: day.TimePeriod!.Start, total, byService }
  })
  if (dailyCosts.length < 2) return  // Not enough history to compare

  // Calculate the 7-day average and compare it to yesterday
  const last7 = dailyCosts.slice(0, -1)
  const yesterday = dailyCosts[dailyCosts.length - 1]
  const avg7day = last7.reduce((sum, d) => sum + d.total, 0) / last7.length

  const anomalyThreshold = avg7day * 1.5  // 50% above average

  if (yesterday.total > anomalyThreshold) {
    await alerting.critical(
      `Cost anomaly: yesterday was $${yesterday.total.toFixed(2)} ` +
      `vs 7-day avg $${avg7day.toFixed(2)}. Investigate!`
    )

    // Find which services spiked (more than 2x their own average)
    const avgByService: Record<string, number> = {}
    last7.forEach(day => {
      Object.entries(day.byService).forEach(([svc, cost]) => {
        avgByService[svc] = (avgByService[svc] ?? 0) + cost / last7.length
      })
    })

    const spikes = Object.entries(yesterday.byService)
      .filter(([svc, cost]) => cost > (avgByService[svc] ?? 0) * 2)
      .sort(([, a], [, b]) => b - a)

    if (spikes.length > 0) {
      logger.warn({ spikes }, 'Services with cost spikes')
    }
  }
}
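
The comparison itself is easier to test when it is separated from the Cost Explorer call. Here is a minimal sketch of that idea: `detectAnomaly` and `DayCost` are hypothetical names, and the 1.5 factor is simply the 50% threshold used in this post.

```typescript
interface DayCost {
  date: string
  total: number
}

// Compares the most recent day against the average of all earlier days.
// Returns null when there is not enough history to form a baseline.
function detectAnomaly(history: DayCost[], factor = 1.5) {
  if (history.length < 2) return null
  const baseline = history.slice(0, -1)
  const latest = history[history.length - 1]
  const average = baseline.reduce((sum, d) => sum + d.total, 0) / baseline.length
  return { latest, average, isAnomaly: latest.total > average * factor }
}

// Seven quiet days at $100, then a $160 day.
const sample: DayCost[] = [
  ...Array.from({ length: 7 }, (_, i) => ({ date: `2024-01-0${i + 1}`, total: 100 })),
  { date: '2024-01-08', total: 160 },
]
console.log(detectAnomaly(sample)?.isAnomaly) // true
```

A fixed multiplier is crude: it will fire on legitimate launch days and miss slow, steady growth. It still catches the overnight doubling that produces the worst surprises.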

Fix 3: Resource Tagging for Cost Attribution

// Every resource must have environment, team, and service tags
// This lets you answer "which team is spending $20k/month?"

// terraform/tagging.tf
locals {
  common_tags = {
    Environment = var.environment   # "production", "staging"
    Team        = var.team          # "platform", "payments", "growth"
    Service     = var.service       # "api", "worker", "ml-inference"
    ManagedBy   = "terraform"
    CostCenter  = var.cost_center   # For financial attribution
  }
}

resource "aws_instance" "api" {
  # ...
  tags = merge(local.common_tags, {
    Name = "${var.service}-api"
  })
}

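// AWS Cost Explorer grouped by the Team tag
// (activate Team as a cost allocation tag in the Billing console first, or it will not appear):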
// Team=platform:  $8,000/month
// Team=payments:  $3,500/month
// Team=growth:    $1,200/month
// Untagged:       $4,800/month  ← investigate these
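
Tagging only works if it is enforced, so a small check in CI or in a scheduled audit can help. This is a sketch under assumptions: `missingTags` is a hypothetical helper, the key names come from `common_tags` above, and treating all four as mandatory is a choice you may not share.

```typescript
// Tag keys taken from common_tags above; treating all four as mandatory is an assumption.
const REQUIRED_TAGS = ['Environment', 'Team', 'Service', 'CostCenter']

// Returns the required keys that are absent or blank on a resource.
function missingTags(tags: Record<string, string | undefined>): string[] {
  return REQUIRED_TAGS.filter(key => !(tags[key] ?? '').trim())
}

// Logs the two missing keys: Service and CostCenter.
console.log(missingTags({ Environment: 'production', Team: 'payments' }))
```

Feeding this the tag map of every resource in an inventory export gives you the "Untagged" list before it shows up as a line in Cost Explorer.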

Fix 4: Automatic Cleanup of Orphaned Resources

// orphan-cleaner.ts: runs weekly and flags unused resources for review (it does not delete anything)
// Uses the AWS SDK v2 clients; `alerting` is the application's own alerting module
import { EC2, ELBv2 } from 'aws-sdk'

const ec2 = new EC2()
const elbv2 = new ELBv2()

async function findOrphanedResources() {
  const orphans: string[] = []

  // Unattached EBS volumes
  const volumes = await ec2.describeVolumes({
    Filters: [{ Name: 'status', Values: ['available'] }],  // Not attached
  }).promise()

  for (const vol of volumes.Volumes ?? []) {
    // CreateTime is when the volume was created; the API does not report when it was detached
    const ageHours = (Date.now() - new Date(vol.CreateTime!).getTime()) / 3600000
    if (ageHours > 24) {
      orphans.push(`EBS volume ${vol.VolumeId}: ${vol.Size}GB, unattached, created ${ageHours.toFixed(0)}h ago`)
    }
  }

  // Load balancers whose target groups have no registered targets
  const lbs = await elbv2.describeLoadBalancers({}).promise()
  for (const lb of lbs.LoadBalancers ?? []) {
    const targetGroups = await elbv2.describeTargetGroups({
      LoadBalancerArn: lb.LoadBalancerArn,
    }).promise()

    for (const tg of targetGroups.TargetGroups ?? []) {
      const health = await elbv2.describeTargetHealth({
        TargetGroupArn: tg.TargetGroupArn!,
      }).promise()

      // An empty list means nothing is registered, not merely that targets are unhealthy
      if ((health.TargetHealthDescriptions ?? []).length === 0) {
        orphans.push(`Load Balancer ${lb.LoadBalancerName}: target group ${tg.TargetGroupName} has no registered targets`)
      }
    }
  }

  if (orphans.length > 0) {
    await alerting.warn(`Orphaned resources found:\n${orphans.join('\n')}`)
  }

  return orphans
}
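
A list of orphans gets more attention when it comes with a dollar figure. As a rough sketch, the helper below prices unattached volumes at the $0.10/GB-month rate quoted earlier in this post; `monthlyVolumeWaste` and `OrphanVolume` are hypothetical names, and the rate will differ by volume type and region.

```typescript
interface OrphanVolume {
  volumeId: string
  sizeGb: number
}

// Rate quoted earlier in this post (USD per GB-month); an assumption, not live pricing.
const EBS_RATE_PER_GB_MONTH = 0.10

// Estimated monthly spend on volumes that are attached to nothing.
function monthlyVolumeWaste(volumes: OrphanVolume[], ratePerGbMonth = EBS_RATE_PER_GB_MONTH): number {
  const totalGb = volumes.reduce((sum, v) => sum + v.sizeGb, 0)
  return totalGb * ratePerGbMonth
}

const found: OrphanVolume[] = [
  { volumeId: 'vol-example-1', sizeGb: 100 },
  { volumeId: 'vol-example-2', sizeGb: 500 },
]
console.log(`$${monthlyVolumeWaste(found).toFixed(2)}/month`) // $60.00/month
```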

Fix 5: S3 and Data Transfer Optimization

# S3 lifecycle policies to prevent infinite growth
# Apply to log buckets, temp files, old exports
# This rule moves objects to Standard-IA after 30 days (cheaper for infrequent access),
# to Glacier after 90 days (archival), and deletes them after 1 year.
# JSON does not allow comments, and each rule needs a Filter; an empty prefix matches every object.

aws s3api put-bucket-lifecycle-configuration \
  --bucket myapp-logs \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "log-retention",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Transitions": [
          {
            "Days": 30,
            "StorageClass": "STANDARD_IA"
          },
          {
            "Days": 90,
            "StorageClass": "GLACIER"
          }
        ],
        "Expiration": {
          "Days": 365
        }
      }
    ]
  }'
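
# Reduce data transfer: use CloudFront in front of S3 and EC2
# Data transfer CloudFront→Internet: about $0.085/GB (list price, first 10 TB, US/Europe)
# Data transfer EC2→Internet: $0.09/GB
# The per-GB rate is only slightly lower. The real savings come from CloudFront's free tier,
# free origin→CloudFront transfer, and cached responses that never reach your origin.

# Also: serve static assets from CDN, not your app servers
# Even for a small app: 100GB/month of images
# Without CDN: $9/month
# With CDN: about $8.50/month at list price before any free tier, plus cleaner separation of concerns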
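
To sanity-check a lifecycle rule before applying it, it can help to ask where an object of a given age would end up. The sketch below models the 30/90/365-day rule from this section; `stateAtAge` and `LifecycleRule` are hypothetical names, and the model ignores the fact that S3 applies lifecycle actions asynchronously, so real transitions may lag the configured day.

```typescript
interface LifecycleRule {
  transitions: { days: number; storageClass: string }[]
  expirationDays: number
}

// Mirrors the log-retention rule from this section.
const logRetention: LifecycleRule = {
  transitions: [
    { days: 30, storageClass: 'STANDARD_IA' },
    { days: 90, storageClass: 'GLACIER' },
  ],
  expirationDays: 365,
}

// Storage class an object of `ageDays` would be in, or 'EXPIRED' once it is past expiration.
function stateAtAge(rule: LifecycleRule, ageDays: number): string {
  if (ageDays >= rule.expirationDays) return 'EXPIRED'
  let current = 'STANDARD'
  // Walk transitions in ascending order so the latest applicable one wins.
  for (const t of [...rule.transitions].sort((a, b) => a.days - b.days)) {
    if (ageDays >= t.days) current = t.storageClass
  }
  return current
}

for (const age of [10, 45, 120, 400]) {
  console.log(age, stateAtAge(logRetention, age))
}
```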

Cost Control Checklist

  • ✅ Budget alerts set at 50%, 80%, 100% of expected monthly spend
  • ✅ Cost anomaly detection alert fires when daily cost is 50%+ above 7-day average
  • ✅ Every resource tagged with environment, team, and service
  • ✅ Weekly orphan resource scan — unattached volumes, empty load balancers, old snapshots
  • ✅ S3 lifecycle policies on all log and temp buckets
  • ✅ CloudFront CDN in front of all user-facing content
  • ✅ Monthly cost review with team breakdown by tag
  • ✅ RDS storage monitoring — prevent runaway autoscaling

Conclusion

Cloud cost explosions are a visibility problem. The bill is the last place you should learn about cost growth. The first place should be a daily anomaly alert that fires when yesterday's cost is 50% above last week's average. Then comes attribution: every resource tagged so you can answer "which team and service is driving this growth?" The cleanup mechanisms — lifecycle policies, orphan scanners, auto-scaling reviews — prevent the gradual accumulation of forgotten resources. Budget alerts buy you time; tagging tells you where to look; cleanup keeps the baseline honest.