Cloud Cost Explosion — The $47,000 AWS Bill That Nobody Saw Coming
- Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Cloud cost explosions happen when resource consumption grows faster than anyone is watching. An engineer adds S3 replication for durability. A feature launches that stores large files. Auto-scaling adds instances that never scale back down. Data transfer charges accumulate. Each individual decision was reasonable; together, they compound into a bill that makes the CFO call the CTO at 6 AM. The fix isn't about being cheap — it's about having visibility before the bill arrives.
- Common Sources of Unexpected Cloud Bills
- Fix 1: AWS Budget Alerts Before the Damage Is Done
- Fix 2: Cost Anomaly Detection
- Fix 3: Resource Tagging for Cost Attribution
- Fix 4: Automatic Cleanup of Orphaned Resources
- Fix 5: S3 and Data Transfer Optimization
- Cost Control Checklist
- Conclusion
Common Sources of Unexpected Cloud Bills
Typical surprise cost items:
1. Data transfer out — often the biggest surprise
→ EC2 → Internet: $0.09/GB
→ CDN-bypassed image serving: 500GB/day = $1,350/month
→ Multi-region replication: data crosses regions
2. NAT Gateway — invisible until the bill arrives
→ $0.045/GB through NAT Gateway
→ All your private instances send traffic through it
→ Lambda functions in a VPC: every outbound call to the internet, or to an AWS service without a VPC endpoint, goes through NAT
3. RDS storage and I/O
→ Storage autoscales up but doesn't scale down
→ Provisioned IOPS: $0.10/GB-month + $0.065/IOPS-month
→ Multi-AZ doubles storage costs
4. Orphaned resources
→ Load balancers with no targets: a fixed hourly charge applies even when idle, plus $0.008/LCU-hour for any traffic
→ Unattached EBS volumes: $0.10/GB-month
→ Old snapshots: $0.05/GB-month, accumulate forever
5. S3 request costs
→ S3 PUT: $0.005/1000 requests
→ 100M PUTs/day on a log bucket = $15,000/month
→ Batching many small writes into fewer, larger objects cuts this; lifecycle policies reduce storage cost, not request cost
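The arithmetic behind two of the figures above can be checked with a short sketch. The function names and the 30-day month are assumptions made for illustration; the per-unit prices are the ones quoted in the list and may differ by region or change over time.

```typescript
// Rough monthly cost estimates using the per-unit prices quoted above.
// DAYS_PER_MONTH and both function names are illustrative assumptions.
const DAYS_PER_MONTH = 30

// Data transfer out: GB per day * price per GB * days in the month
function monthlyTransferCost(gbPerDay: number, pricePerGb: number): number {
  return gbPerDay * pricePerGb * DAYS_PER_MONTH
}

// Request charges: S3 prices PUTs per 1,000 requests
function monthlyPutCost(putsPerDay: number, pricePer1000: number): number {
  return (putsPerDay / 1000) * pricePer1000 * DAYS_PER_MONTH
}

const imageServing = monthlyTransferCost(500, 0.09)
const logBucketPuts = monthlyPutCost(100_000_000, 0.005)

console.log(`Image serving without a CDN: $${imageServing.toFixed(2)}/month`)
console.log(`Log bucket PUT requests: $${logBucketPuts.toFixed(2)}/month`)
```

Plugging in your own traffic numbers before a launch is a cheap way to spot which line item will dominate the bill.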
Fix 1: AWS Budget Alerts Before the Damage Is Done
# Set budget alerts — should be the FIRST thing you do in a new AWS account
# Create a monthly budget that alerts at 80% of actual spend and at 100% of forecasted spend
# Add more notification entries for other thresholds (for example 50% and 120%)
aws budgets create-budget \
  --account-id "$AWS_ACCOUNT_ID" \
  --budget '{
    "BudgetName": "Monthly-Total",
    "BudgetLimit": {
      "Amount": "5000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "engineering@myapp.com"},
        {"SubscriptionType": "SNS", "Address": "arn:aws:sns:us-east-1:123:cost-alerts"}
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "cto@myapp.com"}
      ]
    }
  ]'
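To make the difference between the ACTUAL and FORECASTED notification types concrete, here is a minimal sketch using a naive linear projection. This is not how AWS Budgets computes its forecast; it only illustrates why a forecast alert can fire well before actual spend reaches the limit. `projectMonthEnd` and `crossedThresholds` are hypothetical helper names, and the month-to-date figure is made up.

```typescript
// Naive month-end projection: assumes spend continues at the month-to-date daily rate.
// Illustration only; this is NOT the AWS Budgets forecasting model.
function projectMonthEnd(monthToDate: number, dayOfMonth: number, daysInMonth: number): number {
  if (dayOfMonth < 1 || dayOfMonth > daysInMonth) {
    throw new RangeError('dayOfMonth must be between 1 and daysInMonth')
  }
  return (monthToDate / dayOfMonth) * daysInMonth
}

// Returns the percentage thresholds that the given amount has exceeded.
function crossedThresholds(amount: number, budget: number, thresholds: number[]): number[] {
  const percentUsed = (amount / budget) * 100
  return thresholds.filter(t => percentUsed > t)
}

const budgetLimit = 5000    // same limit as the budget above
const monthToDate = 2400    // hypothetical spend on day 10 of a 30-day month
const projected = projectMonthEnd(monthToDate, 10, 30)

const actualCrossed = crossedThresholds(monthToDate, budgetLimit, [50, 80, 100])
const forecastCrossed = crossedThresholds(projected, budgetLimit, [50, 80, 100])

console.log('Actual spend crossed:', actualCrossed)       // none yet (48% of budget)
console.log('Projected spend crossed:', forecastCrossed)  // all three (144% of budget)
```

In this example the actual-spend alert would stay silent for several more days, while the forecast already says the month will end well over budget.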
Fix 2: Cost Anomaly Detection
// cost-monitor.ts: daily cost check with anomaly detection
import { CostExplorerClient, GetCostAndUsageCommand } from '@aws-sdk/client-cost-explorer'
// `alerting` and `logger` are this app's own notification and logging helpers
// (assumed to exist; the import paths are placeholders)
import { alerting } from './alerting'
import { logger } from './logger'

const ce = new CostExplorerClient({ region: 'us-east-1' })
const DAY_MS = 24 * 60 * 60 * 1000
const isoDate = (d: Date) => d.toISOString().split('T')[0]

async function checkDailyCostAnomaly() {
  // End is exclusive, so starting 8 days ago returns 7 baseline days plus yesterday
  const today = isoDate(new Date())
  const start = isoDate(new Date(Date.now() - 8 * DAY_MS))

  const { ResultsByTime } = await ce.send(new GetCostAndUsageCommand({
    TimePeriod: { Start: start, End: today },
    Granularity: 'DAILY',
    Metrics: ['BlendedCost'],
    GroupBy: [{ Type: 'DIMENSION', Key: 'SERVICE' }],
  }))

  const dailyCosts = (ResultsByTime ?? []).map(day => {
    const byService: Record<string, number> = Object.fromEntries(
      (day.Groups ?? []).map(g => [
        g.Keys![0],
        parseFloat(g.Metrics!.BlendedCost.Amount!),
      ])
    )
    // With GroupBy set, Cost Explorer leaves `Total` empty, so sum the groups instead
    const total = Object.values(byService).reduce((sum, cost) => sum + cost, 0)
    return { date: day.TimePeriod!.Start, total, byService }
  })
  if (dailyCosts.length < 2) return // Not enough data to compare

  // Calculate 7-day average and compare to yesterday
  const last7 = dailyCosts.slice(0, -1)
  const yesterday = dailyCosts[dailyCosts.length - 1]
  const avg7day = last7.reduce((sum, d) => sum + d.total, 0) / last7.length
  const anomalyThreshold = avg7day * 1.5 // 50% above average

  if (yesterday.total > anomalyThreshold) {
    await alerting.critical(
      `Cost anomaly: yesterday was $${yesterday.total.toFixed(2)} ` +
      `vs 7-day avg $${avg7day.toFixed(2)}. Investigate!`
    )

    // Find which service spiked: more than 2x its own baseline average
    const avgByService: Record<string, number> = {}
    last7.forEach(day => {
      Object.entries(day.byService).forEach(([svc, cost]) => {
        avgByService[svc] = (avgByService[svc] ?? 0) + cost / last7.length
      })
    })
    const spikes = Object.entries(yesterday.byService)
      .filter(([svc, cost]) => cost > (avgByService[svc] ?? 0) * 2)
      .sort(([, a], [, b]) => b - a)
    if (spikes.length > 0) {
      logger.warn({ spikes }, 'Services with cost spikes')
    }
  }
}
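The comparison at the heart of the check above can be pulled out into a pure function, which makes it easy to test without calling AWS. This is a minimal sketch; `detectAnomaly`, its return shape, and the sample numbers are illustrative and not part of any AWS SDK.

```typescript
// Pure anomaly check: compares the latest day against the mean of the preceding days.
// `detectAnomaly` and `AnomalyResult` are illustrative names, not an AWS API.
interface AnomalyResult {
  baseline: number
  latest: number
  isAnomaly: boolean
}

function detectAnomaly(dailyTotals: number[], factor = 1.5): AnomalyResult {
  if (dailyTotals.length < 2) {
    throw new Error('Need at least one baseline day plus the day being checked')
  }
  const history = dailyTotals.slice(0, -1)
  const latest = dailyTotals[dailyTotals.length - 1]
  const baseline = history.reduce((sum, v) => sum + v, 0) / history.length
  return { baseline, latest, isAnomaly: latest > baseline * factor }
}

// Seven baseline days averaging $100, followed by the day under test
const quietDay = detectAnomaly([100, 110, 90, 105, 95, 100, 100, 120])
const spikeDay = detectAnomaly([100, 110, 90, 105, 95, 100, 100, 400])

console.log(quietDay.isAnomaly, spikeDay.isAnomaly) // false true
```

A fixed multiplier is crude: workloads with strong weekday/weekend patterns may need a same-weekday baseline or a higher factor to avoid false alarms.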
Fix 3: Resource Tagging for Cost Attribution
// Every resource must have environment, team, and service tags
// This lets you answer "which team is spending $20k/month?"
// terraform/tagging.tf
locals {
  common_tags = {
    Environment = var.environment # "production", "staging"
    Team        = var.team        # "platform", "payments", "growth"
    Service     = var.service     # "api", "worker", "ml-inference"
    ManagedBy   = "terraform"
    CostCenter  = var.cost_center # For financial attribution
  }
}

resource "aws_instance" "api" {
  # ...
  tags = merge(local.common_tags, {
    Name = "${var.service}-api"
  })
}
// AWS Cost Explorer grouped by team tag:
// Team=platform: $8,000/month
// Team=payments: $3,500/month
// Team=growth: $1,200/month
// Untagged: $4,800/month ← investigate these
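The per-team breakdown above amounts to a group-by on the `Team` tag, with anything missing the tag collected into an "Untagged" bucket. A minimal sketch, assuming you already have per-resource cost rows from a billing export; the `CostRow` shape and the sample data are hypothetical.

```typescript
// Hypothetical per-resource cost row, e.g. derived from a billing export
interface CostRow {
  resourceId: string
  monthlyCost: number
  tags: Record<string, string>
}

// Sum monthly cost per Team tag; resources without the tag land in "Untagged"
function costByTeam(rows: CostRow[]): Record<string, number> {
  const totals: Record<string, number> = {}
  for (const row of rows) {
    const team = row.tags['Team'] ?? 'Untagged'
    totals[team] = (totals[team] ?? 0) + row.monthlyCost
  }
  return totals
}

const sampleRows: CostRow[] = [
  { resourceId: 'i-aaa', monthlyCost: 300, tags: { Team: 'platform', Service: 'api' } },
  { resourceId: 'i-bbb', monthlyCost: 120, tags: { Team: 'payments', Service: 'worker' } },
  { resourceId: 'vol-ccc', monthlyCost: 80, tags: {} },
  { resourceId: 'i-ddd', monthlyCost: 200, tags: { Team: 'platform', Service: 'worker' } },
]

const teamTotals = costByTeam(sampleRows)
for (const team of Object.keys(teamTotals)) {
  console.log(`Team=${team}: $${teamTotals[team]}/month`)
}
```

Tracking the size of the "Untagged" bucket over time is a simple measure of how well the tagging policy is being followed.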
Fix 4: Automatic Cleanup of Orphaned Resources
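// orphan-cleaner.ts: runs weekly and flags unused resources for review (it does not delete anything)
// Uses the AWS SDK for JavaScript v2 (`aws-sdk`), which provides the `.promise()` calls below
import { EC2, ELBv2 } from 'aws-sdk'
// `alerting` is this app's own notification helper (assumed to exist; placeholder path)
import { alerting } from './alerting'

// Region and credentials are taken from the environment
const ec2 = new EC2()
const elbv2 = new ELBv2()

async function findOrphanedResources() {
  const orphans: string[] = []

  // Unattached EBS volumes (only the first page of results is checked here)
  const volumes = await ec2.describeVolumes({
    Filters: [{ Name: 'status', Values: ['available'] }], // Not attached
  }).promise()
  for (const vol of volumes.Volumes ?? []) {
    // CreateTime is when the volume was created, not when it was detached,
    // so this is the volume's age rather than its time spent unattached
    const ageHours = (Date.now() - new Date(vol.CreateTime!).getTime()) / 3600000
    if (ageHours > 24) {
      orphans.push(`EBS volume ${vol.VolumeId}: ${vol.Size}GB, unattached, created ${ageHours.toFixed(0)}h ago`)
    }
  }

  // Load balancers whose target groups have no registered targets
  const lbs = await elbv2.describeLoadBalancers({}).promise()
  for (const lb of lbs.LoadBalancers ?? []) {
    const targetGroups = await elbv2.describeTargetGroups({
      LoadBalancerArn: lb.LoadBalancerArn,
    }).promise()
    for (const tg of targetGroups.TargetGroups ?? []) {
      const health = await elbv2.describeTargetHealth({
        TargetGroupArn: tg.TargetGroupArn!,
      }).promise()
      // An empty list means nothing is registered at all, healthy or not
      if (health.TargetHealthDescriptions?.length === 0) {
        orphans.push(`Load Balancer ${lb.LoadBalancerName}: target group ${tg.TargetGroupName} has no registered targets`)
      }
    }
  }

  if (orphans.length > 0) {
    await alerting.warn(`Orphaned resources found:\n${orphans.join('\n')}`)
  }
  return orphans
}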
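An orphan report is easier to prioritize when each item carries a price tag. The sketch below estimates the monthly cost of unattached volumes and old snapshots using the per-GB prices quoted earlier in this post; the `OrphanedStorage` type, the function name, and the sample IDs are hypothetical, and real prices vary by region and volume type.

```typescript
// Per-GB prices quoted earlier in this post; adjust for your region and volume type
const EBS_PRICE_PER_GB_MONTH = 0.10
const SNAPSHOT_PRICE_PER_GB_MONTH = 0.05

// Hypothetical shape for items found by an orphan scan
interface OrphanedStorage {
  id: string
  kind: 'volume' | 'snapshot'
  sizeGb: number
}

// Snapshots are stored incrementally, so using the full size gives an upper bound
function estimateMonthlyWaste(items: OrphanedStorage[]): number {
  return items.reduce((sum, item) => {
    const price = item.kind === 'volume' ? EBS_PRICE_PER_GB_MONTH : SNAPSHOT_PRICE_PER_GB_MONTH
    return sum + item.sizeGb * price
  }, 0)
}

const found: OrphanedStorage[] = [
  { id: 'vol-0001', kind: 'volume', sizeGb: 500 },
  { id: 'vol-0002', kind: 'volume', sizeGb: 100 },
  { id: 'snap-0001', kind: 'snapshot', sizeGb: 2000 },
]

const waste = estimateMonthlyWaste(found)
console.log(`Estimated waste: $${waste.toFixed(2)}/month`)
```

Including a figure like this in the weekly alert makes it clearer whether the cleanup is worth someone's afternoon.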
Fix 5: S3 and Data Transfer Optimization
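# S3 lifecycle policies to prevent infinite growth
# Apply to log buckets, temp files, old exports
# This rule moves objects to STANDARD_IA (cheaper for infrequent access) after 30 days,
# then to GLACIER (archival) after 90 days, and deletes them after 1 year.
# JSON does not allow comments, so the notes live here rather than inside the document.
# The empty-prefix Filter applies the rule to every object in the bucket.
aws s3api put-bucket-lifecycle-configuration \
  --bucket myapp-logs \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "log-retention",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Transitions": [
          {
            "Days": 30,
            "StorageClass": "STANDARD_IA"
          },
          {
            "Days": 90,
            "StorageClass": "GLACIER"
          }
        ],
        "Expiration": {
          "Days": 365
        }
      }
    ]
  }'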
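# Reduce data transfer: use CloudFront in front of S3 and EC2
# Data transfer CloudFront→Internet: about $0.085/GB (US/Europe, first 10 TB; check current pricing)
# Data transfer EC2→Internet: $0.09/GB
# The per-GB list price is similar, so the savings come from elsewhere:
#   - transfer from S3 or EC2 origins to CloudFront is not charged
#   - cache hits never reach your origin, so app servers handle less traffic
#   - CloudFront's free tier covers the first 1 TB/month of transfer out (at the time of writing)
# Also: serve static assets from CDN, not your app servers
# Even for a small app: 100GB/month of images
# Without CDN: $9/month at list price
# With CDN: $8.50/month at list price, and inside the free tier at this volume, plus cleaner separation of concerns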
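To see what the log-retention lifecycle rule above does to an individual object, the sketch below maps an object's age to the storage class the rule puts it in. `stateAtAge` is a hypothetical helper that mirrors the 30/90/365-day thresholds from the rule; S3 applies lifecycle actions asynchronously, so real transitions can lag the nominal day.

```typescript
type StorageState = 'STANDARD' | 'STANDARD_IA' | 'GLACIER' | 'EXPIRED'

// Mirrors the rule above: 30 days -> STANDARD_IA, 90 days -> GLACIER, 365 days -> deleted
function stateAtAge(ageDays: number): StorageState {
  if (ageDays >= 365) return 'EXPIRED'
  if (ageDays >= 90) return 'GLACIER'
  if (ageDays >= 30) return 'STANDARD_IA'
  return 'STANDARD'
}

for (const age of [0, 29, 30, 89, 90, 364, 365]) {
  console.log(`${age} days: ${stateAtAge(age)}`)
}
```

Running your bucket's object-age distribution through a function like this gives a quick estimate of how much data each storage class will hold once the rule has been in place for a year.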
Cost Control Checklist
- ✅ Budget alerts set at 50%, 80%, 100% of expected monthly spend
- ✅ Cost anomaly detection alert fires when daily cost is 50%+ above 7-day average
- ✅ Every resource tagged with environment, team, and service
- ✅ Weekly orphan resource scan — unattached volumes, empty load balancers, old snapshots
- ✅ S3 lifecycle policies on all log and temp buckets
- ✅ CloudFront CDN in front of all user-facing content
- ✅ Monthly cost review with team breakdown by tag
- ✅ RDS storage monitoring — prevent runaway autoscaling
Conclusion
Cloud cost explosions are a visibility problem. The bill is the last place you should learn about cost growth. The first place should be a daily anomaly alert that fires when yesterday's cost is 50% above last week's average. Then comes attribution: every resource tagged so you can answer "which team and service is driving this growth?" The cleanup mechanisms — lifecycle policies, orphan scanners, auto-scaling reviews — prevent the gradual accumulation of forgotten resources. Budget alerts buy you time; tagging tells you where to look; cleanup keeps the baseline honest.