Cost-Aware Architecture — Engineering for Economics From Day One
Introduction
AI companies die from cost, not from bugs. A single runaway feature can drain your runway in days. Cost visibility must be as first-class as performance or reliability. This post covers engineering practices that keep costs manageable and predictable.
- Cost Visibility as a First-Class Concern
- Tagging Resources by Team, Feature, and Customer
- Per-Request Cost Calculation
- Cost Limits and Circuit Breakers
- Async Offloading for Expensive Operations
- Caching ROI Calculation
- Right-Sizing Based on Actual Metrics
- Spot Instances for Batch AI Workloads
- S3 Intelligent Tiering
- Cost Anomaly Detection and Alerting
- Unit Economics Dashboard
- Checklist
- Conclusion
Cost Visibility as a First-Class Concern
You wouldn't ship code without monitoring performance. Don't ship AI features without cost monitoring.
// Database connection: instrument every query with its estimated cost
// (pg's Pool has no per-query hook, so wrap pool.query yourself)
const pool = new Pool({
host: 'localhost',
port: 5432
});
const rawQuery = pool.query.bind(pool);
pool.query = async (query, params) => {
const cost = estimateQueryCost(query);
metrics.histogram('db.query_cost_usd', cost, {
query: query.name || 'unknown'
});
return rawQuery(query, params);
};
// HTTP middleware: log cost per request
app.use(async (req, res, next) => {
const startCost = await estimateTotalCost();
const startTime = Date.now();
res.on('finish', async () => {
const endCost = await estimateTotalCost();
const requestCost = endCost - startCost;
const duration = Date.now() - startTime;
logger.info('Request completed', {
method: req.method,
path: req.path,
statusCode: res.statusCode,
durationMs: duration,
costUsd: requestCost,
userId: req.user?.id,
costPerSecond: requestCost / (duration / 1000)
});
metrics.histogram('request.cost_usd', requestCost, {
method: req.method,
path: req.path
});
});
next();
});
Every request should be tagged with its cost. This is not optional.
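The instrumentation above leans on an `estimateQueryCost` helper the post doesn't define. A minimal sketch, assuming a simple reads-plus-bytes-scanned pricing model; the constants and the `estimatedRows`/`estimatedBytes` fields are illustrative, not real provider rates:

```javascript
// Illustrative pricing constants; substitute your provider's real rates
const COST_PER_MILLION_READS_USD = 0.25;
const COST_PER_GB_SCANNED_USD = 5.0;

function estimateQueryCost(query) {
  // Rough model: cost scales with rows read plus bytes scanned
  const rows = query.estimatedRows || 0;
  const bytes = query.estimatedBytes || 0;
  const readCost = (rows / 1e6) * COST_PER_MILLION_READS_USD;
  const scanCost = (bytes / 1024 ** 3) * COST_PER_GB_SCANNED_USD;
  return readCost + scanCost;
}
```

Even a crude model like this is enough to rank queries by spend and catch the outliers.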
Tagging Resources by Team, Feature, and Customer
Your cloud bill is $50K/month. Where does it go? Without tags, you can't answer this.
AWS example:
// When creating resources, tag everything
const s3 = new AWS.S3();
const params = {
Bucket: 'my-bucket',
Key: 'data.json',
Body: JSON.stringify(data),
// Use object tags, not user metadata: tags are what cost tooling can
// filter on (bucket-level cost allocation tags drive the bill breakdown)
Tagging: `team=ai-platform&feature=semantic-search&customer=${req.user.tenantId}&env=${process.env.NODE_ENV}`
};
await s3.putObject(params).promise();
# Then break the bill down by tag with the Cost Explorer CLI
aws ce get-cost-and-usage \
--time-period Start=2026-03-01,End=2026-03-31 \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=TAG,Key=team \
--filter file://filter.json
Tag at the resource level, too:
// EC2 instances
const ec2 = new AWS.EC2();
const params = {
ImageId: 'ami-12345678',
MinCount: 1,
MaxCount: 1,
TagSpecifications: [{
ResourceType: 'instance',
Tags: [
{ Key: 'team', Value: 'ai-platform' },
{ Key: 'feature', Value: 'embeddings-generation' },
{ Key: 'tenant', Value: req.user.tenantId }
]
}]
};
await ec2.runInstances(params).promise();
Spend an hour tagging. Save thousands in misallocated costs.
Per-Request Cost Calculation
Calculate cost inline. Don't estimate. Measure.
async function calculateRequestCost(req) {
// Declare each component up front so the breakdown below can see them
let llmCost = 0;
let dbCost = 0;
let computeCost = 0;
let networkCost = 0;
// LLM tokens
if (req.llmTokens) {
llmCost = (req.llmTokens / 1000) * 0.01; // adjust to your model's pricing
}
// Database queries (example: $0.15 per million requests on DynamoDB)
if (req.dbQueries) {
dbCost = (req.dbQueries / 1000000) * 0.15;
}
// Compute time
if (req.cpuSeconds) {
computeCost = (req.cpuSeconds / 3600) * 0.05; // $0.05 per CPU-hour
}
// Network egress
if (req.egressGb) {
networkCost = req.egressGb * 0.09; // $0.09 per GB
}
return {
total: llmCost + dbCost + computeCost + networkCost,
breakdown: {
llm: llmCost,
database: dbCost,
compute: computeCost,
network: networkCost
}
};
}
// Record cost after every request
app.use(async (req, res, next) => {
// Record cost once the response has finished, after the request
// has actually accumulated its LLM, DB, and compute usage
res.on('finish', async () => {
const cost = await calculateRequestCost(req);
await db.costs.insert({
userId: req.user.id,
tenantId: req.user.tenantId,
feature: req.feature,
costUsd: cost.total,
breakdown: cost.breakdown,
timestamp: new Date()
});
});
next();
});
Without per-request costs, you'll discover expensive features only after they've cost thousands.
Cost Limits and Circuit Breakers
Set hard limits. Fail gracefully when limits are hit.
class CostCircuitBreaker {
async checkLimit(userId, tenantId) {
const limits = await this.getLimits(userId);
const totalCost = await db.costs.sumByUser(userId, this.month());
if (totalCost >= limits.hardLimit) {
return {
allowed: false,
reason: 'Hard cost limit exceeded',
limit: limits.hardLimit,
spent: totalCost
};
}
if (totalCost >= limits.warningThreshold) {
await sendWarningEmail(userId, {
spent: totalCost,
limit: limits.hardLimit,
remaining: limits.hardLimit - totalCost
});
}
return { allowed: true };
}
}
// Middleware
app.use(async (req, res, next) => {
const breaker = new CostCircuitBreaker();
const check = await breaker.checkLimit(req.user.id, req.user.tenantId);
if (!check.allowed) {
return res.status(429).json({
error: 'Cost limit exceeded',
message: check.reason,
limit: check.limit,
spent: check.spent
});
}
next();
});
Hard limits prevent surprises. Soft limits (warnings) encourage responsible usage.
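The breaker above assumes `getLimits()` and `month()` helpers that the post leaves undefined. One hedged sketch; the default limit values are made-up examples:

```javascript
// Example default limits in USD per calendar month (values are assumptions)
const DEFAULT_LIMITS = { hardLimit: 100, warningThreshold: 80 };

// Bucket spend by calendar month, e.g. '2026-03'
function monthKey(date = new Date()) {
  return date.toISOString().slice(0, 7);
}

// Per-user overrides (e.g. enterprise plans) fall back to the defaults
async function getLimits(userId, overrides = new Map()) {
  return overrides.get(userId) || DEFAULT_LIMITS;
}
```

Keeping limits in one lookup keeps the middleware cheap: one sum query plus one map read per request.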
Async Offloading for Expensive Operations
Never block a user request on expensive work. Offload to background jobs.
// Bad: user waits for expensive embedding
app.post('/api/process', async (req, res) => {
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-large',
input: req.body.text
});
res.json({ embedding });
});
// Good: async job
app.post('/api/process', async (req, res) => {
const jobId = uuidv4();
await queue.enqueue({
type: 'embed',
jobId,
text: req.body.text
});
res.json({ jobId });
});
// Worker process
worker.on('embed', async (job) => {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-large',
input: job.text
});
await db.embeddings.update(job.jobId, { embedding: data[0].embedding });
});
Offloading decouples cost from user latency. Batch expensive jobs and run them during off-peak hours when compute is cheaper.
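One way to exploit that decoupling is to batch queued jobs before each API call, since embedding endpoints generally accept an array of inputs. A sketch; `embedAll` is a stand-in for whatever client call you use:

```javascript
// Split an array of jobs into fixed-size batches
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch amortizes per-request overhead
async function processEmbedJobs(jobs, embedAll, batchSize = 100) {
  const results = [];
  for (const batch of chunk(jobs, batchSize)) {
    const embeddings = await embedAll(batch.map((j) => j.text));
    batch.forEach((job, i) => {
      results.push({ jobId: job.jobId, embedding: embeddings[i] });
    });
  }
  return results;
}
```

Token pricing is usually the same batched or not, but batching cuts request overhead and makes nightly off-peak runs practical.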
Caching ROI Calculation
Caching saves money, but caching infrastructure costs money too. Calculate ROI.
class CachingROI {
constructor() {
this.cacheSize = 0; // bytes
this.cacheCostPerMonth = 50; // Redis instance cost
}
async shouldCache(key, value) {
const hitProbability = await this.estimateHitRate(key);
const valueCost = await this.estimateValueCost(value);
const storageCost = (value.length / 1024 / 1024) * 0.03; // $0.03 per MB per month
// Expected monthly savings if cached (assumes roughly one lookup per day)
const monthlySavings = (valueCost * hitProbability * 30) - storageCost;
return monthlySavings > 0;
}
async estimateValueCost(value) {
// Cost of generating this value (embedding, LLM call, etc.)
if (value.isEmbedding) {
return (value.tokens / 1000) * 0.00013; // text-embedding-3-large, ~$0.13 per 1M tokens; check current pricing
}
// ... other value types
}
async estimateHitRate(key) {
// Query historical data
const pattern = this.extractPattern(key);
const similar = await db.cache.countSimilar(pattern);
return Math.min(similar / 1000, 0.9);
}
}
// Use ROI calculator
const roi = new CachingROI();
if (await roi.shouldCache(key, value)) {
await cache.set(key, value);
}
Cache only high-ROI items. This prevents memory bloat and keeps infrastructure costs low.
Right-Sizing Based on Actual Metrics
Don't guess capacity. Measure.
// Collect actual usage patterns
const metrics = {
peakQPS: 0,
p95Latency: 0,
cpuUsage: 0,
memoryUsage: 0
};
setInterval(async () => {
const current = await getCurrentMetrics();
metrics.peakQPS = Math.max(metrics.peakQPS, current.qps);
metrics.p95Latency = await percentile(current.latencies, 0.95);
metrics.cpuUsage = current.cpu;
metrics.memoryUsage = current.memory;
// Recommend right-sized instance
const recommendation = recommendInstance(metrics);
logger.info('Sizing recommendation', recommendation);
}, 60000);
function recommendInstance(metrics) {
// If we're using < 20% of instance capacity, downsize
if (metrics.cpuUsage < 0.2 && metrics.memoryUsage < 0.2) {
return { action: 'downsize', savings: 600 }; // Save $600/month
}
// If we're at 90%+ usage, upsize to avoid throttling
if (metrics.cpuUsage > 0.9 || metrics.memoryUsage > 0.9) {
return { action: 'upsize', impact: 'reliability' };
}
return { action: 'none' };
}
Review sizing monthly. Under-utilized instances leak money. Over-utilized instances cause outages.
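The metrics loop above calls a `percentile()` helper it never defines. A minimal nearest-rank sketch:

```javascript
// Nearest-rank percentile: p in [0, 1], e.g. 0.95 for p95 latency
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}
```

Nearest-rank is crude but monotone and cheap, which is all a monthly sizing review needs.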
Spot Instances for Batch AI Workloads
Spot instances cost < 1/3 of on-demand. Use them for batch jobs.
// On-demand: reliable, roughly $1/hour
// Spot: cheap, can be interrupted, roughly $0.30/hour
const params = {
ImageId: 'ami-12345678',
InstanceType: 't3.large',
MinCount: 1,
MaxCount: 1,
InstanceMarketOptions: {
MarketType: 'spot',
SpotOptions: {
MaxPrice: '0.40', // max willing to pay per hour
SpotInstanceType: 'one-time', // 'persistent' requires stop/hibernate on interruption
InstanceInterruptionBehavior: 'terminate'
}
}
};
await ec2.runInstances(params).promise();
// For batch jobs that can restart: use spot
// For always-on services: use on-demand
// Example: nightly embedding regeneration
const job = {
type: 'regenerate_embeddings',
instanceType: 'spot', // Can be interrupted
maxPrice: 0.40,
retries: 3 // Will retry on interruption
};
Spot instances reduce compute costs by 60-70%. Reserve them for fault-tolerant batch jobs.
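Making that retry work means noticing the interruption. EC2 posts a two-minute warning to the instance metadata service at `/latest/meta-data/spot/instance-action` (404 until a notice exists; this sketch assumes IMDSv1 is enabled, IMDSv2 additionally needs a session token). The `fetchFn` injection exists purely for testability:

```javascript
const SPOT_ACTION_URL =
  'http://169.254.169.254/latest/meta-data/spot/instance-action';

// Returns the interruption notice object, or null if none is pending
async function checkInterruption(fetchFn = fetch) {
  try {
    const res = await fetchFn(SPOT_ACTION_URL);
    if (!res.ok) return null; // 404 until EC2 issues the two-minute warning
    return await res.json();
  } catch {
    return null; // metadata service unreachable outside EC2
  }
}

// Poll every few seconds; checkpoint the batch job before termination
function watchForInterruption(checkpoint, intervalMs = 5000) {
  const timer = setInterval(async () => {
    const notice = await checkInterruption();
    if (notice) {
      clearInterval(timer);
      await checkpoint(notice);
    }
  }, intervalMs);
  return timer;
}
```

With checkpointing in place, an interruption costs you at most one batch of rework instead of the whole job.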
S3 Intelligent Tiering
Store data intelligently. Hot data in standard tier, cold data in cheaper tiers.
// Automatically tiered based on access patterns
const s3 = new AWS.S3();
const params = {
Bucket: 'my-bucket',
Key: 'data.json',
Body: JSON.stringify(data),
StorageClass: 'INTELLIGENT_TIERING'
};
await s3.putObject(params).promise();
// S3 automatically moves data:
// - Frequent access tier: $0.023 per GB
// - Infrequent access tier: $0.0125 per GB (30 days)
// - Archive access tier: $0.004 per GB (90 days)
// - Deep archive: $0.00099 per GB (180 days)
Enable intelligent tiering on all S3 buckets. It auto-tiers based on access patterns and saves 30-50% on storage costs.
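The `StorageClass` above only affects new writes. Existing objects need a lifecycle rule; a sketch that builds the parameters for `s3.putBucketLifecycleConfiguration` (the rule ID is an arbitrary example):

```javascript
// Build a lifecycle rule that moves every object to Intelligent-Tiering
function intelligentTieringLifecycle(bucket) {
  return {
    Bucket: bucket,
    LifecycleConfiguration: {
      Rules: [{
        ID: 'tier-everything',
        Status: 'Enabled',
        Filter: {}, // empty filter applies the rule to the whole bucket
        Transitions: [{ Days: 0, StorageClass: 'INTELLIGENT_TIERING' }]
      }]
    }
  };
}
```

Pass the result to `s3.putBucketLifecycleConfiguration(...).promise()` once per bucket and S3 handles the migration.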
Cost Anomaly Detection and Alerting
Alert when costs spike. Don't discover problems in hindsight.
class CostAnomalyDetector {
async detectAnomalies() {
// Get daily costs for last 30 days
const costs = await db.costs.dailySum(last30Days());
// Calculate the mean and standard deviation
// (named avg to avoid shadowing the mean() helper)
const avg = mean(costs);
const stdDev = standardDeviation(costs);
const today = costs[costs.length - 1];
// Flag if today is > 2 standard deviations above the mean
if (today > avg + 2 * stdDev) {
await sendAlert({
type: 'COST_ANOMALY',
todaysCost: today,
expectedCost: avg,
deviationPct: Math.round(((today - avg) / avg) * 100)
});
}
}
}
// Run daily
scheduler.every('1 day', async () => {
const detector = new CostAnomalyDetector();
await detector.detectAnomalies();
});
Anomaly detection catches runaway costs early. A cost spike detected after 1 day instead of 30 days saves thousands.
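The detector assumes `mean()` and `standardDeviation()` helpers. A minimal sketch, using the population standard deviation, which is fine for a fixed 30-day window:

```javascript
// Arithmetic mean of an array of numbers
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divide by N, not N - 1)
function standardDeviation(values) {
  const avg = mean(values);
  const variance =
    values.reduce((sum, v) => sum + (v - avg) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```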
Unit Economics Dashboard
Track unit economics at the business level.
// Dashboard query
async function unitEconomics() {
const revenue = await db.revenue.sumByMonth(currentMonth());
const costs = await db.costs.sumByMonth(currentMonth());
const activeUsers = await db.users.countActive(currentMonth());
return {
revenue,
costs,
grossMargin: (revenue - costs) / revenue,
costPerActiveUser: costs / activeUsers,
revenuePerActiveUser: revenue / activeUsers,
profitPerActiveUser: (revenue - costs) / activeUsers
};
}
Share this dashboard with non-technical stakeholders. It's the single source of truth for business health.
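The same roll-up is worth repeating per feature, so a money-losing feature surfaces before it drags down the blended margin. A hedged sketch over hypothetical rows of `{ feature, revenue, cost }`:

```javascript
// Rank features by gross margin, worst first
function featureMargins(rows) {
  return rows
    .map(({ feature, revenue, cost }) => ({
      feature,
      marginUsd: revenue - cost,
      grossMargin: revenue > 0 ? (revenue - cost) / revenue : 0
    }))
    .sort((a, b) => a.grossMargin - b.grossMargin);
}
```

The top of that list is where your cost engineering effort pays off first.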
Checklist
- Log per-request costs to a cost tracking table
- Tag all resources (AWS, GCP, Azure) by team, feature, customer
- Calculate embedding, LLM, and database costs inline
- Set hard and soft cost limits per user/tenant
- Offload expensive operations to async jobs
- Calculate caching ROI before caching
- Review capacity monthly; downsize under-utilized instances
- Use spot instances for batch AI workloads (70% savings)
- Enable S3 intelligent tiering (30-50% savings)
- Set up cost anomaly detection and daily alerts
- Build unit economics dashboard; review weekly
Conclusion
Cost engineering is as important as performance engineering. Build visibility at request granularity, set limits, right-size infrastructure, and use cheaper compute where possible. The difference between a sustainable AI product and a money-losing one is engineering discipline around costs.