Cost-Aware Architecture — Engineering for Economics From Day One
Introduction
AI companies die from cost, not from bugs. A single runaway feature can drain your runway in days. Cost visibility must be as first-class as performance or reliability. This post covers engineering practices that keep costs manageable and predictable.
- Cost Visibility as a First-Class Concern
- Tagging Resources by Team, Feature, and Customer
- Per-Request Cost Calculation
- Cost Limits and Circuit Breakers
- Async Offloading for Expensive Operations
- Caching ROI Calculation
- Right-Sizing Based on Actual Metrics
- Spot Instances for Batch AI Workloads
- S3 Intelligent Tiering
- Cost Anomaly Detection and Alerting
- Unit Economics Dashboard
- Checklist
- Conclusion
Cost Visibility as a First-Class Concern
You wouldn't ship code without monitoring performance. Don't ship AI features without cost monitoring.
// Database connection: instrument every query with its estimated cost
// (pg's Pool has no per-query hook, so wrap pool.query yourself)
const pool = new Pool({
host: 'localhost',
port: 5432
});
const rawQuery = pool.query.bind(pool);
pool.query = async (query, params) => {
const cost = estimateQueryCost(query);
metrics.histogram('db.query_cost_usd', cost, {
query: query.name || 'unknown'
});
return rawQuery(query, params);
};
// HTTP middleware: log cost per request
app.use(async (req, res, next) => {
const startCost = await estimateTotalCost();
const startTime = Date.now();
res.on('finish', async () => {
const endCost = await estimateTotalCost();
const requestCost = endCost - startCost;
const duration = Date.now() - startTime;
logger.info('Request completed', {
method: req.method,
path: req.path,
statusCode: res.statusCode,
durationMs: duration,
costUsd: requestCost,
userId: req.user?.id,
costPerSecond: requestCost / (duration / 1000)
});
metrics.histogram('request.cost_usd', requestCost, {
method: req.method,
path: req.path
});
});
next();
});
Every request should be tagged with its cost. This is not optional.
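The instrumentation above leans on an `estimateQueryCost` helper the post doesn't define. A minimal sketch, assuming a simple reads-plus-bytes-scanned pricing model; the constants and the `estimatedRows`/`estimatedBytes` fields are illustrative, not real provider rates:

```javascript
// Illustrative pricing constants; substitute your provider's real rates
const COST_PER_MILLION_READS_USD = 0.25;
const COST_PER_GB_SCANNED_USD = 5.0;

function estimateQueryCost(query) {
  // Rough model: cost scales with rows read plus bytes scanned
  const rows = query.estimatedRows || 0;
  const bytes = query.estimatedBytes || 0;
  const readCost = (rows / 1e6) * COST_PER_MILLION_READS_USD;
  const scanCost = (bytes / 1024 ** 3) * COST_PER_GB_SCANNED_USD;
  return readCost + scanCost;
}
```

Even a crude model like this is enough to rank queries by spend and catch the outliers.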
Tagging Resources by Team, Feature, and Customer
Your cloud bill is $50K/month. Where does it go? Without tags, you can't answer this.
AWS example:
// When creating resources, tag everything
const s3 = new AWS.S3();
const params = {
Bucket: 'my-bucket',
Key: 'data.json',
Body: JSON.stringify(data),
// Use object tags, not user metadata: tags are what cost tooling can
// filter on (bucket-level cost allocation tags drive the bill breakdown)
Tagging: `team=ai-platform&feature=semantic-search&customer=${req.user.tenantId}&env=${process.env.NODE_ENV}`
};
await s3.putObject(params).promise();
# Then break the bill down by tag with the Cost Explorer CLI
aws ce get-cost-and-usage \
--time-period Start=2026-03-01,End=2026-03-31 \
--granularity MONTHLY \
--metrics UnblendedCost \
--group-by Type=TAG,Key=team \
--filter file://filter.json
Tag at the resource level, too:
// EC2 instances
const ec2 = new AWS.EC2();
const params = {
ImageId: 'ami-12345678',
MinCount: 1,
MaxCount: 1,
TagSpecifications: [{
ResourceType: 'instance',
Tags: [
{ Key: 'team', Value: 'ai-platform' },
{ Key: 'feature', Value: 'embeddings-generation' },
{ Key: 'tenant', Value: req.user.tenantId }
]
}]
};
await ec2.runInstances(params).promise();
Spend an hour tagging. Save thousands in misallocated costs.
Per-Request Cost Calculation
Calculate cost inline. Don't estimate. Measure.
async function calculateRequestCost(req) {
// Declare each component up front so the breakdown below can see them
let llmCost = 0;
let dbCost = 0;
let computeCost = 0;
let networkCost = 0;
// LLM tokens
if (req.llmTokens) {
llmCost = (req.llmTokens / 1000) * 0.01; // adjust to your model's pricing
}
// Database queries (example: $0.15 per million requests on DynamoDB)
if (req.dbQueries) {
dbCost = (req.dbQueries / 1000000) * 0.15;
}
// Compute time
if (req.cpuSeconds) {
computeCost = (req.cpuSeconds / 3600) * 0.05; // $0.05 per CPU-hour
}
// Network egress
if (req.egressGb) {
networkCost = req.egressGb * 0.09; // $0.09 per GB
}
return {
total: llmCost + dbCost + computeCost + networkCost,
breakdown: {
llm: llmCost,
database: dbCost,
compute: computeCost,
network: networkCost
}
};
}
// Record cost after every request
app.use(async (req, res, next) => {
// Record cost once the response has finished, after the request
// has actually accumulated its LLM, DB, and compute usage
res.on('finish', async () => {
const cost = await calculateRequestCost(req);
await db.costs.insert({
userId: req.user.id,
tenantId: req.user.tenantId,
feature: req.feature,
costUsd: cost.total,
breakdown: cost.breakdown,
timestamp: new Date()
});
});
next();
});
Without per-request costs, you'll discover expensive features only after they've cost thousands.
Cost Limits and Circuit Breakers
Set hard limits. Fail gracefully when limits are hit.
class CostCircuitBreaker {
async checkLimit(userId, tenantId) {
const limits = await this.getLimits(userId);
const totalCost = await db.costs.sumByUser(userId, this.month());
if (totalCost >= limits.hardLimit) {
return {
allowed: false,
reason: 'Hard cost limit exceeded',
limit: limits.hardLimit,
spent: totalCost
};
}
if (totalCost >= limits.warningThreshold) {
await sendWarningEmail(userId, {
spent: totalCost,
limit: limits.hardLimit,
remaining: limits.hardLimit - totalCost
});
}
return { allowed: true };
}
}
// Middleware
app.use(async (req, res, next) => {
const breaker = new CostCircuitBreaker();
const check = await breaker.checkLimit(req.user.id, req.user.tenantId);
if (!check.allowed) {
return res.status(429).json({
error: 'Cost limit exceeded',
message: check.reason,
limit: check.limit,
spent: check.spent
});
}
next();
});
Hard limits prevent surprises. Soft limits (warnings) encourage responsible usage.
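The breaker above assumes `getLimits()` and `month()` helpers that the post leaves undefined. One hedged sketch; the default limit values are made-up examples:

```javascript
// Example default limits in USD per calendar month (values are assumptions)
const DEFAULT_LIMITS = { hardLimit: 100, warningThreshold: 80 };

// Bucket spend by calendar month, e.g. '2026-03'
function monthKey(date = new Date()) {
  return date.toISOString().slice(0, 7);
}

// Per-user overrides (e.g. enterprise plans) fall back to the defaults
async function getLimits(userId, overrides = new Map()) {
  return overrides.get(userId) || DEFAULT_LIMITS;
}
```

Keeping limits in one lookup keeps the middleware cheap: one sum query plus one map read per request.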
Async Offloading for Expensive Operations
Never block a user request on expensive work. Offload to background jobs.
// Bad: user waits for expensive embedding
app.post('/api/process', async (req, res) => {
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-large',
input: req.body.text
});
res.json({ embedding });
});
// Good: async job
app.post('/api/process', async (req, res) => {
const jobId = uuidv4();
await queue.enqueue({
type: 'embed',
jobId,
text: req.body.text
});
res.json({ jobId });
});
// Worker process
worker.on('embed', async (job) => {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-large',
input: job.text
});
await db.embeddings.update(job.jobId, { embedding: data[0].embedding });
});
Offloading decouples cost from user latency. Batch expensive jobs and run them during off-peak hours when compute is cheaper.
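One way to exploit that decoupling is to batch queued jobs before each API call, since embedding endpoints generally accept an array of inputs. A sketch; `embedAll` is a stand-in for whatever client call you use:

```javascript
// Split an array of jobs into fixed-size batches
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// One API call per batch amortizes per-request overhead
async function processEmbedJobs(jobs, embedAll, batchSize = 100) {
  const results = [];
  for (const batch of chunk(jobs, batchSize)) {
    const embeddings = await embedAll(batch.map((j) => j.text));
    batch.forEach((job, i) => {
      results.push({ jobId: job.jobId, embedding: embeddings[i] });
    });
  }
  return results;
}
```

Token pricing is usually the same batched or not, but batching cuts request overhead and makes nightly off-peak runs practical.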
Caching ROI Calculation
Caching saves money, but caching infrastructure costs money too. Calculate ROI.
class CachingROI {
constructor() {
this.cacheSize = 0; // bytes
this.cacheCostPerMonth = 50; // Redis instance cost
}
async shouldCache(key, value) {
const hitProbability = await this.estimateHitRate(key);
const valueCost = await this.estimateValueCost(value);
const storageCost = (value.length / 1024 / 1024) * 0.03; // $0.03 per MB per month
// Expected monthly savings if cached (assumes roughly one lookup per day)
const monthlySavings = (valueCost * hitProbability * 30) - storageCost;
return monthlySavings > 0;
}
async estimateValueCost(value) {
// Cost of generating this value (embedding, LLM call, etc.)
if (value.isEmbedding) {
return (value.tokens / 1000) * 0.00013; // text-embedding-3-large, ~$0.13 per 1M tokens; check current pricing
}
// ... other value types
}
async estimateHitRate(key) {
// Query historical data
const pattern = this.extractPattern(key);
const similar = await db.cache.countSimilar(pattern);
return Math.min(similar / 1000, 0.9);
}
}
// Use ROI calculator
const roi = new CachingROI();
if (await roi.shouldCache(key, value)) {
await cache.set(key, value);
}
Cache only high-ROI items. This prevents memory bloat and keeps infrastructure costs low.
Right-Sizing Based on Actual Metrics
Don't guess capacity. Measure.
// Collect actual usage patterns
const metrics = {
peakQPS: 0,
p95Latency: 0,
cpuUsage: 0,
memoryUsage: 0
};
setInterval(async () => {
const current = await getCurrentMetrics();
metrics.peakQPS = Math.max(metrics.peakQPS, current.qps);
metrics.p95Latency = await percentile(current.latencies, 0.95);
metrics.cpuUsage = current.cpu;
metrics.memoryUsage = current.memory;
// Recommend right-sized instance
const recommendation = recommendInstance(metrics);
logger.info('Sizing recommendation', recommendation);
}, 60000);
function recommendInstance(metrics) {
// If we're using < 20% of instance capacity, downsize
if (metrics.cpuUsage < 0.2 && metrics.memoryUsage < 0.2) {
return { action: 'downsize', savings: 600 }; // Save $600/month
}
// If we're at 90%+ usage, upsize to avoid throttling
if (metrics.cpuUsage > 0.9 || metrics.memoryUsage > 0.9) {
return { action: 'upsize', impact: 'reliability' };
}
return { action: 'none' };
}
Review sizing monthly. Under-utilized instances leak money. Over-utilized instances cause outages.
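The metrics loop above calls a `percentile()` helper it never defines. A minimal nearest-rank sketch:

```javascript
// Nearest-rank percentile: p in [0, 1], e.g. 0.95 for p95 latency
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}
```

Nearest-rank is crude but monotone and cheap, which is all a monthly sizing review needs.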
Spot Instances for Batch AI Workloads
Spot instances cost < 1/3 of on-demand. Use them for batch jobs.
// On-demand: reliable, roughly $1/hour
// Spot: cheap, can be interrupted, roughly $0.30/hour
const params = {
ImageId: 'ami-12345678',
InstanceType: 't3.large',
MinCount: 1,
MaxCount: 1,
InstanceMarketOptions: {
MarketType: 'spot',
SpotOptions: {
MaxPrice: '0.40', // max willing to pay per hour
SpotInstanceType: 'one-time', // 'persistent' requires stop/hibernate on interruption
InstanceInterruptionBehavior: 'terminate'
}
}
};
await ec2.runInstances(params).promise();
// For batch jobs that can restart: use spot
// For always-on services: use on-demand
// Example: nightly embedding regeneration
const job = {
type: 'regenerate_embeddings',
instanceType: 'spot', // Can be interrupted
maxPrice: 0.40,
retries: 3 // Will retry on interruption
};
Spot instances reduce compute costs by 60-70%. Reserve them for fault-tolerant batch jobs.
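Making that retry work means noticing the interruption. EC2 posts a two-minute warning to the instance metadata service at `/latest/meta-data/spot/instance-action` (404 until a notice exists; this sketch assumes IMDSv1 is enabled, IMDSv2 additionally needs a session token). The `fetchFn` injection exists purely for testability:

```javascript
const SPOT_ACTION_URL =
  'http://169.254.169.254/latest/meta-data/spot/instance-action';

// Returns the interruption notice object, or null if none is pending
async function checkInterruption(fetchFn = fetch) {
  try {
    const res = await fetchFn(SPOT_ACTION_URL);
    if (!res.ok) return null; // 404 until EC2 issues the two-minute warning
    return await res.json();
  } catch {
    return null; // metadata service unreachable outside EC2
  }
}

// Poll every few seconds; checkpoint the batch job before termination
function watchForInterruption(checkpoint, intervalMs = 5000) {
  const timer = setInterval(async () => {
    const notice = await checkInterruption();
    if (notice) {
      clearInterval(timer);
      await checkpoint(notice);
    }
  }, intervalMs);
  return timer;
}
```

With checkpointing in place, an interruption costs you at most one batch of rework instead of the whole job.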
S3 Intelligent Tiering
Store data intelligently. Hot data in standard tier, cold data in cheaper tiers.
// Automatically tiered based on access patterns
const s3 = new AWS.S3();
const params = {
Bucket: 'my-bucket',
Key: 'data.json',
Body: JSON.stringify(data),
StorageClass: 'INTELLIGENT_TIERING'
};
await s3.putObject(params).promise();
// S3 automatically moves data:
// - Frequent access tier: $0.023 per GB
// - Infrequent access tier: $0.0125 per GB (30 days)
// - Archive access tier: $0.004 per GB (90 days)
// - Deep archive: $0.00099 per GB (180 days)
Enable intelligent tiering on all S3 buckets. It auto-tiers based on access patterns and saves 30-50% on storage costs.
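The `StorageClass` above only affects new writes. Existing objects need a lifecycle rule; a sketch that builds the parameters for `s3.putBucketLifecycleConfiguration` (the rule ID is an arbitrary example):

```javascript
// Build a lifecycle rule that moves every object to Intelligent-Tiering
function intelligentTieringLifecycle(bucket) {
  return {
    Bucket: bucket,
    LifecycleConfiguration: {
      Rules: [{
        ID: 'tier-everything',
        Status: 'Enabled',
        Filter: {}, // empty filter applies the rule to the whole bucket
        Transitions: [{ Days: 0, StorageClass: 'INTELLIGENT_TIERING' }]
      }]
    }
  };
}
```

Pass the result to `s3.putBucketLifecycleConfiguration(...).promise()` once per bucket and S3 handles the migration.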
Cost Anomaly Detection and Alerting
Alert when costs spike. Don't discover problems in hindsight.
class CostAnomalyDetector {
async detectAnomalies() {
// Get daily costs for last 30 days
const costs = await db.costs.dailySum(last30Days());
// Calculate the mean and standard deviation
// (named avg to avoid shadowing the mean() helper)
const avg = mean(costs);
const stdDev = standardDeviation(costs);
const today = costs[costs.length - 1];
// Flag if today is > 2 standard deviations above the mean
if (today > avg + 2 * stdDev) {
await sendAlert({
type: 'COST_ANOMALY',
todaysCost: today,
expectedCost: avg,
deviationPct: Math.round(((today - avg) / avg) * 100)
});
}
}
}
// Run daily
scheduler.every('1 day', async () => {
const detector = new CostAnomalyDetector();
await detector.detectAnomalies();
});
Anomaly detection catches runaway costs early. A cost spike detected after 1 day instead of 30 days saves thousands.
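The detector assumes `mean()` and `standardDeviation()` helpers. A minimal sketch, using the population standard deviation, which is fine for a fixed 30-day window:

```javascript
// Arithmetic mean of an array of numbers
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (divide by N, not N - 1)
function standardDeviation(values) {
  const avg = mean(values);
  const variance =
    values.reduce((sum, v) => sum + (v - avg) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}
```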
Unit Economics Dashboard
Track unit economics at the business level.
// Dashboard query
async function unitEconomics() {
const revenue = await db.revenue.sumByMonth(currentMonth());
const costs = await db.costs.sumByMonth(currentMonth());
const activeUsers = await db.users.countActive(currentMonth());
return {
revenue,
costs,
grossMargin: (revenue - costs) / revenue,
costPerActiveUser: costs / activeUsers,
revenuePerActiveUser: revenue / activeUsers,
profitPerActiveUser: (revenue - costs) / activeUsers
};
}
Share this dashboard with non-technical stakeholders. It's the single source of truth for business health.
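The same roll-up is worth repeating per feature, so a money-losing feature surfaces before it drags down the blended margin. A hedged sketch over hypothetical rows of `{ feature, revenue, cost }`:

```javascript
// Rank features by gross margin, worst first
function featureMargins(rows) {
  return rows
    .map(({ feature, revenue, cost }) => ({
      feature,
      marginUsd: revenue - cost,
      grossMargin: revenue > 0 ? (revenue - cost) / revenue : 0
    }))
    .sort((a, b) => a.grossMargin - b.grossMargin);
}
```

The top of that list is where your cost engineering effort pays off first.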
Checklist
- Log per-request costs to a cost tracking table
- Tag all resources (AWS, GCP, Azure) by team, feature, customer
- Calculate embedding, LLM, and database costs inline
- Set hard and soft cost limits per user/tenant
- Offload expensive operations to async jobs
- Calculate caching ROI before caching
- Review capacity monthly; downsize under-utilized instances
- Use spot instances for batch AI workloads (70% savings)
- Enable S3 intelligent tiering (30-50% savings)
- Set up cost anomaly detection and daily alerts
- Build unit economics dashboard; review weekly
Conclusion
Cost engineering is as important as performance engineering. Build visibility at request granularity, set limits, right-size infrastructure, and use cheaper compute where possible. The difference between a sustainable AI product and a money-losing one is engineering discipline around costs.