System Design for AI-Powered Products — Architecture Decisions That Scale


Introduction

Building AI-powered products at scale requires rethinking traditional system design principles. Unlike deterministic services, AI systems face unique challenges: non-deterministic outputs, variable latency, unpredictable costs, and dependencies on third-party APIs with rate limits. This post covers architectural patterns that handle these realities.

The Core Challenges of AI Architecture

AI systems differ fundamentally from traditional backends. A database query returns the same result every time; an LLM call doesn't. An API endpoint completes in 100ms; an LLM call takes 2-10 seconds. Costs scale with usage in unpredictable ways—one user's prompt might tokenize to 50 tokens while another's is 50,000.

These differences force architectural decisions early:

Non-determinism: You can't rely on response caching as aggressively as traditional backends. A request for "summarize my data" will produce different outputs each time, making deterministic caching risky.

Latency variability: A user request hitting an LLM directly ties up your entire request cycle. If the LLM takes 8 seconds, the user waits 8 seconds. If rate limits hit, everyone waits.

Cost opacity: Without careful metering, a single runaway query can cost hundreds of dollars. You need cost visibility at request granularity.

Async-First Design for LLM Calls

Never block user requests on LLM responses. Use async patterns everywhere.

// Bad: user waits for LLM
app.post('/api/summarize', async (req, res) => {
  const summary = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: req.body.text }]
  });
  res.json(summary);
});

// Good: async job, user gets immediate response
app.post('/api/summarize', async (req, res) => {
  const jobId = uuidv4();
  await queue.enqueue({
    type: 'summarize',
    jobId,
    userId: req.user.id,
    text: req.body.text
  });
  res.json({ jobId });
});

// Worker processes the LLM call off the request path
worker.on('summarize', async (job) => {
  const summary = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: job.text }]
  });
  await db.summaries.update(job.jobId, { result: summary });
});

Queue-based architectures decouple LLM latency from user experience. Users get immediate feedback ("Your summary is processing") while expensive operations run in the background.
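The queue/worker split above can be sketched end to end in memory. `JobQueue` here is a stand-in for a real broker (BullMQ, SQS, and similar); the class and method names are assumptions for illustration, not a specific library's API:

```javascript
// Minimal in-memory sketch of the queue/worker split.
// A real broker persists jobs durably and runs workers in separate processes.
class JobQueue {
  constructor() {
    this.handlers = new Map();
    this.pending = [];
  }
  // Register a worker handler for a job type.
  on(type, handler) {
    this.handlers.set(type, handler);
  }
  // Enqueue returns immediately; the caller never waits on the LLM.
  async enqueue(job) {
    this.pending.push(job);
  }
  // A real worker polls continuously; drain() processes what's queued now.
  async drain() {
    while (this.pending.length > 0) {
      const job = this.pending.shift();
      await this.handlers.get(job.type)(job);
    }
  }
}
```

The HTTP handler only pays the cost of `enqueue`; everything slow happens in `drain` (the worker loop), which is the property the pattern is buying.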

Streaming Architecture for Real-Time AI Responses

When users expect streaming responses (like ChatGPT), use server-sent events or WebSockets to push tokens as they arrive.

app.get('/api/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: req.query.prompt }],
    stream: true
  });

  for await (const event of stream) {
    const delta = event.choices[0]?.delta?.content || '';
    res.write(`data: ${JSON.stringify({ delta })}\n\n`);
  }
  res.end();
});

Streaming keeps users engaged by showing progress. Each token arriving is a signal that the system is thinking.
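On the client, the `data:` frames the endpoint emits have to be reassembled into text. A hypothetical helper (the `delta` field matches the endpoint sketch above; the function name is an assumption):

```javascript
// Parse a raw SSE buffer into the delta strings the endpoint emits.
// Each event is a "data: <json>" block terminated by a blank line.
function parseSSEDeltas(raw) {
  return raw
    .split('\n\n')
    .filter((block) => block.startsWith('data: '))
    .map((block) => JSON.parse(block.slice('data: '.length)).delta);
}
```

In a browser you would typically let `EventSource` do this framing for you; parsing by hand like this is mainly useful when consuming the stream with `fetch` or in tests.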

LLM Response Caching Strategy

Cache carefully. Two strategies work:

Exact match caching: Cache identical prompts with identical parameters. Useful for deterministic workflows.

const cacheKey = hash(JSON.stringify({
  model: 'gpt-4o',
  prompt: req.body.prompt,
  temperature: 0.7
}));

const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);

const response = await openai.chat.completions.create({...});
await redis.setex(cacheKey, 3600, JSON.stringify(response));

Semantic caching: Cache similar prompts. If "What's the weather in NYC?" was asked before, reuse it for "Weather in New York City?"—requires embedding similarity.

const promptEmbedding = await embed(req.body.prompt);
const similar = await vectordb.search(promptEmbedding, { threshold: 0.95 });
if (similar.length > 0) {
  return similar[0].cachedResponse;
}

Cache only when the cost of generating exceeds the cost of storage and lookup.
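That break-even rule can be made concrete. A sketch with assumed cost figures (the dollar amounts and expected-hit-rate input are illustrative, not real pricing):

```javascript
// Cache only if expected savings beat the cache's own cost.
// All dollar figures here are assumptions for illustration.
function shouldCache({ promptTokens, completionTokens, hitRateEstimate }) {
  const generationCostUsd = ((promptTokens + completionTokens) / 1000) * 0.01;
  const cacheCostUsd = 0.0001; // storage + embedding lookup, assumed
  // Expected value of a cache entry: generation cost avoided per future hit.
  return generationCostUsd * hitRateEstimate > cacheCostUsd;
}
```

Large, frequently repeated prompts clear the bar easily; tiny one-off prompts do not, and caching them just adds lookup latency.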

Fallback Chain Design

Design multilayered fallback chains:

async function getAIResponse(prompt, userId) {
  const user = await db.users.get(userId);

  try {
    // Tier 1: Premium user? Use GPT-4o
    if (user.tier === 'premium') {
      return await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }]
      });
    }
  } catch (error) {
    // Fallback: cheaper model
  }

  try {
    // Tier 2: Standard model
    return await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: prompt }]
    });
  } catch (error) {
    // Fallback: cache
  }

  // Tier 3: Return previously cached response for similar prompt
  const similar = await cache.findSimilar(prompt);
  if (similar) return similar.response;

  throw new Error('All fallbacks exhausted');
}

Fallback chains ensure availability when primary systems fail or rate limits hit.
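Each tier should also retry transient failures before falling through, so a single 429 doesn't immediately downgrade the user. A sketch of exponential backoff with jitter (`withRetries` and its defaults are assumptions, not an existing library):

```javascript
// Retry an async call with exponential backoff and jitter.
// Wrap each tier of the fallback chain so transient 429s or timeouts
// don't immediately cascade to the next tier.
async function withRetries(fn, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // out of retries
      // Exponential backoff (2^attempt) with jitter in [0.5x, 1x).
      const delayMs = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Jitter matters under provider rate limits: without it, every client that was throttled at the same moment retries at the same moment, and the thundering herd trips the limit again.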

Cost Metering Per User and Tenant

Track costs at request granularity:

async function recordCost(userId, tenantId, operation, tokens) {
  const costUsd = (tokens / 1000) * 0.01; // adjust pricing

  await db.costs.insert({
    userId,
    tenantId,
    operation,
    tokens,
    costUsd,
    timestamp: new Date()
  });

  // Alert and cut off if the user exceeds their tier's daily limit
  // (limits: per-tier daily budgets loaded from config)
  const user = await db.users.get(userId);
  const dailyCost = await db.costs.sumByDate(userId, today());
  if (dailyCost > limits[user.tier]) {
    await sendAlert(userId, `Cost limit exceeded: $${dailyCost}`);
    await disableAIFeatures(userId);
  }
}

Without cost metering, a single user can bankrupt your service.
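The flat `$0.01` per 1K tokens above hides that prompt and completion tokens are priced differently per model. A sketch with a per-model pricing table (the figures are assumptions; real prices change, so load them from config rather than hard-coding):

```javascript
// Assumed per-1K-token prices; keep these in config, not code.
const PRICING = {
  'gpt-4o':      { promptPer1k: 0.0025,  completionPer1k: 0.01 },
  'gpt-4o-mini': { promptPer1k: 0.00015, completionPer1k: 0.0006 }
};

// Compute request cost from prompt and completion token counts separately.
function costUsd(model, promptTokens, completionTokens) {
  const price = PRICING[model];
  if (!price) throw new Error(`No pricing for model: ${model}`);
  return (promptTokens / 1000) * price.promptPer1k
       + (completionTokens / 1000) * price.completionPer1k;
}
```

Splitting prompt and completion pricing also makes the metering data more useful: a cost spike from long prompts calls for context trimming, while one from long completions calls for `max_tokens` caps.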

Rate Limiting by Tokens, Not Requests

Traditional rate limiting (X requests per minute) fails for AI. One request with 100K tokens is far costlier than one request with 100 tokens.

// One global bucket shown for brevity; in production, keep one per user.
const tokenBucket = new TokenBucket({
  capacity: 100000,    // tokens
  refillRate: 10000    // tokens per minute
});

app.use(async (req, res, next) => {
  const estimatedTokens = estimateTokens(req.body.prompt);

  if (!tokenBucket.tryConsume(estimatedTokens)) {
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: tokenBucket.timeUntilRefill()
    });
  }
  next();
});

Token-based limits align costs with rate limiting.
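The middleware above assumes an `estimateTokens` helper. A crude sketch, roughly four characters per token for English text; use a real tokenizer (e.g. tiktoken) when the estimate needs to be tight:

```javascript
// Rough pre-flight token estimate: ~4 characters per token for English.
// Good enough for rate limiting; not good enough for billing.
function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length / 4);
}
```

Over-estimating slightly is the safe direction here: rejecting a request the bucket could have served is cheaper than letting a 100K-token prompt through on a 100-token estimate.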

AI Feature Flags for Gradual Rollout

Feature flags let you roll out new models safely:

async function getResponse(prompt, userId) {
  const flags = await featureFlags.getForUser(userId);

  if (flags['use-new-gpt4o-model'] === true) {
    return await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: prompt }]
    });
  }

  return await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }]
  });
}

Roll out incrementally: enable for 5% of users first, then 25%, then 100%.

Observability Stack for AI Systems

Monitor at multiple levels:

// Trace-level: full request context
const span = tracer.startSpan('llm_request', {
  attributes: {
    model: 'gpt-4o',
    prompt_tokens: 150,
    max_tokens: 500,
    temperature: 0.7,
    userId,
    tenantId
  }
});

// Metric-level: aggregate patterns
metrics.histogram('llm.latency_ms', latencyMs, {
  model: 'gpt-4o',
  user_tier: 'premium'
});

metrics.counter('llm.tokens_used', tokenCount, {
  model: 'gpt-4o',
  operation: 'summarize'
});

// Log-level: events
logger.info('LLM request completed', {
  llmLatency: '2500ms',
  totalLatency: '2650ms',
  model: 'gpt-4o',
  tokensSaved: 0
});

Observability reveals cost spikes, latency outliers, and error patterns before they become incidents.

Multi-Tenant AI Data Isolation

In multi-tenant systems, prevent cross-tenant data leakage:

// Vector DB query: always filter by tenant
const results = await vectordb.search(embedding, {
  filter: { tenantId: req.user.tenantId },
  topK: 5
});

// LLM context: include tenant ID
const systemPrompt = `You are an AI assistant for ${tenant.name}.
Use only the following context that belongs to this tenant:`;

// Cache key: include tenant
const cacheKey = hash(JSON.stringify({
  tenantId: req.user.tenantId,
  prompt: req.body.prompt
}));

Tenant isolation is non-negotiable for security and compliance.
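Filtering can also be enforced structurally rather than remembered at every call site. A sketch of a wrapper that makes unscoped queries impossible (the `vectordb.search` signature mirrors the snippet above and is an assumption):

```javascript
// Returns a search function that always injects the tenant filter,
// so an unscoped query cannot be written by accident.
function tenantScopedSearch(vectordb, tenantId) {
  if (!tenantId) throw new Error('tenantId is required');
  return (embedding, options = {}) =>
    vectordb.search(embedding, {
      ...options,
      filter: { ...(options.filter || {}), tenantId }
    });
}
```

Handing request handlers only the scoped function, never the raw `vectordb`, turns tenant isolation from a convention into a structural guarantee.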

Checklist

  • Implement async-first job queues for LLM calls
  • Add streaming endpoints for real-time responses
  • Deploy exact-match and semantic caching
  • Design three-tier fallback chains (premium → standard → cache)
  • Track costs per user and tenant
  • Rate limit by tokens, not requests
  • Use feature flags for model rollouts
  • Instrument LLM latency, tokens, and costs
  • Enforce tenant data isolation in vector stores and prompts
  • Set hard cost limits and kill switches

Conclusion

Scaling AI products requires fundamentally different architecture than traditional backends. Async-first patterns, intelligent caching, fallback chains, and cost-aware rate limiting form the foundation. Pair these with comprehensive observability and tenant isolation, and you'll build systems that survive real-world usage: rate limits, cost overruns, and model failures.