Published on

LLM Observability in Production — Tracing, Evaluating, and Debugging AI Features

Authors

Introduction

LLM outputs are probabilistic. Without observability, you cannot detect quality regressions, identify failure patterns, or optimize performance. This guide covers production observability strategies used by scaled AI companies.

LangSmith and Langfuse Instrumentation

Set up centralized tracing for all LLM calls with detailed span information.

import { LangSmith } from 'langsmith';
import { Langfuse } from 'langfuse';

class ObservableLLMClient {
  // NOTE(review): the LangSmith client is constructed but never used in the
  // visible code; presumably other tracing paths (not shown) go through it.
  private langsmith: LangSmith;
  private langfuse: Langfuse;

  constructor(
    langsmithKey: string,
    langfuseKey: string,
    langfusePublicKey: string
  ) {
    this.langsmith = new LangSmith({ apiKey: langsmithKey });
    this.langfuse = new Langfuse({ secretKey: langfuseKey, publicKey: langfusePublicKey });
  }

  /**
   * Calls the OpenAI chat-completions API and records the call as a Langfuse
   * trace containing a single `api_call` span with token/latency metadata.
   *
   * @param model    OpenAI model identifier (e.g. 'gpt-4-turbo').
   * @param messages Chat history to send, in OpenAI message format.
   * @param userId   End-user id attached to the trace for per-user analysis.
   * @param feature  Product feature name; becomes part of the trace name.
   * @returns The assistant's reply text.
   * @throws Error when the request fails, the API returns a non-2xx status,
   *         or the response contains no completion.
   */
  async traceCompletion(
    model: string,
    messages: Array<{ role: string; content: string }>,
    userId: string,
    feature: string
  ): Promise<string> {
    const trace = this.langfuse.trace({
      name: `llm_completion_${feature}`,
      input: { model, messageCount: messages.length },
      userId,
    });

    const generation = trace.span({
      name: 'api_call',
      input: { messages },
      metadata: { model, feature, userId },
    });

    try {
      const startTime = Date.now();

      const response = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model,
          messages,
          temperature: 0.7,
        }),
      });

      // Fail fast on HTTP errors; otherwise a 4xx/5xx error body without
      // `choices` would surface below as an opaque TypeError instead of the
      // actual API failure.
      if (!response.ok) {
        throw new Error(`OpenAI API error: ${response.status} ${response.statusText}`);
      }

      const data = (await response.json()) as {
        choices: Array<{ message: { content: string } }>;
        usage: { prompt_tokens: number; completion_tokens: number };
      };

      const latencyMs = Date.now() - startTime;
      const content = data.choices[0]?.message?.content;
      if (content === undefined) {
        throw new Error('OpenAI API returned no completion choices');
      }

      generation.end({
        output: content,
        metadata: {
          promptTokens: data.usage.prompt_tokens,
          completionTokens: data.usage.completion_tokens,
          latencyMs,
        },
      });

      trace.update({
        output: { content, success: true },
        metadata: { totalLatencyMs: latencyMs },
      });

      return content;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);

      generation.end({
        output: null,
        level: 'ERROR',
        statusMessage: message,
      });

      // Also mark the whole trace as failed, not just the inner span, so
      // trace-level dashboards reflect the error.
      trace.update({
        output: { success: false },
        metadata: { error: message },
      });

      throw error;
    }
  }
}

// Demo: build the instrumented client from environment configuration and
// trace one summarization call end-to-end.
const client = new ObservableLLMClient(
  process.env.LANGSMITH_API_KEY!,
  process.env.LANGFUSE_SECRET!,
  process.env.LANGFUSE_PUBLIC!
);

const demoMessages = [{ role: 'user', content: 'Summarize this document...' }];

const result = await client.traceCompletion(
  'gpt-4-turbo',
  demoMessages,
  'user123',
  'document_summarization'
);

Span Instrumentation for Chains

Break down multi-step AI workflows into measurable spans.

class ChainTracer {
  private spans: Map<string, { startTime: number; data: Record<string, unknown> }> = new Map();

  startSpan(spanId: string, name: string, metadata: Record<string, unknown> = {}): void {
    this.spans.set(spanId, {
      startTime: Date.now(),
      data: { name, metadata, events: [] },
    });
  }

  addEvent(spanId: string, eventName: string, data: Record<string, unknown> = {}): void {
    const span = this.spans.get(spanId);
    if (!span) return;

    (span.data.events as Array<{ name: string; data: Record<string, unknown>; timestamp: number }> = span.data.events || []).push({
      name: eventName,
      data,
      timestamp: Date.now(),
    });
  }

  endSpan(spanId: string, status: 'success' | 'error' = 'success', output: unknown = null): Record<string, unknown> {
    const span = this.spans.get(spanId);
    if (!span) return {};

    const duration = Date.now() - span.startTime;
    const record = {
      ...span.data,
      duration,
      status,
      output,
      timestamp: new Date().toISOString(),
    };

    this.spans.delete(spanId);
    console.log(`[SPAN] ${record.name} completed in ${duration}ms`);

    return record;
  }
}

/**
 * Demo: traces a four-stage RAG-style chain (preprocess → embed → retrieve →
 * generate) with one span per stage. The embedding and LLM calls are
 * simulated.
 */
async function tracedAIChain(input: string): Promise<void> {
  const tracer = new ChainTracer();
  const chainId = `chain_${Date.now()}`;
  const stage = (step: string): string => `${chainId}_${step}`;

  // 1. Preprocessing
  tracer.startSpan(stage('preprocess'), 'Preprocessing', { inputLength: input.length });
  const cleaned = input.toLowerCase().trim();
  tracer.addEvent(stage('preprocess'), 'cleaned_input', { length: cleaned.length });
  tracer.endSpan(stage('preprocess'), 'success', cleaned);

  // 2. Embedding (simulated API latency)
  tracer.startSpan(stage('embed'), 'Embedding', { text: cleaned });
  await new Promise((resolve) => setTimeout(resolve, 100));
  const vector = Array.from({ length: 384 }, () => Math.random());
  tracer.endSpan(stage('embed'), 'success', { dimensions: vector.length });

  // 3. Retrieval (simulated vector search)
  tracer.startSpan(stage('retrieval'), 'Vector Search', { embeddingDims: vector.length });
  const hits = [{ id: 'doc1', score: 0.95 }];
  tracer.endSpan(stage('retrieval'), 'success', { matchCount: hits.length });

  // 4. Generation (simulated LLM call)
  tracer.startSpan(stage('llm'), 'LLM Generation', { contextDocs: hits.length });
  const answer = 'Generated response based on retrieved context...';
  tracer.endSpan(stage('llm'), 'success', answer);
}

await tracedAIChain('Sample input for processing');

LLM-as-Judge Evaluation

Use an LLM to evaluate the quality of another LLM's outputs against criteria.

/** Outcome of a single LLM-as-judge evaluation. */
interface EvaluationResult {
  // Judge-assigned quality score on a 0-10 scale.
  score: number;
  // The judge model's free-text justification for the score.
  reasoning: string;
  // True when score >= 7 (the pass threshold used by LLMJudge.evaluate).
  passed: boolean;
}

class LLMJudge {
  /**
   * Scores `output` against `criteria` using GPT-4 as a judge.
   *
   * @param output   The candidate text to evaluate.
   * @param criteria Natural-language evaluation criteria.
   * @param examples Optional few-shot scored examples to calibrate the judge.
   * @returns Score (0-10), the judge's reasoning, and passed (score >= 7).
   * @throws Error on HTTP failure, or when the judge's reply is not valid
   *         JSON with a numeric score.
   */
  async evaluate(
    output: string,
    criteria: string,
    examples?: Array<{ output: string; score: number }>
  ): Promise<EvaluationResult> {
    const examplePrompt = examples
      ? `Examples:\n${examples.map((ex) => `Output: ${ex.output}\nScore: ${ex.score}/10`).join('\n\n')}\n\n`
      : '';

    const prompt = `${examplePrompt}Evaluate this output based on the criteria: ${criteria}\n\nOutput: "${output}"\n\nRespond with JSON: { "score": <0-10>, "reasoning": "<explanation>" }`;

    // temperature 0 keeps judge scores as deterministic as the API allows.
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: 'gpt-4-turbo',
        messages: [{ role: 'user', content: prompt }],
        temperature: 0,
      }),
    });

    // Surface HTTP failures directly instead of failing later on a missing
    // `choices` array.
    if (!response.ok) {
      throw new Error(`Judge API error: ${response.status} ${response.statusText}`);
    }

    const data = (await response.json()) as { choices: Array<{ message: { content: string } }> };
    const raw = data.choices[0].message.content;

    // Models often wrap JSON replies in markdown code fences; strip them
    // before parsing.
    const jsonText = raw.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/, '').trim();

    let parsed: { score?: unknown; reasoning?: unknown };
    try {
      parsed = JSON.parse(jsonText) as { score?: unknown; reasoning?: unknown };
    } catch {
      throw new Error(`Judge returned non-JSON output: ${raw}`);
    }

    // Validate the parsed payload instead of trusting the model's output.
    const score = typeof parsed.score === 'number' ? parsed.score : Number(parsed.score);
    if (!Number.isFinite(score)) {
      throw new Error(`Judge returned a non-numeric score: ${raw}`);
    }

    return {
      score,
      reasoning: typeof parsed.reasoning === 'string' ? parsed.reasoning : String(parsed.reasoning ?? ''),
      passed: score >= 7,
    };
  }

  /** Evaluates many outputs against the same criteria concurrently. */
  async evaluateBatch(
    outputs: string[],
    criteria: string
  ): Promise<Array<{ output: string; evaluation: EvaluationResult }>> {
    const results = await Promise.all(outputs.map((out) => this.evaluate(out, criteria)));
    return outputs.map((output, i) => ({ output, evaluation: results[i] }));
  }
}

// Demo: score a single answer for factual accuracy.
const judge = new LLMJudge();

// Named `judgeEvaluation` (was `evaluation`) so it cannot collide with the
// golden-dataset `evaluation` const declared later at module scope — two
// `const evaluation` declarations in one module are a redeclaration error.
const judgeEvaluation = await judge.evaluate(
  'The capital of France is Paris, located on the Seine River.',
  'Is the response factually accurate and relevant?'
);

console.log(`Score: ${judgeEvaluation.score}/10 - ${judgeEvaluation.reasoning}`);

Cosine Similarity and Exact Match Metrics

Evaluate semantic similarity and exact correctness.

/** Deterministic retrieval/output metrics for evaluating model responses. */
class EvaluationMetrics {
  /**
   * Cosine similarity between two equal-length vectors, in [-1, 1].
   * Returns 0 when either vector has zero magnitude (the similarity is
   * undefined there; previously this returned NaN).
   * @throws Error when the vectors differ in length.
   */
  cosineSimilarity(vecA: number[], vecB: number[]): number {
    if (vecA.length !== vecB.length) {
      throw new Error('Vectors must have same dimensions');
    }

    const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
    const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
    const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));

    // Avoid NaN from 0/0 when either vector is all zeros (or empty).
    if (magnitudeA === 0 || magnitudeB === 0) return 0;

    return dotProduct / (magnitudeA * magnitudeB);
  }

  /** Case-insensitive, whitespace-trimmed exact string comparison. */
  exactMatch(predicted: string, expected: string): boolean {
    return predicted.trim().toLowerCase() === expected.trim().toLowerCase();
  }

  /**
   * Fraction of shared words relative to the longer text, in [0, 1].
   * Empty/whitespace-only inputs score 0 (previously splitting '' yielded a
   * spurious '' token that could count as a match).
   */
  partialMatch(predicted: string, expected: string): number {
    // filter(Boolean) drops the empty tokens split() produces for empty or
    // padded strings.
    const predWords = predicted.toLowerCase().split(/\s+/).filter(Boolean);
    const expectedWords = expected.toLowerCase().split(/\s+/).filter(Boolean);

    const total = Math.max(predWords.length, expectedWords.length);
    if (total === 0) return 0;

    const matches = predWords.filter((word) => expectedWords.includes(word)).length;
    return matches / total;
  }

  /**
   * Recall@k: fraction of the top-k predictions present in the gold set,
   * normalized by min(k, |gold|). Returns 0 for an empty gold set
   * (previously NaN from 0/0).
   */
  recallAtK(predictions: string[], goldStandard: string[], k: number = 5): number {
    if (goldStandard.length === 0) return 0;
    const topK = predictions.slice(0, k);
    const matches = topK.filter((pred) => goldStandard.includes(pred)).length;
    return matches / Math.min(k, goldStandard.length);
  }

  /** Reciprocal rank of the first prediction in the gold set; 0 if none. */
  meanReciprocalRank(predictions: string[], goldStandard: string[]): number {
    for (let i = 0; i < predictions.length; i++) {
      if (goldStandard.includes(predictions[i])) {
        return 1 / (i + 1);
      }
    }
    return 0;
  }
}

// Demo: exercise each metric on small, hand-checkable inputs.
const metrics = new EvaluationMetrics();

const a = [1, 0, 1, 0];
const b = [1, 0, 0, 1];
console.log(`Cosine similarity: ${metrics.cosineSimilarity(a, b)}`);
console.log(`Exact match: ${metrics.exactMatch('Paris', 'paris')}`);
console.log(`Partial match: ${metrics.partialMatch('The capital is Paris', 'Paris city')}`);
console.log(`Recall@5: ${metrics.recallAtK(['a', 'b', 'c', 'd', 'e'], ['c', 'd', 'f'], 5)}`);
console.log(`MRR: ${metrics.meanReciprocalRank(['a', 'b', 'c'], ['c', 'd'])}`);

Golden Dataset Maintenance

Build and maintain a golden dataset for continuous evaluation.

/** A single human-reviewed input/output pair used for regression evals. */
interface GoldenExample {
  /** Unique id, generated when the example is added. */
  id: string;
  /** Prompt/input sent to the model. */
  input: string;
  /** The reviewed, expected model output. */
  expectedOutput: string;
  /** Free-form grouping label (e.g. 'math', 'geography'). */
  category: string;
  /** When the example was added to the dataset. */
  createdAt: Date;
  /** Who signed off on this example. */
  reviewedBy: string;
}

/** In-memory golden dataset with exact-match evaluation and JSON export. */
class GoldenDataset {
  private examples: GoldenExample[] = [];

  /** Adds a reviewed example; the id embeds a timestamp plus a random suffix. */
  addExample(input: string, expectedOutput: string, category: string, reviewer: string): void {
    this.examples.push({
      id: `golden_${Date.now()}_${Math.random()}`,
      input,
      expectedOutput,
      category,
      createdAt: new Date(),
      reviewedBy: reviewer,
    });
  }

  /** Returns all examples whose category matches exactly. */
  getByCategory(category: string): GoldenExample[] {
    return this.examples.filter((ex) => ex.category === category);
  }

  /**
   * Runs `model` over every example sequentially and scores each output with
   * a case-insensitive, whitespace-trimmed exact match.
   */
  async evaluateAgainstGolden(
    model: (input: string) => Promise<string>
  ): Promise<{ passed: number; failed: number; results: Array<{ example: GoldenExample; actual: string; match: boolean }> }> {
    // Explicitly typed so the array does not fall back to an evolving any[].
    const results: Array<{ example: GoldenExample; actual: string; match: boolean }> = [];
    let passed = 0;
    let failed = 0;

    for (const example of this.examples) {
      const actual = await model(example.input);
      const match = actual.trim().toLowerCase() === example.expectedOutput.trim().toLowerCase();

      results.push({ example, actual, match });
      if (match) passed++;
      else failed++;
    }

    return { passed, failed, results };
  }

  /** Serializes the dataset as pretty-printed JSON for backup/versioning. */
  exportForBackup(): string {
    return JSON.stringify(this.examples, null, 2);
  }
}

// Demo: seed the golden set and evaluate a deterministic mock model.
const goldenSet = new GoldenDataset();
goldenSet.addExample('What is 2+2?', '4', 'math', 'reviewer@example.com');
goldenSet.addExample('What is the capital of France?', 'Paris', 'geography', 'reviewer@example.com');

// Stand-in for a real model call; answers the two seeded questions.
const mockModel = async (input: string): Promise<string> => {
  if (input.includes('2+2')) return '4';
  if (input.includes('capital')) return 'Paris';
  return 'Unknown';
};

// Named `goldenEvaluation` (was `evaluation`) so it cannot collide with the
// LLM-judge `evaluation` const declared earlier at module scope — two
// `const evaluation` declarations in one module are a redeclaration error.
const goldenEvaluation = await goldenSet.evaluateAgainstGolden(mockModel);
console.log(`Passed: ${goldenEvaluation.passed}, Failed: ${goldenEvaluation.failed}`);

Latency Percentiles for AI Endpoints

Monitor response time distribution to catch performance degradation.

/** Collects latency samples and reports percentile/mean/stddev statistics. */
class LatencyMonitor {
  // Raw samples in arrival order; reset() clears them.
  private samples: number[] = [];

  /** Records one latency sample in milliseconds. */
  recordLatency(ms: number): void {
    this.samples.push(ms);
  }

  /**
   * Nearest-rank percentile: the sample at ceil(p% * n), clamped to a valid
   * index. Returns 0 when no samples have been recorded.
   */
  getPercentile(p: number): number {
    const n = this.samples.length;
    if (n === 0) return 0;

    const ordered = [...this.samples].sort((x, y) => x - y);
    const rank = Math.ceil((p / 100) * n) - 1;
    return ordered[Math.max(0, rank)];
  }

  /** Summary statistics over all recorded samples (zeros when empty). */
  getStats(): {
    p50: number;
    p95: number;
    p99: number;
    mean: number;
    stdDev: number;
  } {
    const n = this.samples.length;
    if (n === 0) {
      return { p50: 0, p95: 0, p99: 0, mean: 0, stdDev: 0 };
    }

    let total = 0;
    for (const v of this.samples) total += v;
    const mean = total / n;

    // Population standard deviation (divide by n, not n-1).
    let squaredDiffs = 0;
    for (const v of this.samples) squaredDiffs += (v - mean) ** 2;
    const stdDev = Math.sqrt(squaredDiffs / n);

    return {
      p50: this.getPercentile(50),
      p95: this.getPercentile(95),
      p99: this.getPercentile(99),
      mean,
      stdDev,
    };
  }

  /** Discards all recorded samples. */
  reset(): void {
    this.samples = [];
  }
}

// Demo: record ten sample latencies and report the percentile distribution.
const monitor = new LatencyMonitor();
const sampleLatencies = [120, 135, 145, 150, 180, 200, 220, 250, 300, 500];
for (const latency of sampleLatencies) {
  monitor.recordLatency(latency);
}

const stats = monitor.getStats();
console.log(`P50: ${stats.p50}ms, P95: ${stats.p95}ms, P99: ${stats.p99}ms`);

Token Usage Dashboards

Track token consumption by feature, model, and user.

/** One logged LLM call's token consumption. */
interface TokenUsageRecord {
  /** Product feature that made the call. */
  feature: string;
  /** Model identifier (e.g. 'gpt-4'). */
  model: string;
  /** End user on whose behalf the call ran. */
  userId: string;
  /** Prompt-side token count. */
  inputTokens: number;
  /** Completion-side token count. */
  outputTokens: number;
  /** When the usage was recorded. */
  timestamp: Date;
}

/** In-memory aggregation of token usage by feature, model, and user. */
class TokenDashboard {
  private records: TokenUsageRecord[] = [];

  /** Appends one usage record stamped with the current time. */
  recordUsage(
    feature: string,
    model: string,
    userId: string,
    inputTokens: number,
    outputTokens: number
  ): void {
    const entry: TokenUsageRecord = {
      feature,
      model,
      userId,
      inputTokens,
      outputTokens,
      timestamp: new Date(),
    };
    this.records.push(entry);
  }

  /**
   * Input/output token totals per feature within the trailing window
   * (default window: the last 24 hours).
   */
  getFeatureUsage(timeframeMs: number = 86400000): Record<string, { input: number; output: number }> {
    const cutoff = new Date(Date.now() - timeframeMs);
    const usage: Record<string, { input: number; output: number }> = {};

    for (const rec of this.records) {
      if (rec.timestamp < cutoff) continue;
      const bucket = usage[rec.feature] ?? (usage[rec.feature] = { input: 0, output: 0 });
      bucket.input += rec.inputTokens;
      bucket.output += rec.outputTokens;
    }

    return usage;
  }

  /** Total tokens and call count per model, over all recorded history. */
  getModelComparison(): Record<string, { totalTokens: number; usageCount: number }> {
    const comparison: Record<string, { totalTokens: number; usageCount: number }> = {};

    for (const rec of this.records) {
      const bucket = comparison[rec.model] ?? (comparison[rec.model] = { totalTokens: 0, usageCount: 0 });
      bucket.totalTokens += rec.inputTokens + rec.outputTokens;
      bucket.usageCount += 1;
    }

    return comparison;
  }

  /** Heaviest consumers by combined input+output tokens, descending. */
  getTopUsers(limit: number = 10): Array<{ userId: string; totalTokens: number }> {
    const totals = new Map<string, number>();

    for (const rec of this.records) {
      const combined = rec.inputTokens + rec.outputTokens;
      totals.set(rec.userId, (totals.get(rec.userId) ?? 0) + combined);
    }

    return [...totals.entries()]
      .map(([userId, totalTokens]) => ({ userId, totalTokens }))
      .sort((x, y) => y.totalTokens - x.totalTokens)
      .slice(0, limit);
  }
}

// Demo: record a couple of calls, then print the aggregate views.
const dashboard = new TokenDashboard();
const demoCalls: Array<[string, string, string, number, number]> = [
  ['summarize', 'gpt-4', 'user1', 150, 200],
  ['classify', 'gpt-3.5-turbo', 'user2', 50, 75],
];
for (const [feature, usedModel, userId, inTok, outTok] of demoCalls) {
  dashboard.recordUsage(feature, usedModel, userId, inTok, outTok);
}

console.log('Feature usage:', dashboard.getFeatureUsage());
console.log('Top users:', dashboard.getTopUsers());

Regression Detection and Model Migration Testing

Detect quality drops when swapping models.

/** Compares current model metrics against stored baselines to flag drops. */
class RegressionDetector {
  // Per-model baseline success rate. Only `successRate` from the metrics
  // passed to setBaseline is retained; other keys are ignored.
  private baselineMetrics: Record<string, number> = {};

  /** Stores the baseline success rate for a model (other metric keys are ignored). */
  setBaseline(model: string, metrics: Record<string, number>): void {
    this.baselineMetrics[model] = metrics.successRate || 0;
  }

  /**
   * Compares the current success rate against the stored baseline.
   *
   * @param threshold Relative drop that counts as a regression (default 5%).
   * @returns regressed flag, relative change in percent, and a message.
   *          A model with no stored (or zero) baseline is compared to 1.0.
   */
  detectRegression(model: string, currentMetrics: Record<string, number>, threshold: number = 0.05): {
    regressed: boolean;
    change: number;
    message: string;
  } {
    // `|| 1.0` also catches a stored baseline of 0, keeping the division safe.
    const baseline = this.baselineMetrics[model] || 1.0;
    const current = currentMetrics.successRate || 0;
    const change = (baseline - current) / baseline;

    return {
      regressed: change > threshold,
      change: change * 100,
      message: change > threshold ? `REGRESSION: ${(change * 100).toFixed(2)}% drop` : 'No regression detected',
    };
  }

  /**
   * Scores old and new models over a shared test suite and reports whether
   * the migration is "safe" (new pass rate within 5% of the old pass rate).
   *
   * NOTE(review): model outputs are simulated with string prefixes here;
   * swap in real inference calls for production use.
   */
  testModelMigration(
    oldModel: string,
    newModel: string,
    testCases: Array<{ input: string; expected: string }>,
    evaluator: (output: string, expected: string) => boolean
  ): { oldScore: number; newScore: number; safe: boolean } {
    // Guard: an empty suite previously produced NaN scores (0/0). Treat it
    // as trivially safe with zero scores.
    if (testCases.length === 0) {
      return { oldScore: 0, newScore: 0, safe: true };
    }

    let oldScore = 0;
    let newScore = 0;

    for (const testCase of testCases) {
      // Simulate old model performance
      const oldOutput = `old_${testCase.input}`;
      if (evaluator(oldOutput, testCase.expected)) oldScore++;

      // Simulate new model performance
      const newOutput = `new_${testCase.input}`;
      if (evaluator(newOutput, testCase.expected)) newScore++;
    }

    const oldRate = oldScore / testCases.length;
    const newRate = newScore / testCases.length;
    const safe = newRate >= oldRate * 0.95;

    return { oldScore: oldRate, newScore: newRate, safe };
  }
}

// Demo: a 0.95 → 0.89 success-rate drop is a ~6.3% relative decline, which
// exceeds the default 5% threshold and should print a regression message.
const detector = new RegressionDetector();
detector.setBaseline('gpt-4', { successRate: 0.95 });

const regression = detector.detectRegression('gpt-4', { successRate: 0.89 });
console.log(regression.message);

Checklist

  • Instrument all LLM calls with LangSmith or Langfuse
  • Break down chains into measurable spans with events
  • Use LLM-as-judge for subjective quality evaluation
  • Maintain golden datasets with reviewed examples
  • Monitor p50, p95, p99 latencies weekly
  • Create feature-level token usage dashboards
  • Set up regression detection alerts for model changes
  • Test model migrations against golden datasets before deployment
  • Build cost vs. quality tradeoff dashboards
  • Automate eval runs on every model or prompt change

Conclusion

LLM observability is the foundation of reliable AI systems. Start with basic tracing in LangSmith, add LLM-as-judge evaluations, and maintain golden datasets. As you scale, add latency monitoring, token dashboards, and automated regression detection. This layered approach catches quality regressions early and provides data-driven confidence for model migrations.