Intelligent LLM Model Routing — Sending the Right Query to the Right Model

Introduction

Not every query needs GPT-4. A simple question answered by GPT-3.5 can cost an order of magnitude less than the same call to GPT-4. This guide covers intelligent routing: classify query complexity, match queries to model tiers, and A/B test routing decisions.
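To make the gap concrete, here is a back-of-the-envelope comparison. The prices mirror the tier table used later in this guide; treat both the prices and the token counts as illustrative assumptions, not current list prices.

```typescript
// Illustrative per-1K-token prices (check your provider's current pricing).
const PRICE_PER_1K = {
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
  'gpt-4': { input: 0.015, output: 0.06 }
};

// Cost of one query in dollars, given input and output token counts.
function queryCost(model: keyof typeof PRICE_PER_1K, inTok: number, outTok: number): number {
  const p = PRICE_PER_1K[model];
  return (inTok / 1000) * p.input + (outTok / 1000) * p.output;
}

// A typical short query: 500 input tokens, 300 output tokens.
const cheap = queryCost('gpt-3.5-turbo', 500, 300);   // $0.0007
const premium = queryCost('gpt-4', 500, 300);         // $0.0255
const ratio = premium / cheap;                        // ~36x for this token mix
```

For this token mix the gap is far wider than most teams assume, which is why routing even a modest share of traffic to the cheaper model pays off quickly.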

Query Complexity Classification

Classify queries as simple, medium, or complex to route appropriately.

interface QueryClassification {
  complexity: 'simple' | 'medium' | 'complex';
  score: number;
  reasons: string[];
}

class QueryComplexityClassifier {
  classify(query: string): QueryClassification {
    const reasons: string[] = [];
    let score = 0;

    // Length is a strong signal
    const wordCount = query.split(/\s+/).length;
    if (wordCount < 10) {
      score += 10;
      reasons.push('Very short query');
    } else if (wordCount < 30) {
      score += 20;
      reasons.push('Short query');
    } else if (wordCount < 100) {
      score += 40;
      reasons.push('Medium length query');
    } else if (wordCount < 300) {
      score += 60;
      reasons.push('Long query');
    } else {
      score += 80;
      reasons.push('Very long query');
    }

    // Code presence signals complexity
    const codeBlocks = Math.floor((query.match(/```/g) || []).length / 2);
    if (codeBlocks > 0) {
      score += 30 * codeBlocks;
      reasons.push(`Contains ${codeBlocks} code block(s)`);
    }

    // Math/formula presence (require digits around operators so ordinary hyphens don't match)
    const hasFormulas = /\d\s*[\+\-\*\/\=\^]\s*\d|integral|derivative|sqrt|matrix/i.test(query);
    if (hasFormulas) {
      score += 25;
      reasons.push('Contains mathematical expressions');
    }

    // Multi-step reasoning indicators
    const complexKeywords = /explain|why|how|design|architecture|optimize|refactor|debug|analyze/i;
    if (complexKeywords.test(query)) {
      score += 20;
      reasons.push('Requires reasoning');
    }

    // Multiple questions
    const questionCount = (query.match(/\?/g) || []).length;
    if (questionCount > 1) {
      score += 15 * questionCount;
      reasons.push(`Contains ${questionCount} questions`);
    }

    // Determine category
    let complexity: 'simple' | 'medium' | 'complex';
    if (score < 30) complexity = 'simple';
    else if (score < 70) complexity = 'medium';
    else complexity = 'complex';

    return { complexity, score, reasons };
  }
}

Cost Tiers (Nano/Micro/Standard/Premium)

Define model tiers by cost and capability.

interface ModelTier {
  name: 'nano' | 'micro' | 'standard' | 'premium';
  models: string[];
  cost_per_1k_input: number;
  cost_per_1k_output: number;
  latency_p95_ms: number;
  capability_score: number;
}

class ModelTierRegistry {
  private tiers: Record<string, ModelTier> = {
    nano: {
      name: 'nano',
      models: ['mistral-7b', 'phi-2', 'gemma-7b'],
      cost_per_1k_input: 0.0001,
      cost_per_1k_output: 0.0003,
      latency_p95_ms: 400,
      capability_score: 2
    },
    micro: {
      name: 'micro',
      models: ['gpt-3.5-turbo', 'mistral-small', 'llama-2-13b'],
      cost_per_1k_input: 0.0005,
      cost_per_1k_output: 0.0015,
      latency_p95_ms: 800,
      capability_score: 5
    },
    standard: {
      name: 'standard',
      models: ['gpt-4', 'claude-3-sonnet', 'gemini-pro'],
      cost_per_1k_input: 0.015,
      cost_per_1k_output: 0.06,
      latency_p95_ms: 2000,
      capability_score: 8
    },
    premium: {
      name: 'premium',
      models: ['gpt-4o', 'claude-3-opus', 'gemini-ultra'],
      cost_per_1k_input: 0.03,
      cost_per_1k_output: 0.12,
      latency_p95_ms: 3000,
      capability_score: 10
    }
  };

  getTierForComplexity(complexity: 'simple' | 'medium' | 'complex'): ModelTier {
    const tierMap = {
      simple: 'nano',
      medium: 'standard',
      complex: 'premium'
    };

    return this.tiers[tierMap[complexity]];
  }

  getCheapestModel(): string {
    return this.tiers.nano.models[0];
  }

  getMostCapableModel(): string {
    return this.tiers.premium.models[0];
  }

  estimateCost(model: string, inputTokens: number, outputTokens: number): number {
    // Find tier containing model
    for (const tier of Object.values(this.tiers)) {
      if (tier.models.includes(model)) {
        const inputCost = (inputTokens / 1000) * tier.cost_per_1k_input;
        const outputCost = (outputTokens / 1000) * tier.cost_per_1k_output;
        return inputCost + outputCost;
      }
    }

    throw new Error(`Unknown model: ${model}`);
  }
}
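With the tier prices above, a rough sketch shows where the savings come from once traffic is routed. The mix percentages and average token counts below are assumptions for illustration:

```typescript
// Per-1K-token prices copied from the tier registry above (input, output).
const TIER_PRICE = {
  nano:     { in: 0.0001, out: 0.0003 },
  micro:    { in: 0.0005, out: 0.0015 },
  standard: { in: 0.015,  out: 0.06 },
  premium:  { in: 0.03,   out: 0.12 }
};

// Hypothetical post-routing traffic mix: most queries turn out to be simple.
const MIX: Record<keyof typeof TIER_PRICE, number> = {
  nano: 0.5, micro: 0.25, standard: 0.2, premium: 0.05
};

// Assume an average query of 800 input and 400 output tokens.
function tierCost(t: keyof typeof TIER_PRICE): number {
  return 0.8 * TIER_PRICE[t].in + 0.4 * TIER_PRICE[t].out;
}

const blended = (Object.keys(MIX) as Array<keyof typeof TIER_PRICE>)
  .reduce((sum, t) => sum + MIX[t] * tierCost(t), 0);

const allPremium = tierCost('premium');
const savings = 1 - blended / allPremium; // fraction saved vs. always-premium
```

Under these assumptions the blended cost lands around 15% of the always-premium baseline, i.e. roughly 85% savings. The exact figure depends entirely on your real traffic mix, which is why the metrics section later tracks it.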

Router Model for Intelligent Routing

Use a cheap model to classify queries and select more expensive ones only when needed.

interface RoutingDecision {
  selected_model: string;
  tier: string;
  confidence: number;
  explanation: string;
}

class IntelligentRouter {
  private classifier = new QueryComplexityClassifier();
  private tierRegistry = new ModelTierRegistry();
  private classifierClient: any; // OpenAI client

  async routeQuery(query: string): Promise<RoutingDecision> {
    // First: classify with heuristics (free)
    const heuristicClassification = this.classifier.classify(query);

    // Second: verify with a cheap model (costs roughly $0.001 per call)
    const llmClassification = await this.classifyWithLLM(query, heuristicClassification);

    // Select tier from the classifier's verdict and confidence
    const tier = this.selectTier(
      llmClassification.complexity as 'simple' | 'medium' | 'complex',
      llmClassification.confidence
    );

    // Pick cheapest model in tier
    const selectedModel = tier.models[0];

    return {
      selected_model: selectedModel,
      tier: tier.name,
      confidence: llmClassification.confidence,
      explanation: `Query is ${llmClassification.reasoning}. Routing to ${selectedModel}.`
    };
  }

  private async classifyWithLLM(
    query: string,
    heuristic: QueryClassification
  ): Promise<{ complexity: string; confidence: number; reasoning: string }> {
    const prompt = `Classify query complexity (simple/medium/complex). Query: "${query.substring(0, 100)}"`;

    const response = await this.classifierClient.chat.completions.create({
      model: 'gpt-3.5-turbo', // Cheap classifier
      messages: [{ role: 'user', content: prompt }],
      temperature: 0,
      max_tokens: 50
    });

    const content = response.choices[0].message.content.toLowerCase();

    let complexity = heuristic.complexity;
    let confidence = 0.7;

    if (content.includes('simple') || content.includes('easy')) {
      complexity = 'simple';
      confidence = 0.8;
    } else if (content.includes('complex') || content.includes('hard')) {
      complexity = 'complex';
      confidence = 0.8;
    }

    return {
      complexity,
      confidence,
      reasoning: content.substring(0, 50)
    };
  }

  private selectTier(
    complexity: 'simple' | 'medium' | 'complex',
    llmConfidence: number
  ): ModelTier {
    // If the classifier is confident, use the classification as-is
    if (llmConfidence > 0.75) {
      return this.tierRegistry.getTierForComplexity(complexity);
    }

    // Low confidence: be conservative and escalate one complexity level
    const escalated: Record<'simple' | 'medium' | 'complex', 'simple' | 'medium' | 'complex'> = {
      simple: 'medium',
      medium: 'complex',
      complex: 'complex'
    };

    return this.tierRegistry.getTierForComplexity(escalated[complexity]);
  }
}

Intent-Based Routing

Route based on detected intent (customer support, code generation, creative writing).

interface IntentClassification {
  intent: 'support' | 'coding' | 'creative' | 'analysis' | 'general';
  confidence: number;
}

class IntentBasedRouter {
  private intentPatterns: Record<string, RegExp[]> = {
    support: [
      /help|support|issue|problem|error|bug|broken|fail/i,
      /how to|how do i|how can i/i
    ],
    coding: [
      /code|function|class|script|implement|debug|refactor|api/i,
      /write|generate|create.*code/i
    ],
    creative: [
      /write|story|poem|essay|article|novel|song|blog/i,
      /creative|imagine|brainstorm/i
    ],
    analysis: [
      /analyze|analysis|explain|understand|why|research/i,
      /data|statistics|trend|pattern/i
    ]
  };

  classify(query: string): IntentClassification {
    const scores: Record<string, number> = {
      support: 0,
      coding: 0,
      creative: 0,
      analysis: 0,
      general: 0
    };

    for (const [intent, patterns] of Object.entries(this.intentPatterns)) {
      for (const pattern of patterns) {
        if (pattern.test(query)) {
          scores[intent]++;
        }
      }
    }

    let maxScore = 0;
    let bestIntent: keyof typeof scores = 'general';

    for (const [intent, score] of Object.entries(scores)) {
      if (score > maxScore) {
        maxScore = score;
        bestIntent = intent as keyof typeof scores;
      }
    }

    const totalScore = Object.values(scores).reduce((a, b) => a + b, 0);
    const confidence = totalScore > 0 ? maxScore / totalScore : 0.5;

    return { intent: bestIntent as IntentClassification['intent'], confidence };
  }

  selectModelForIntent(intent: string): string {
    const modelMap: Record<string, string> = {
      support: 'gpt-3.5-turbo', // Fast, good at instructions
      coding: 'gpt-4', // Strong at code
      creative: 'claude-3-opus', // Strong at writing
      analysis: 'gpt-4', // Strong at reasoning
      general: 'gpt-3.5-turbo' // Default
    };

    return modelMap[intent] || 'gpt-3.5-turbo';
  }
}

Confidence-Based Escalation

Start with cheap model, escalate to expensive one if confidence is low.

interface EscalationConfig {
  initial_model: string;
  escalation_threshold: number;
  max_escalations: number;
}

class ConfidenceBasedEscalator {
  private escalationChain = [
    'gpt-3.5-turbo',
    'gpt-4',
    'gpt-4o',
    'claude-3-opus'
  ];

  async respondWithEscalation(
    query: string,
    client: any,
    config: EscalationConfig = {
      initial_model: 'gpt-3.5-turbo',
      escalation_threshold: 0.6,
      max_escalations: 2
    }
  ): Promise<{ response: string; model_used: string; escalations: number }> {
    // Guard against initial models that are not in the chain
    let currentModelIndex = Math.max(0, this.escalationChain.indexOf(config.initial_model));
    let escalations = 0;
    let lastContent = '';
    let lastModel = this.escalationChain[currentModelIndex];

    for (let attempt = 0; attempt <= config.max_escalations; attempt++) {
      const model = this.escalationChain[currentModelIndex];

      const response = await client.chat.completions.create({
        model,
        messages: [{ role: 'user', content: query }],
        temperature: 0.5
      });

      const content = response.choices[0].message.content;
      lastContent = content;
      lastModel = model;

      // Estimate confidence (heuristic)
      const confidence = this.estimateConfidence(content, query);

      if (confidence >= config.escalation_threshold) {
        return {
          response: content,
          model_used: model,
          escalations
        };
      }

      // Escalate to the next model if one exists; otherwise stop with what we have
      if (currentModelIndex < this.escalationChain.length - 1) {
        currentModelIndex++;
        escalations++;
      } else {
        break;
      }
    }

    // Out of escalations: return the last (most capable) model's answer
    // rather than discarding a usable response
    return { response: lastContent, model_used: lastModel, escalations };
  }

  private estimateConfidence(response: string, query: string): number {
    let confidence = 0.5;

    // Response too short = low confidence
    if (response.length < 50) confidence -= 0.2;

    // Uncertainty phrases = low confidence
    if (/i'm not sure|i don't know|uncertain|unsure|likely|probably|maybe/i.test(response)) {
      confidence -= 0.15;
    }

    // Specific details = high confidence
    if (/\b\d{4}\b|specifically|exactly|definitely|certainly/i.test(response)) {
      confidence += 0.15;
    }

    return Math.max(0, Math.min(1, confidence));
  }
}
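Escalation has a predictable cost profile: you always pay for the cheap call, and you pay for the expensive call only when confidence falls short. A small sketch of the expected cost with a single escalation step (the per-query prices are illustrative assumptions):

```typescript
// p = probability the cheap model's answer clears the confidence threshold.
const CHEAP_COST = 0.001;   // e.g. a gpt-3.5-turbo call (illustrative)
const PREMIUM_COST = 0.03;  // e.g. a gpt-4 call (illustrative)

function expectedEscalationCost(p: number): number {
  // With probability p we pay only the cheap call; otherwise we pay both.
  return p * CHEAP_COST + (1 - p) * (CHEAP_COST + PREMIUM_COST);
}

const atP80 = expectedEscalationCost(0.8); // 0.8*0.001 + 0.2*0.031 = $0.007

// Escalation beats always-premium when CHEAP + (1-p)*PREMIUM < PREMIUM,
// i.e. whenever p > CHEAP/PREMIUM.
const breakEvenP = CHEAP_COST / PREMIUM_COST; // ~0.033
```

With a 30x price gap, the cheap model only needs to suffice about 3% of the time for the escalation chain to beat routing everything to the premium model, which is why "try cheap first" is usually a safe default.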

Latency-Based Routing for SLA

Route based on latency budget to meet SLA requirements.

interface LatencySLA {
  p95_ms: number;
  p99_ms: number;
}

class LatencyBasedRouter {
  private modelLatencies: Record<string, { p95: number; p99: number }> = {
    'gpt-3.5-turbo': { p95: 800, p99: 1200 },
    'gpt-4': { p95: 2000, p99: 3000 },
    'gpt-4o': { p95: 2500, p99: 3500 },
    'claude-3-opus': { p95: 3000, p99: 4000 },
    'mistral-7b': { p95: 200, p99: 400 } // Local
  };

  selectModelForSLA(sla: LatencySLA): string {
    // Find models that meet SLA
    const candidates = Object.entries(this.modelLatencies)
      .filter(([, latency]) => latency.p95 <= sla.p95_ms && latency.p99 <= sla.p99_ms)
      .sort((a, b) => a[1].p95 - b[1].p95); // Prefer faster

    if (candidates.length === 0) {
      throw new Error(`No model meets SLA: p95=${sla.p95_ms}ms, p99=${sla.p99_ms}ms`);
    }

    return candidates[0][0];
  }

  // Estimate combined time: routing + LLM + response time
  estimateTotalLatency(model: string, query: string, estimatedOutputTokens: number = 500): number {
    const routingOverhead = 50; // ms
    const modelLatency = this.modelLatencies[model]?.p95 || 1000;
    const outputLatency = (estimatedOutputTokens / 100) * 20; // ~20ms per 100 tokens

    return routingOverhead + modelLatency + outputLatency;
  }
}
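The latency table above is hardcoded; in production those p95/p99 figures should come from observed samples. A minimal nearest-rank percentile sketch (one of several common percentile definitions):

```typescript
// Nearest-rank percentile: sort samples and take the value at ceil(pct/100 * n).
// pct is an integer percentage (e.g. 95) to avoid floating-point rank errors.
function percentile(samples: number[], pct: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((pct * sorted.length) / 100) - 1;
  return sorted[Math.max(0, rank)];
}

// 100 synthetic latency samples: 1ms..100ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);
const p95 = percentile(samples, 95); // 95
const p99 = percentile(samples, 99); // 99
```

Recomputing these on a sliding window keeps the router honest when a provider's latency drifts.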

A/B Testing Across Models

Measure model performance differences with live traffic.

interface ABTestResult {
  model_a: string;
  model_b: string;
  winner: string | null;
  metric: 'latency' | 'cost' | 'quality';
  improvement_percent: number;
  confidence: number;
}

class ABTestRunner {
  private results: Map<string, number[]> = new Map();

  recordVariant(variant: string, metric: number): void {
    if (!this.results.has(variant)) {
      this.results.set(variant, []);
    }
    this.results.get(variant)!.push(metric);
  }

  computeStats(variant: string): { mean: number; stddev: number; count: number } {
    const values = this.results.get(variant) || [];
    const mean = values.length > 0
      ? values.reduce((a, b) => a + b, 0) / values.length
      : 0;

    const variance = values.length > 0
      ? values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
      : 0;

    return {
      mean,
      stddev: Math.sqrt(variance),
      count: values.length
    };
  }

  compareVariants(variantA: string, variantB: string, minSamples: number = 100): ABTestResult {
    const statsA = this.computeStats(variantA);
    const statsB = this.computeStats(variantB);

    // Need sufficient samples for statistical significance
    if (statsA.count < minSamples || statsB.count < minSamples) {
      return {
        model_a: variantA,
        model_b: variantB,
        winner: null,
        metric: 'cost',
        improvement_percent: 0,
        confidence: 0
      };
    }

    // Lower is better for cost/latency metrics: a positive improvement favors B
    const improvement = ((statsA.mean - statsB.mean) / statsA.mean) * 100;
    const winner = improvement > 0 ? variantB : variantA;

    // T-test confidence
    const pooledStddev = Math.sqrt(
      (statsA.stddev ** 2 + statsB.stddev ** 2) / 2
    );
    const tScore = Math.abs(statsA.mean - statsB.mean) /
      (pooledStddev * Math.sqrt(1 / statsA.count + 1 / statsB.count));

    // Rough approximation of significance
    const confidence = tScore > 1.96 ? 0.95 : Math.min(0.95, tScore / 1.96);

    return {
      model_a: variantA,
      model_b: variantB,
      winner: confidence > 0.8 ? winner : null,
      metric: 'cost',
      improvement_percent: Math.abs(improvement),
      confidence
    };
  }
}
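The runner above records outcomes but leaves traffic splitting to the caller. A deterministic hash split keeps each user pinned to a stable variant across requests, so their metrics aren't polluted by variant flapping. FNV-1a is an arbitrary, dependency-free choice of hash here:

```typescript
// FNV-1a 32-bit hash: a stable, dependency-free way to bucket a user id.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Assign a user to 'A' or 'B' with a fixed traffic fraction for B.
function assignVariant(userId: string, fractionB: number = 0.5): 'A' | 'B' {
  const bucket = fnv1a(userId) % 1000; // 0..999
  return bucket < fractionB * 1000 ? 'B' : 'A';
}

// The same user always lands in the same variant.
const v1 = assignVariant('user-42');
const v2 = assignVariant('user-42');
```

Salting the hash with an experiment name (e.g. hashing `experimentId + ':' + userId`) lets concurrent experiments bucket users independently.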

Routing Metrics and Monitoring

Track routing effectiveness and model performance.

interface RoutingMetrics {
  total_queries: number;
  routes_by_model: Record<string, number>;
  avg_cost_per_query: number;
  cost_saved_vs_premium: number;
  escalation_rate: number;
}

class RoutingMetricsCollector {
  private routingDecisions: Array<{
    model: string;
    cost: number;
    latency_ms: number;
    timestamp: Date;
  }> = [];

  recordRouting(model: string, cost: number, latency: number): void {
    this.routingDecisions.push({
      model,
      cost,
      latency_ms: latency,
      timestamp: new Date()
    });
  }

  getMetrics(days: number = 1): RoutingMetrics {
    const cutoff = new Date();
    cutoff.setDate(cutoff.getDate() - days);

    const recentDecisions = this.routingDecisions.filter(
      d => d.timestamp >= cutoff
    );

    const routesByModel: Record<string, number> = {};
    let totalCost = 0;

    for (const decision of recentDecisions) {
      routesByModel[decision.model] = (routesByModel[decision.model] || 0) + 1;
      totalCost += decision.cost;
    }

    // Compare to always using the premium tier (assumes ~$0.15 per premium query)
    const premiumCostBaseline = recentDecisions.length * 0.15;
    const saved = Math.max(0, premiumCostBaseline - totalCost);

    // Proxy for escalation rate: unusually slow responses suggest upward re-routing
    const escalationCount = recentDecisions.filter(
      d => d.latency_ms > 2000
    ).length;

    return {
      total_queries: recentDecisions.length,
      routes_by_model: routesByModel,
      avg_cost_per_query: recentDecisions.length > 0 ? totalCost / recentDecisions.length : 0,
      cost_saved_vs_premium: saved,
      escalation_rate: recentDecisions.length > 0 ? escalationCount / recentDecisions.length : 0
    };
  }
}

Checklist

  • Classify query complexity with heuristics (length, code, math, questions)
  • Use cheap model (GPT-3.5) as classifier for routing decisions
  • Route simple queries to nano/micro tiers, complex to premium
  • Detect intent (support, coding, creative) and route by capability
  • Implement confidence-based escalation for low-confidence responses
  • Respect latency SLA and select fastest model that meets budget
  • A/B test model changes before full rollout
  • Monitor escalation rate to detect broken classifications
  • Calculate cost savings vs. baseline (always premium model)
  • Export routing metrics to analytics for dashboard visibility

Conclusion

Intelligent routing is one of the biggest levers for reducing LLM cost without sacrificing quality. Classify once, route decisively, and measure relentlessly. The savings compound: teams that route well can roughly halve their spend while retaining nearly all of their answer quality, but only if they keep measuring.