Long Context vs RAG — When to Stuff the Context and When to Retrieve

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Claude 3.5 Sonnet supports 200K tokens. GPT-4 Turbo supports 128K. The temptation: feed everything to the LLM and skip retrieval. But long context isn't free: input cost scales with token count, so a 200K-token prompt costs roughly 20x a 10K-token one, and documents buried in the middle of a 200K window are often ignored. This post explores when long context shines and when RAG remains superior.
- When Long Context Beats RAG
- When RAG Beats Long Context
- The Lost-in-the-Middle Problem
- 200K Context Window: Practical Limits
- Context Window Cost Comparison
- Hybrid Approach: Retrieve & Full-Document Context
- Late Interaction Models (ColBERT)
- Testing Retrieval vs. Context Stuffing
- Checklist
- Conclusion
When Long Context Beats RAG
Long context wins when the knowledge base is small or questions require full-document understanding.
Scenario 1: Whole-Codebase Q&A
For a 50KB codebase, stuff it all into context:
async function wholeCodebaseQA(
codebase: string,
question: string
): Promise<string> {
const response = await llm.generate({
prompt: `You have the full codebase:\n\n${codebase}\n\nQuestion: ${question}`,
model: 'claude-3-5-sonnet-200k',
maxTokens: 2000,
});
return response;
}
The LLM sees all code relationships, import chains, and architectural patterns in one pass. Retrieval can't match this holistic understanding.
Scenario 2: Single-Document Analysis
For legal contracts, research papers, or specifications, long context is ideal:
async function analyzeDocument(
document: string,
questions: string[]
): Promise<string[]> {
const answers = await Promise.all(
questions.map(q =>
llm.generate({
prompt: `Document:\n\n${document}\n\nQuestion: ${q}`,
model: 'claude-3-5-sonnet-200k',
})
)
);
return answers;
}
Skipping retrieval avoids chunk-boundary artifacts and index staleness, and lets the model resolve cross-references (definitions, clauses, citations) anywhere in the document.
When RAG Beats Long Context
RAG wins at scale and for cost-sensitive applications.
Scenario: 10M-Document Knowledge Base
Stuffing is impossible. RAG fetches <10 relevant documents from millions:
async function scalableQA(
query: string,
knowledgeBase: VectorStore
): Promise<string> {
// Retrieve 5 documents from 10M
const docs = await knowledgeBase.search(query, { topK: 5 });
const response = await llm.generate({
prompt: `Context:\n${docs.map(d => d.content).join('\n\n')}\n\nQuestion: ${query}`,
model: 'gpt-4-turbo',
maxTokens: 2000,
});
return response;
}
RAG inputs <10K tokens (the retrieved docs), saving >90% on API costs.
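To actually stay under a 10K-token budget, you can trim the retrieved list before prompting. A minimal sketch, assuming the same `{ content }` document shape used above and the rough 4-characters-per-token heuristic this post uses later (a real system would use the model's tokenizer):

```typescript
// Sketch: keep retrieved docs in rank order until the token budget is hit.
interface RetrievedDoc {
  content: string;
}

// Rough heuristic: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function trimToBudget(docs: RetrievedDoc[], maxTokens: number): RetrievedDoc[] {
  const kept: RetrievedDoc[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = estimateTokens(doc.content);
    if (used + cost > maxTokens) break; // stop before exceeding the budget
    kept.push(doc);
    used += cost;
  }
  return kept;
}
```

Because docs arrive sorted by relevance, truncating the tail drops the least useful context first.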
The Lost-in-the-Middle Problem
LLMs perform worse when relevant information sits in the middle of a long context window. The "Lost in the Middle" study (Liu et al., 2023) reported results roughly like:
- Position 1 (start): 90% retrieval accuracy
- Position 50% (middle): 25-50% accuracy
- Position 100% (end): 85-95% accuracy
The U-shape pattern means center information is often overlooked.
interface ContextPositionImpact {
position: 'start' | 'middle' | 'end';
accuracy: number; // 0-1
}
// To mitigate: place most important context at start and end
async function optimizeContextOrder(
query: string,
documents: Document[]
): Promise<string> {
const scored = documents.map(doc => ({
doc,
relevance: computeRelevance(query, doc),
}));
// Sort by relevance
scored.sort((a, b) => b.relevance - a.relevance);
// Place the most relevant docs at the edges: ranks 1-2 at the start,
// ranks 3-4 at the end, and the least relevant in the middle
const startDocs = scored.slice(0, 2).map(s => s.doc);
const endDocs = scored.slice(2, 4).map(s => s.doc);
const middleDocs = scored.slice(4).map(s => s.doc);
const orderedContext = [
...startDocs,
...middleDocs,
...endDocs,
]
.map(d => d.content)
.join('\n\n');
return await llm.generate({
prompt: `Context:\n${orderedContext}\n\nQuestion: ${query}`,
});
}
This reordering can improve accuracy by 10-20% without changing retrieval quality.
200K Context Window: Practical Limits
A 200K token window sounds unlimited. In practice, it's constrained by:
- Effective utilization: due to lost-in-the-middle, only ~5-10 documents (~40-50K tokens) are reliably attended to
- Cost: 200K input tokens at $3 per 1M tokens = $0.60 per request (vs. $0.03 for a 10K-token RAG call)
- Latency: longer context means slower generation (more compute per token)
For a customer-support chatbot answering 1000 queries/day:
- Long context: 1000 × $0.60 = $600/day
- RAG: 1000 × $0.03 = $30/day
RAG saves $570/day, making it 20x cheaper.
Context Window Cost Comparison
interface CostModel {
inputTokenPrice: number; // per 1M tokens
outputTokenPrice: number;
inputTokens: number;
outputTokens: number;
}
function compareCosts(longContext: CostModel, rag: CostModel): void {
const longContextCost =
(longContext.inputTokens / 1_000_000) * longContext.inputTokenPrice +
(longContext.outputTokens / 1_000_000) * longContext.outputTokenPrice;
const ragCost =
(rag.inputTokens / 1_000_000) * rag.inputTokenPrice +
(rag.outputTokens / 1_000_000) * rag.outputTokenPrice;
console.log(`Long Context: $${longContextCost.toFixed(4)}`);
console.log(`RAG: $${ragCost.toFixed(4)}`);
console.log(`Savings: ${(((longContextCost - ragCost) / longContextCost) * 100).toFixed(1)}%`);
}
// Example: 200K context vs. 10K RAG retrieval
compareCosts(
{ inputTokenPrice: 3, outputTokenPrice: 15, inputTokens: 200_000, outputTokens: 500 },
{ inputTokenPrice: 3, outputTokenPrice: 15, inputTokens: 10_000, outputTokens: 500 }
);
Hybrid Approach: Retrieve & Full-Document Context
Many teams use a hybrid: retrieve top documents, then feed their entire contents to the LLM.
async function hybridRetrieval(
query: string,
vectorDb: VectorStore,
options: { maxDocs: number; maxTokensPerDoc: number }
): Promise<string> {
// Step 1: Retrieve relevant documents
const chunks = await vectorDb.search(query, { topK: 20 });
// Step 2: Group chunks by document ID and fetch full documents
const docIds = new Set(chunks.map(c => c.docId));
const fullDocs = await Promise.all(
Array.from(docIds).map(id => vectorDb.getFullDocument(id))
);
// Step 3: Truncate to token budget
const truncatedDocs = fullDocs
.slice(0, options.maxDocs)
.map(doc => doc.content.substring(0, options.maxTokensPerDoc * 4)); // 4 chars ≈ 1 token
const context = truncatedDocs.join('\n\n---\n\n');
return await llm.generate({
prompt: `Context:\n${context}\n\nQuestion: ${query}`,
});
}
This captures document relationships while using retrieval to focus on relevant sources. Cost: 40-60K tokens instead of 200K, with better accuracy than truncated retrieval.
Late Interaction Models (ColBERT)
ColBERT is a late interaction model: instead of collapsing each document into a single vector, it stores per-token embeddings and delays the query-document interaction until scoring time, summing the maximum similarity (MaxSim) each query token achieves against the document's tokens.
interface ColBERTToken {
text: string;
embedding: number[]; // 128-dim
}
interface ColBERTDocument {
tokens: ColBERTToken[];
}
// Traditional bi-encoder: one vector per document, a single dot product at retrieval time
// ColBERT: per-token embeddings computed at index time; MaxSim per query token at query time
async function colbertSearch(
query: string,
documents: ColBERTDocument[]
): Promise<ColBERTDocument[]> {
const queryTokens = await llm.tokenizeAndEmbed(query);
// Compute token-level similarity across all documents
const scores = documents.map(doc => ({
doc,
score: queryTokens.reduce((sum, qt) => {
const maxSim = Math.max(...doc.tokens.map(dt => cosineSimilarity(qt.embedding, dt.embedding)));
return sum + maxSim;
}, 0),
}));
return scores
.sort((a, b) => b.score - a.score)
.slice(0, 10)
.map(s => s.doc);
}
ColBERT is more expensive to index and score (token-level matching) but more accurate, especially for queries where individual terms carry the meaning.
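The colbertSearch sketch above calls cosineSimilarity without defining it. A minimal implementation, assuming plain number[] vectors of equal length:

```typescript
// Cosine similarity between two equal-length vectors, as assumed by
// the MaxSim computation in the ColBERT sketch above.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // guard against zero vectors
}
```

In production you would precompute unit-normalized embeddings so this reduces to a dot product.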
Testing Retrieval vs. Context Stuffing
Build an A/B test framework:
async function compareApproaches(
query: string,
vectorDb: VectorStore,
fullDocument: string
): Promise<{ longContext: string; rag: string; costs: Record<string, number> }> {
const start = Date.now();
// Approach 1: Long context (full document)
const longContextResponse = await llm.generate({
prompt: `Document:\n\n${fullDocument}\n\nQuestion: ${query}`,
});
const longContextTime = Date.now() - start;
// Approach 2: RAG retrieval (reset the timer so latencies don't overlap)
const ragStart = Date.now();
const ragDocs = await vectorDb.search(query, { topK: 5 });
const ragContext = ragDocs.map(d => d.content).join('\n\n');
const ragResponse = await llm.generate({
prompt: `Context:\n${ragContext}\n\nQuestion: ${query}`,
});
const ragTime = Date.now() - ragStart;
return {
longContext: longContextResponse,
rag: ragResponse,
costs: {
longContextLatency: longContextTime,
ragLatency: ragTime,
},
};
}
Evaluate both on accuracy, cost, and latency. For most production systems, RAG dominates.
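To score accuracy in that A/B framework, you need some answer-quality metric. A simple keyword-overlap heuristic works as a stand-in (production evals typically use human labels or an LLM judge); the function below is a hypothetical helper, not part of the framework above:

```typescript
// Keyword recall: fraction of expected keywords that appear in the
// model's answer, case-insensitive. A crude but cheap accuracy proxy.
function keywordRecall(answer: string, expectedKeywords: string[]): number {
  if (expectedKeywords.length === 0) return 0;
  const lower = answer.toLowerCase();
  const hits = expectedKeywords.filter(k => lower.includes(k.toLowerCase()));
  return hits.length / expectedKeywords.length;
}
```

Run both approaches over a labeled query set and compare mean recall alongside cost and latency.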
Checklist
- For small, static knowledge bases that fit in the context window (roughly <500KB of text, ~125K tokens), use long context
- For large or frequently-updated bases, use RAG
- Be aware of lost-in-the-middle; reorder context by relevance
- Calculate cost per query—RAG usually wins by >10x
- Try hybrid approach: retrieve, then pass full documents
- Experiment with positioning (put important info at start/end)
- A/B test both approaches on your workload
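The checklist can be condensed into a decision helper. The thresholds below (150K-token headroom in a 200K window, update frequency as the staleness signal) are illustrative assumptions, not hard rules:

```typescript
// Hypothetical decision helper encoding the checklist above.
interface WorkloadProfile {
  corpusTokens: number;  // total knowledge base size in tokens
  updatesPerDay: number; // how often the corpus changes
}

function chooseApproach(p: WorkloadProfile): 'long-context' | 'rag' | 'hybrid' {
  const fitsInWindow = p.corpusTokens <= 150_000; // headroom in a 200K window
  const isStatic = p.updatesPerDay === 0;
  if (fitsInWindow && isStatic) return 'long-context';
  if (fitsInWindow) return 'hybrid'; // small but changing: retrieve, then pass full docs
  return 'rag';
}
```

Whatever this returns, treat it as a starting hypothesis to A/B test on your own workload.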
Conclusion
Long context is powerful but expensive and prone to overlooking center information. For scale, cost, and accuracy, RAG retrieves <10 relevant documents and lets the LLM focus. Use long context for small, static datasets and whole-document analysis. For everything else, RAG dominates.