Long Context vs RAG — When to Stuff the Context and When to Retrieve

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Claude 3.5 Sonnet supports 200K tokens. GPT-4 Turbo supports 128K. The temptation: feed everything to the LLM and skip retrieval. But long context isn't free: input cost scales with token count, so a 200K-token prompt costs roughly 20x a 10K-token one, and documents buried in the middle of a 200K window are often ignored. This post explores when long context shines and when RAG remains superior.
- When Long Context Beats RAG
- When RAG Beats Long Context
- The Lost-in-the-Middle Problem
- 200K Context Window: Practical Limits
- Context Window Cost Comparison
- Hybrid Approach: Retrieve & Full-Document Context
- Late Interaction Models (ColBERT)
- Testing Retrieval vs. Context Stuffing
- Checklist
- Conclusion
When Long Context Beats RAG
Long context wins when the knowledge base is small or questions require full-document understanding.
Scenario 1: Whole-Codebase Q&A
For a 50KB codebase, stuff it all into context:
async function wholeCodebaseQA(
codebase: string,
question: string
): Promise<string> {
const response = await llm.generate({
prompt: `You have the full codebase:\n\n${codebase}\n\nQuestion: ${question}`,
model: 'claude-3-5-sonnet-200k',
maxTokens: 2000,
});
return response;
}
The LLM sees all code relationships, import chains, and architectural patterns in one pass. Retrieval can't match this holistic understanding.
Scenario 2: Single-Document Analysis
For legal contracts, research papers, or specifications, long context is ideal:
async function analyzeDocument(
document: string,
questions: string[]
): Promise<string[]> {
const answers = await Promise.all(
questions.map(q =>
llm.generate({
prompt: `Document:\n\n${document}\n\nQuestion: ${q}`,
model: 'claude-3-5-sonnet-200k',
})
)
);
return answers;
}
Skipping retrieval avoids chunk-boundary artifacts and index staleness, and lets the model resolve cross-references (definitions, clauses, citations) anywhere in the document.
When RAG Beats Long Context
RAG wins at scale and for cost-sensitive applications.
Scenario: 10M-Document Knowledge Base
Stuffing is impossible. RAG fetches <10 relevant documents from millions:
async function scalableQA(
query: string,
knowledgeBase: VectorStore
): Promise<string> {
// Retrieve 5 documents from 10M
const docs = await knowledgeBase.search(query, { topK: 5 });
const response = await llm.generate({
prompt: `Context:\n${docs.map(d => d.content).join('\n\n')}\n\nQuestion: ${query}`,
model: 'gpt-4-turbo',
maxTokens: 2000,
});
return response;
}
RAG inputs <10K tokens (the retrieved docs), saving >90% on API costs.
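To actually stay under a 10K-token budget, you can trim the retrieved list before prompting. A minimal sketch, assuming the same `{ content }` document shape used above and the rough 4-characters-per-token heuristic this post uses later (a real system would use the model's tokenizer):

```typescript
// Sketch: keep retrieved docs in rank order until the token budget is hit.
interface RetrievedDoc {
  content: string;
}

// Rough heuristic: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function trimToBudget(docs: RetrievedDoc[], maxTokens: number): RetrievedDoc[] {
  const kept: RetrievedDoc[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = estimateTokens(doc.content);
    if (used + cost > maxTokens) break; // stop before exceeding the budget
    kept.push(doc);
    used += cost;
  }
  return kept;
}
```

Because docs arrive sorted by relevance, truncating the tail drops the least useful context first.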
The Lost-in-the-Middle Problem
LLMs perform worse when relevant information sits in the middle of a long context window. The "Lost in the Middle" study (Liu et al., 2023) reported results roughly like:
- Position 1 (start): 90% retrieval accuracy
- Position 50% (middle): 25-50% accuracy
- Position 100% (end): 85-95% accuracy
The U-shape pattern means center information is often overlooked.
interface ContextPositionImpact {
position: 'start' | 'middle' | 'end';
accuracy: number; // 0-1
}
// To mitigate: place most important context at start and end
async function optimizeContextOrder(
query: string,
documents: Document[]
): Promise<string> {
const scored = documents.map(doc => ({
doc,
relevance: computeRelevance(query, doc),
}));
// Sort by relevance
scored.sort((a, b) => b.relevance - a.relevance);
// Place the most relevant docs at the edges: ranks 1-2 at the start,
// ranks 3-4 at the end, and the least relevant in the middle
const startDocs = scored.slice(0, 2).map(s => s.doc);
const endDocs = scored.slice(2, 4).map(s => s.doc);
const middleDocs = scored.slice(4).map(s => s.doc);
const orderedContext = [
...startDocs,
...middleDocs,
...endDocs,
]
.map(d => d.content)
.join('\n\n');
return await llm.generate({
prompt: `Context:\n${orderedContext}\n\nQuestion: ${query}`,
});
}
This reordering can improve accuracy by 10-20% without changing retrieval quality.
200K Context Window: Practical Limits
A 200K token window sounds unlimited. In practice, it's constrained by:
- Effective utilization: due to lost-in-the-middle, only ~5-10 documents (~40-50K tokens) are reliably attended to
- Cost: 200K input tokens at $3 per 1M tokens = $0.60 per request (vs. $0.03 for a 10K-token RAG call)
- Latency: longer context means slower generation (more compute per token)
For a customer-support chatbot answering 1000 queries/day:
- Long context: 1000 × $0.60 = $600/day
- RAG: 1000 × $0.03 = $30/day
RAG saves $570/day, making it 20x cheaper.
Context Window Cost Comparison
interface CostModel {
inputTokenPrice: number; // per 1M tokens
outputTokenPrice: number;
inputTokens: number;
outputTokens: number;
}
function compareCosts(longContext: CostModel, rag: CostModel): void {
const longContextCost =
(longContext.inputTokens / 1_000_000) * longContext.inputTokenPrice +
(longContext.outputTokens / 1_000_000) * longContext.outputTokenPrice;
const ragCost =
(rag.inputTokens / 1_000_000) * rag.inputTokenPrice +
(rag.outputTokens / 1_000_000) * rag.outputTokenPrice;
console.log(`Long Context: $${longContextCost.toFixed(4)}`);
console.log(`RAG: $${ragCost.toFixed(4)}`);
console.log(`Savings: ${(((longContextCost - ragCost) / longContextCost) * 100).toFixed(1)}%`);
}
// Example: 200K context vs. 10K RAG retrieval
compareCosts(
{ inputTokenPrice: 3, outputTokenPrice: 15, inputTokens: 200_000, outputTokens: 500 },
{ inputTokenPrice: 3, outputTokenPrice: 15, inputTokens: 10_000, outputTokens: 500 }
);
Hybrid Approach: Retrieve & Full-Document Context
Many teams use a hybrid: retrieve top documents, then feed their entire contents to the LLM.
async function hybridRetrieval(
query: string,
vectorDb: VectorStore,
options: { maxDocs: number; maxTokensPerDoc: number }
): Promise<string> {
// Step 1: Retrieve relevant documents
const chunks = await vectorDb.search(query, { topK: 20 });
// Step 2: Group chunks by document ID and fetch full documents
const docIds = new Set(chunks.map(c => c.docId));
const fullDocs = await Promise.all(
Array.from(docIds).map(id => vectorDb.getFullDocument(id))
);
// Step 3: Truncate to token budget
const truncatedDocs = fullDocs
.slice(0, options.maxDocs)
.map(doc => doc.content.substring(0, options.maxTokensPerDoc * 4)); // 4 chars ≈ 1 token
const context = truncatedDocs.join('\n\n---\n\n');
return await llm.generate({
prompt: `Context:\n${context}\n\nQuestion: ${query}`,
});
}
This captures document relationships while using retrieval to focus on relevant sources. Cost: 40-60K tokens instead of 200K, with better accuracy than truncated retrieval.
Late Interaction Models (ColBERT)
ColBERT is a late interaction model: instead of collapsing each document into a single vector, it stores per-token embeddings and delays the query-document interaction until scoring time, summing the maximum similarity (MaxSim) each query token achieves against the document's tokens.
interface ColBERTToken {
text: string;
embedding: number[]; // 128-dim
}
interface ColBERTDocument {
tokens: ColBERTToken[];
}
// Traditional bi-encoder: one vector per document, a single dot product at retrieval time
// ColBERT: per-token embeddings computed at index time; MaxSim per query token at query time
async function colbertSearch(
query: string,
documents: ColBERTDocument[]
): Promise<ColBERTDocument[]> {
const queryTokens = await llm.tokenizeAndEmbed(query);
// Compute token-level similarity across all documents
const scores = documents.map(doc => ({
doc,
score: queryTokens.reduce((sum, qt) => {
const maxSim = Math.max(...doc.tokens.map(dt => cosineSimilarity(qt.embedding, dt.embedding)));
return sum + maxSim;
}, 0),
}));
return scores
.sort((a, b) => b.score - a.score)
.slice(0, 10)
.map(s => s.doc);
}
ColBERT is more expensive to index and score (token-level matching) but more accurate, especially for queries where individual terms carry the meaning.
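The colbertSearch sketch above calls cosineSimilarity without defining it. A minimal implementation, assuming plain number[] vectors of equal length:

```typescript
// Cosine similarity between two equal-length vectors, as assumed by
// the MaxSim computation in the ColBERT sketch above.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // guard against zero vectors
}
```

In production you would precompute unit-normalized embeddings so this reduces to a dot product.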
Testing Retrieval vs. Context Stuffing
Build an A/B test framework:
async function compareApproaches(
query: string,
vectorDb: VectorStore,
fullDocument: string
): Promise<{ longContext: string; rag: string; costs: Record<string, number> }> {
const start = Date.now();
// Approach 1: Long context (full document)
const longContextResponse = await llm.generate({
prompt: `Document:\n\n${fullDocument}\n\nQuestion: ${query}`,
});
const longContextTime = Date.now() - start;
// Approach 2: RAG retrieval (reset the timer so latencies don't overlap)
const ragStart = Date.now();
const ragDocs = await vectorDb.search(query, { topK: 5 });
const ragContext = ragDocs.map(d => d.content).join('\n\n');
const ragResponse = await llm.generate({
prompt: `Context:\n${ragContext}\n\nQuestion: ${query}`,
});
const ragTime = Date.now() - ragStart;
return {
longContext: longContextResponse,
rag: ragResponse,
costs: {
longContextLatency: longContextTime,
ragLatency: ragTime,
},
};
}
Evaluate both on accuracy, cost, and latency. For most production systems, RAG dominates.
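To score accuracy in that A/B framework, you need some answer-quality metric. A simple keyword-overlap heuristic works as a stand-in (production evals typically use human labels or an LLM judge); the function below is a hypothetical helper, not part of the framework above:

```typescript
// Keyword recall: fraction of expected keywords that appear in the
// model's answer, case-insensitive. A crude but cheap accuracy proxy.
function keywordRecall(answer: string, expectedKeywords: string[]): number {
  if (expectedKeywords.length === 0) return 0;
  const lower = answer.toLowerCase();
  const hits = expectedKeywords.filter(k => lower.includes(k.toLowerCase()));
  return hits.length / expectedKeywords.length;
}
```

Run both approaches over a labeled query set and compare mean recall alongside cost and latency.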
Checklist
- For small, static knowledge bases that fit in the context window (roughly <500KB of text, ~125K tokens), use long context
- For large or frequently-updated bases, use RAG
- Be aware of lost-in-the-middle; reorder context by relevance
- Calculate cost per query—RAG usually wins by >10x
- Try hybrid approach: retrieve, then pass full documents
- Experiment with positioning (put important info at start/end)
- A/B test both approaches on your workload
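The checklist can be condensed into a decision helper. The thresholds below (150K-token headroom in a 200K window, update frequency as the staleness signal) are illustrative assumptions, not hard rules:

```typescript
// Hypothetical decision helper encoding the checklist above.
interface WorkloadProfile {
  corpusTokens: number;  // total knowledge base size in tokens
  updatesPerDay: number; // how often the corpus changes
}

function chooseApproach(p: WorkloadProfile): 'long-context' | 'rag' | 'hybrid' {
  const fitsInWindow = p.corpusTokens <= 150_000; // headroom in a 200K window
  const isStatic = p.updatesPerDay === 0;
  if (fitsInWindow && isStatic) return 'long-context';
  if (fitsInWindow) return 'hybrid'; // small but changing: retrieve, then pass full docs
  return 'rag';
}
```

Whatever this returns, treat it as a starting hypothesis to A/B test on your own workload.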
Conclusion
Long context is powerful but expensive and prone to overlooking center information. For scale, cost, and accuracy, RAG retrieves <10 relevant documents and lets the LLM focus. Use long context for small, static datasets and whole-document analysis. For everything else, RAG dominates.