RAG Pipeline in Production — From Prototype to Reliable Retrieval-Augmented Generation
Sanjeev Sharma (@webcoderspeed1)
Introduction
Retrieval-Augmented Generation (RAG) promises accurate, grounded LLM responses. But moving from a weekend prototype to reliable production involves chunking decisions, embedding model selection, semantic search optimization, reranking for quality, citation accuracy, and hallucination safeguards. This post covers the full pipeline with production patterns.
- Chunking Strategies: Fixed vs Semantic
- Embedding Models: OpenAI vs Local
- pgvector Setup & Semantic Search
- Reranking Results for Quality
- Citation Tracking & Hallucination Detection
- Latency Optimization
- RAG Implementation Checklist
- Conclusion
Chunking Strategies: Fixed vs Semantic
The foundation of RAG is dividing documents into retrievable chunks. Naive fixed-size chunking loses semantic boundaries; intelligent chunking preserves meaning.
// Fixed-size chunking (naive, loses context)
function fixedChunking(text: string, chunkSize: number = 512, overlap: number = 100): string[] {
const chunks: string[] = [];
for (let i = 0; i < text.length; i += chunkSize - overlap) {
chunks.push(text.slice(i, i + chunkSize));
}
return chunks;
}
// Semantic chunking using sentence boundaries
function semanticChunking(text: string, maxChunkSize: number = 1024): string[] {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [];
const chunks: string[] = [];
let current = '';
for (const sentence of sentences) {
if ((current + sentence).length > maxChunkSize && current) {
chunks.push(current.trim());
current = sentence;
} else {
current += sentence;
}
}
if (current) chunks.push(current.trim());
return chunks;
}
// Recursive chunking with hierarchy (best for structured docs)
interface ChunkNode {
  content: string;
  metadata: { level: number; source: string };
  children?: ChunkNode[];
}
function recursiveChunking(
  text: string,
  maxSize: number = 1024,
  splitPatterns: RegExp[] = [/^## /m, /^### /m, /\n\n/],
  level: number = 0
): ChunkNode[] {
  // Base case: the text fits in one chunk, or there is no finer pattern left to split on
  if (text.length <= maxSize || level >= splitPatterns.length) {
    return [{ content: text.trim(), metadata: { level, source: 'doc' } }];
  }
  // Split on the coarsest remaining pattern, then recurse into any part that is still too large
  const parts = text.split(splitPatterns[level]).filter(p => p.trim().length > 0);
  const chunks: ChunkNode[] = [];
  for (const part of parts) {
    if (part.length <= maxSize) {
      chunks.push({ content: part.trim(), metadata: { level, source: 'doc' } });
    } else {
      chunks.push(...recursiveChunking(part, maxSize, splitPatterns, level + 1));
    }
  }
  return chunks;
}
Production insight: Use semantic chunking for narratives, recursive for documentation. Always maintain 15-20% overlap to preserve context at boundaries.
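To apply that overlap rule to the semantic chunker above, each chunk can be seeded with the tail of the previous one. A minimal sketch; the withOverlap helper and overlapRatio parameter are illustrative additions, not part of the original code:
// Add trailing-context overlap to already-split chunks.
// overlapRatio of 0.15-0.2 matches the 15-20% guideline above.
function withOverlap(chunks: string[], overlapRatio: number = 0.15): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const prev = chunks[i - 1];
    const overlapLen = Math.floor(prev.length * overlapRatio);
    // Prepend the tail of the previous chunk so boundary sentences keep their context
    return prev.slice(prev.length - overlapLen) + ' ' + chunk;
  });
}
// Usage: const chunks = withOverlap(semanticChunking(documentText), 0.2);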
Embedding Models: OpenAI vs Local
Embeddings drive search quality. Production systems must balance cost, latency, and accuracy.
import OpenAI from 'openai';
// The local path assumes a sentence-transformer style encoder (e.g. all-MiniLM-L6-v2) loaded into localModel
interface EmbeddingConfig {
provider: 'openai' | 'local';
model?: string;
batchSize: number;
cache: Map<string, number[]>;
}
class EmbeddingService {
  private openaiClient?: OpenAI;
  private config: EmbeddingConfig;
  private localModel: any; // local encoder, loaded when provider === 'local'
  constructor(config: EmbeddingConfig) {
    this.config = config;
    if (config.provider === 'openai') {
      this.openaiClient = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
    }
  }
async embedText(text: string): Promise<number[]> {
// Check cache first
const cached = this.config.cache.get(text);
if (cached) return cached;
let embedding: number[];
if (this.config.provider === 'openai') {
embedding = await this.embedWithOpenAI(text);
} else {
embedding = await this.embedWithLocal(text);
}
this.config.cache.set(text, embedding);
return embedding;
}
  private async embedWithOpenAI(text: string): Promise<number[]> {
    const response = await this.openaiClient!.embeddings.create({
      model: this.config.model || 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  }
private async embedWithLocal(text: string): Promise<number[]> {
// All-MiniLM-L6-v2 is free, fast, excellent for production
// Runs locally, no API calls, consistent latency
const encoded = await this.localModel.encode(text);
return Array.from(encoded.data);
}
async batchEmbed(texts: string[]): Promise<number[][]> {
const results: number[][] = [];
for (let i = 0; i < texts.length; i += this.config.batchSize) {
const batch = texts.slice(i, i + this.config.batchSize);
const embeddings = await Promise.all(batch.map(t => this.embedText(t)));
results.push(...embeddings);
}
return results;
}
}
// Cost comparison over 1M documents (assuming ~1,000 tokens each, i.e. ~1B tokens):
// OpenAI text-embedding-3-small: $0.02 per 1M tokens ≈ $20 for the corpus, plus per-request network latency
// Local All-MiniLM-L6-v2: no API cost; 384-dim vectors ≈ 1.5 GB of raw float storage, inference time per document
Production pattern: Use OpenAI for initial MVP (simplicity), migrate to local models once volume justifies infra. Local embeddings also eliminate API call latency (200ms → 50ms per query).
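If you take the local route, one option is to run All-MiniLM-L6-v2 in Node via the transformers.js feature-extraction pipeline. A minimal sketch, assuming the @xenova/transformers package (not used elsewhere in this post):
import { pipeline } from '@xenova/transformers';
// Loads the model once; subsequent calls reuse the cached weights.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
async function embedLocal(text: string): Promise<number[]> {
  // Mean-pool token embeddings and normalize to get a 384-dim sentence vector
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}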
pgvector Setup & Semantic Search
PostgreSQL with pgvector scales RAG cheaply. Cosine similarity finds relevant documents without external dependencies.
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Table design for documents
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(384), -- dimension must match your embedding model (384 for all-MiniLM-L6-v2, 1536 for text-embedding-3-small)
metadata JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
-- HNSW index for fast ANN (Approximate Nearest Neighbor)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
-- or IVFFlat for larger tables (>1M rows):
-- CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- For exact (non-approximate) KNN, e.g. while validating recall, simply query without an ANN index;
-- pgvector falls back to a sequential scan that computes true nearest neighbors.
import { Pool } from 'pg';
interface SearchResult {
id: number;
content: string;
similarity: number;
metadata: any;
}
class VectorStore {
private pool: Pool;
constructor(connectionString: string) {
this.pool = new Pool({ connectionString });
}
async addDocument(content: string, embedding: number[], metadata: any): Promise<void> {
await this.pool.query(
'INSERT INTO documents (content, embedding, metadata) VALUES ($1, $2, $3)',
[content, JSON.stringify(embedding), metadata]
);
}
async search(queryEmbedding: number[], limit: number = 5): Promise<SearchResult[]> {
const results = await this.pool.query<SearchResult>(
`SELECT
id,
content,
1 - (embedding <=> $1::vector) AS similarity,
metadata
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT $2`,
[JSON.stringify(queryEmbedding), limit]
);
return results.rows;
}
async hybridSearch(
queryEmbedding: number[],
keyword: string,
limit: number = 5
): Promise<SearchResult[]> {
const results = await this.pool.query<SearchResult>(
`SELECT
id,
content,
1 - (embedding <=> $1::vector) AS similarity,
metadata
FROM documents
WHERE content ILIKE '%' || $2 || '%'
ORDER BY embedding <=> $1::vector
LIMIT $3`,
[JSON.stringify(queryEmbedding), keyword, limit]
);
return results.rows;
}
async deleteDocument(id: number): Promise<void> {
await this.pool.query('DELETE FROM documents WHERE id = $1', [id]);
}
async close(): Promise<void> {
await this.pool.end();
}
}
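The hybridSearch above only filters by keyword before ranking on vector distance. If keyword relevance should influence the score itself, one approach is to blend Postgres full-text rank with cosine similarity. A sketch reusing the Pool and SearchResult types above; the 0.7/0.3 weights are arbitrary starting points worth tuning:
// Weighted hybrid search: combine semantic similarity with full-text rank in one score.
async function weightedHybridSearch(
  pool: Pool,
  queryEmbedding: number[],
  query: string,
  limit: number = 5
): Promise<SearchResult[]> {
  const results = await pool.query<SearchResult>(
    `SELECT
       id,
       content,
       0.7 * (1 - (embedding <=> $1::vector))
         + 0.3 * ts_rank(to_tsvector('english', content),
                         plainto_tsquery('english', $2)) AS similarity,
       metadata
     FROM documents
     ORDER BY similarity DESC
     LIMIT $3`,
    [JSON.stringify(queryEmbedding), query, limit]
  );
  return results.rows;
}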
Reranking Results for Quality
Initial semantic search returns candidates. Reranking uses cross-encoders to score relevance more accurately.
import axios from 'axios';
interface RankedResult {
id: number;
content: string;
similarity: number;
rerank_score: number;
final_score: number;
}
class RerankingService {
  private cohereApiKey: string;
  constructor(cohereApiKey: string) {
    this.cohereApiKey = cohereApiKey;
  }
async rerank(
query: string,
candidates: Array<{ id: number; content: string; similarity: number }>,
topK: number = 3
): Promise<RankedResult[]> {
if (candidates.length === 0) return [];
// Cohere Rerank API (production-grade)
const response = await axios.post(
'https://api.cohere.ai/v1/rerank',
{
model: 'rerank-english-v2.0',
query,
documents: candidates.map(c => c.content),
top_n: topK,
},
{ headers: { Authorization: `Bearer ${this.cohereApiKey}` } }
);
const ranked = response.data.results.map((result: any, idx: number) => ({
...candidates[result.index],
rerank_score: result.relevance_score,
final_score: candidates[result.index].similarity * 0.4 + result.relevance_score * 0.6,
}));
return ranked.sort((a, b) => b.final_score - a.final_score);
}
}
// Local reranking (no API cost, lower latency) using a cross-encoder loaded elsewhere
class LocalReranker {
  private model: any; // cross-encoder that scores (query, passage) pairs
async rerank(
query: string,
candidates: Array<{ id: number; content: string; similarity: number }>,
topK: number = 3
): Promise<RankedResult[]> {
const pairs = candidates.map(c => [query, c.content]);
const scores = await this.model.predict(pairs);
const ranked = candidates.map((candidate, idx) => ({
...candidate,
rerank_score: scores[idx],
final_score: candidate.similarity * 0.4 + scores[idx] * 0.6,
}));
return ranked.sort((a, b) => b.final_score - a.final_score).slice(0, topK);
}
}
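Wiring the two stages together: retrieval over-fetches candidates, then the reranker trims them down. A minimal usage sketch assuming the EmbeddingService, VectorStore, and RerankingService classes defined above:
// Retrieve a wide candidate set, then let the cross-encoder pick the best few.
async function retrieveAndRerank(
  query: string,
  embedder: EmbeddingService,
  store: VectorStore,
  reranker: RerankingService
): Promise<RankedResult[]> {
  const queryEmbedding = await embedder.embedText(query);
  // Over-fetch (e.g. 20 candidates) so the reranker has room to reorder
  const candidates = await store.search(queryEmbedding, 20);
  return reranker.rerank(query, candidates, 3);
}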
Citation Tracking & Hallucination Detection
Without proper tracking, LLMs fabricate sources. Implement citation anchoring and fact validation.
interface Citation {
chunk_id: number;
text: string;
source_doc: string;
start_position: number;
end_position: number;
}
interface GeneratedResponse {
text: string;
citations: Citation[];
confidence: number;
}
class CitationTracker {
async generateWithCitations(
query: string,
retrievedDocs: Array<{ id: number; content: string; source: string }>
): Promise<GeneratedResponse> {
const contextWindow = retrievedDocs
.map((doc, idx) => `[DOC_${idx}] ${doc.content}`)
.join('\n\n');
const systemPrompt = `You MUST cite facts using [DOC_N] tags. Format: "fact [DOC_3: page 2]"
Never invent citations. If unsure, say "I don't have data on this."`;
const prompt = `Context:\n${contextWindow}\n\nQuestion: ${query}`;
const response = await this.callLLM(systemPrompt, prompt);
// Extract citations from response
const citationRegex = /\[DOC_(\d+)(?:: (.+?))?\]/g;
const citations: Citation[] = [];
let match;
    while ((match = citationRegex.exec(response)) !== null) {
      const docIdx = parseInt(match[1], 10);
      const doc = retrievedDocs[docIdx];
      // Skip fabricated citations that point at documents we never retrieved
      if (!doc) continue;
      citations.push({
        chunk_id: doc.id,
        text: match[0],
        source_doc: doc.source,
        start_position: match.index,
        end_position: match.index + match[0].length,
      });
    }
return {
text: response,
citations,
confidence: this.calculateConfidence(citations, retrievedDocs),
};
}
private calculateConfidence(citations: Citation[], retrieved: any[]): number {
if (citations.length === 0) return 0.2; // No citations = low confidence
const citedDocs = new Set(citations.map(c => c.chunk_id));
return Math.min(1, citedDocs.size / retrieved.length);
}
private async callLLM(system: string, prompt: string): Promise<string> {
// Implementation detail
return '';
}
}
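Citation extraction alone does not catch fabricated references. A simple post-generation check is to flag any [DOC_N] tag that points outside the retrieved set and any response whose confidence falls below a floor. A minimal sketch; the 0.5 threshold is an arbitrary starting point:
interface ValidationResult {
  ok: boolean;
  issues: string[];
}
// Flag fabricated citation indices and weakly-grounded answers after generation.
function validateResponse(
  response: GeneratedResponse,
  retrievedCount: number,
  minConfidence: number = 0.5
): ValidationResult {
  const issues: string[] = [];
  const tagRegex = /\[DOC_(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = tagRegex.exec(response.text)) !== null) {
    const idx = parseInt(match[1], 10);
    if (idx >= retrievedCount) {
      issues.push(`Citation [DOC_${idx}] references a document that was never retrieved`);
    }
  }
  if (response.citations.length === 0) {
    issues.push('Response contains no citations; treat as ungrounded');
  }
  if (response.confidence < minConfidence) {
    issues.push(`Confidence ${response.confidence.toFixed(2)} is below threshold ${minConfidence}`);
  }
  return { ok: issues.length === 0, issues };
}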
Latency Optimization
RAG latency = embedding (50-300ms) + search (10-50ms) + LLM (500-3000ms). Optimize each layer.
interface OptimizationMetrics {
embedding_latency_ms: number;
search_latency_ms: number;
llm_latency_ms: number;
total_latency_ms: number;
cache_hit: boolean;
}
class OptimizedRAGPipeline {
  // Assumes a VectorStore and EmbeddingService (defined earlier) are injected via the constructor, omitted here
  private vectorStore!: VectorStore;
  private embeddingService!: EmbeddingService;
  private embeddingCache: Map<string, number[]> = new Map();
  private searchCache: Map<string, any[]> = new Map();
  private responseCache: Map<string, string> = new Map();
async executeQuery(query: string): Promise<{ response: string; metrics: OptimizationMetrics }> {
const metrics: OptimizationMetrics = {
embedding_latency_ms: 0,
search_latency_ms: 0,
llm_latency_ms: 0,
total_latency_ms: 0,
cache_hit: false,
};
const start = Date.now();
    // 1. Check response cache first (a hit skips embedding, search, and the LLM call entirely)
    const cached = this.responseCache.get(query);
    if (cached) {
      metrics.cache_hit = true;
      return { response: cached, metrics };
    }
    // 2. Embed the query (served from the embedding cache on repeat queries)
    const embedding = await this.getEmbedding(query);
    metrics.embedding_latency_ms = Date.now() - start;
// 3. Retrieve documents with timeout
const searchStart = Date.now();
const docs = await Promise.race([
this.vectorStore.search(embedding, 5),
this.timeout(500), // Fail-fast if search takes >500ms
]);
metrics.search_latency_ms = Date.now() - searchStart;
// 4. Stream LLM response
const llmStart = Date.now();
const response = await this.generateResponse(query, docs);
metrics.llm_latency_ms = Date.now() - llmStart;
metrics.total_latency_ms = Date.now() - start;
    // Cache the successful response; note a plain Map never expires entries,
    // so production deployments should use a TTL cache (e.g. Redis or lru-cache)
    this.responseCache.set(query, response);
return { response, metrics };
}
private async getEmbedding(text: string): Promise<number[]> {
if (this.embeddingCache.has(text)) {
return this.embeddingCache.get(text)!;
}
const embedding = await this.embeddingService.embedText(text);
this.embeddingCache.set(text, embedding);
return embedding;
}
private timeout(ms: number): Promise<never> {
return new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), ms));
}
private async generateResponse(query: string, docs: any[]): Promise<string> {
// LLM call with streaming for better UX
return '';
}
}
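The plain Map caches above grow without bound and never expire. A bounded TTL cache is safer in production; a sketch using the lru-cache package (an assumption here, it is not referenced elsewhere in this post):
import { LRUCache } from 'lru-cache';
// Bounded response cache: at most 1,000 entries, each expiring after 24 hours.
const responseCache = new LRUCache<string, string>({
  max: 1000,
  ttl: 1000 * 60 * 60 * 24,
});
responseCache.set('what is our refund policy?', 'Refunds are processed within 14 days...');
const hit = responseCache.get('what is our refund policy?'); // undefined once the TTL lapses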
RAG Implementation Checklist
- Choose chunking strategy appropriate for your document type
- Select embedding model (OpenAI for MVP, local for scale)
- Deploy pgvector with HNSW/IVFFlat indexes
- Implement semantic search with fallback to keyword search
- Add reranking stage using Cohere API or local model
- Track citations and validate against retrieved sources
- Implement caching at embedding, search, and response levels
- Monitor latency per component; p95 <2s for production
- Add hallucination detection via citation validation
- Set up error budgets and graceful degradation
Conclusion
Production RAG requires balancing retrieval quality with latency. Semantic chunking + local embeddings + pgvector + reranking + citation tracking + caching creates systems that scale to millions of documents while maintaining accuracy and speed. Start simple, measure every component, and optimize based on real production metrics.