
Building a Production-Ready RAG Pipeline

How we built a retrieval-augmented generation system serving 10K+ queries/day with sub-second latency.


Retrieval-augmented generation (RAG) has become the standard pattern for grounding LLM responses in proprietary data. But moving from a Jupyter notebook prototype to a production system serving 10K+ queries per day with sub-second latency requires solving problems that most tutorials don’t cover.

In this post, I’ll walk through the architecture, optimization decisions, and hard-won lessons from building a RAG pipeline that handles real production traffic.

Our RAG pipeline processes queries through five stages:

Query → Embedding → Vector Search → Reranking → LLM Generation → Response

Each stage has a strict latency budget so the end-to-end P99 stays under our 800ms target (stages overlap where the orchestrator can pipeline them, so the budgets don't need to sum to the target):

Stage            Budget   P99 Actual
Embedding        50ms     35ms
Vector Search    100ms    62ms
Reranking        150ms    110ms
LLM Generation   500ms    420ms
Orchestration    100ms    45ms

The orchestration layer runs in Python using FastAPI with async I/O, allowing us to pipeline stages where possible.
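
As a rough sketch of that entry point (the endpoint path and the embed_query / generate_answer helpers are assumptions, not our exact code; search_vectors and rerank_candidates are shown later in this post):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

@app.post("/ask")
async def ask(query: Query) -> dict:
    # Every await yields the event loop, so other requests make progress while
    # this one waits on the embedding service, Postgres, and the LLM API.
    embedding = await embed_query(query.text)                 # hypothetical embedding call
    candidates = await search_vectors(query.text, embedding)  # defined later in this post
    top_chunks = rerank_candidates(query.text, candidates)    # defined later in this post
    answer = await generate_answer(query.text, top_chunks)    # hypothetical LLM call
    return {"answer": answer}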

The single biggest factor in RAG quality is how you chunk your documents. We tested five strategies before settling on a hybrid approach:

import re

from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(document: str) -> list[str]:
    """Split on semantic boundaries, not arbitrary character counts."""
    # First split by headers (H1, H2, H3)
    sections = re.split(r'\n#{1,3}\s+', document)
    chunks = []
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,   # measured by the splitter's length function (characters by default)
        chunk_overlap=80,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    for section in sections:
        # Only re-split sections that are too large on their own
        if len(section) > 512:
            chunks.extend(splitter.split_text(section))
        else:
            chunks.append(section)
    return chunks

Key findings from our A/B testing:

  • 512 tokens with 80 token overlap gave the best retrieval precision
  • Header-aware splitting improved answer accuracy by 23% compared to naive splitting
  • Metadata enrichment (adding document title, section path, and last-modified date) helped the LLM provide more contextual responses (see the sketch below)
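
As a rough illustration of that enrichment (the field names here are assumptions, not our exact schema), each chunk is stored together with the metadata that later gets surfaced in the prompt:

def enrich_chunk(chunk: str, title: str, section_path: str, last_modified: str) -> dict:
    """Attach document-level metadata to a chunk before embedding and indexing."""
    return {
        "content": chunk,
        "metadata": {
            "title": title,                  # document title
            "section_path": section_path,    # e.g. "Deployment > Kubernetes"
            "last_modified": last_modified,  # ISO-8601 date string
        },
    }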

We benchmarked four vector databases before choosing pgvector for our workload. The decision came down to operational simplicity — we already ran PostgreSQL for our transactional data.
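
For context, here is a minimal sketch of the backing table and indexes (the table name matches the query below; column names, the embedding dimension, and the index parameters are assumptions): an approximate-nearest-neighbor index serves the vector half of the query, and a GIN index keeps the full-text half fast.

import asyncpg

async def create_schema(conn: asyncpg.Connection) -> None:
    # Hypothetical DDL: the 1536-dim embedding size and index settings are illustrative.
    await conn.execute("""
        CREATE EXTENSION IF NOT EXISTS vector;
        CREATE TABLE IF NOT EXISTS document_chunks (
            id        bigserial PRIMARY KEY,
            content   text NOT NULL,
            metadata  jsonb,
            embedding vector(1536)
        );
        -- Approximate-nearest-neighbor index for cosine distance
        CREATE INDEX IF NOT EXISTS idx_chunks_embedding
            ON document_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
        -- Keeps the full-text side of the hybrid query fast
        CREATE INDEX IF NOT EXISTS idx_chunks_fts
            ON document_chunks USING gin (to_tsvector('english', content));
    """)

The retrieval itself combines vector distance with Postgres full-text rank in a single query: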

import asyncpg

async def search_vectors(query: str, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Hybrid search combining vector similarity with full-text search."""
    # Assumes pgvector's codec is registered on the connection
    # (pgvector.asyncpg.register_vector) so the embedding list maps to the vector type.
    conn = await asyncpg.connect(DSN)
    try:
        rows = await conn.fetch("""
            SELECT *
            FROM (
                SELECT id, content, metadata,
                       embedding <=> $1::vector AS vector_score,
                       ts_rank(to_tsvector('english', content),
                               plainto_tsquery('english', $2)) AS text_score
                FROM document_chunks
                WHERE embedding <=> $1::vector < 0.7
            ) scored
            -- Lower is better: cosine distance weighted 70%, inverted text rank 30%
            ORDER BY vector_score * 0.7 + (1 - text_score) * 0.3
            LIMIT $3
        """, query_embedding, query, top_k * 2)  # over-fetch so the reranker has candidates
        return [dict(row) for row in rows]
    finally:
        await conn.close()

The hybrid scoring (70% vector similarity, 30% Postgres full-text rank) improved recall by 18% over pure vector search, especially for queries containing specific technical terms.

Raw vector search returns candidates that are semantically similar but not always relevant. We added a cross-encoder reranker as a second pass:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_candidates(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Score each (query, chunk) pair with the cross-encoder and keep the top_k best."""
    pairs = [(query, c['content']) for c in candidates]
    scores = reranker.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate['rerank_score'] = float(score)
    return sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)[:top_k]

This added ~110ms to our pipeline but reduced hallucination rates by 34%. The trade-off was worth it for our use case.
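
In the request handler, the two stages compose directly; a sketch (the function name is illustrative):

async def retrieve_context(query: str, query_embedding: list[float]) -> list[dict]:
    # Over-fetch from pgvector, then let the cross-encoder pick the final context.
    candidates = await search_vectors(query, query_embedding, top_k=5)  # returns top_k * 2 rows
    return rerank_candidates(query, candidates, top_k=5)                # keep the best 5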

To hit our latency targets, we implemented a multi-level caching strategy; the outermost level caches final responses for identical queries:

from diskcache import Cache
import hashlib

query_cache = Cache('/tmp/rag-cache')

def cached_search(query: str) -> str | None:
    """Cache final responses for identical queries."""
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    return query_cache.get(key)

def cache_response(query: str, response: str, ttl: int = 3600):
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    query_cache.set(key, response, expire=ttl)

Our cache hit rate is ~35% for internal knowledge base queries, which dramatically reduces LLM API costs and improves response times for common questions.
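
In the request path, the cache check sits in front of everything else (a sketch; answer_query is a hypothetical stand-in for the full embed → search → rerank → generate flow):

async def handle_query(query: str) -> str:
    # Serve repeated questions straight from the response cache.
    cached = cached_search(query)
    if cached is not None:
        return cached
    response = await answer_query(query)  # hypothetical: runs the full RAG pipeline
    cache_response(query, response, ttl=3600)
    return response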

We track RAG quality using three automated metrics:

  1. Context precision — are retrieved chunks actually relevant to the query?
  2. Faithfulness — does the generated answer stay within the retrieved context?
  3. Answer relevance — does the answer actually address the user’s question?

Using RAGAS for automated evaluation, we caught a 15% quality regression when we updated our embedding model — something we would have missed without continuous evaluation.
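
A minimal sketch of that evaluation loop, assuming RAGAS's evaluate() API (metric names and required dataset columns vary between RAGAS versions, and the sample rows here are purely illustrative):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# A small hand-checked sample of logged queries, retrieved contexts, and generated answers.
eval_data = Dataset.from_dict({
    "question":     ["How do we rotate the API signing keys?"],
    "contexts":     [["Signing keys are rotated from the admin console under Settings > Keys."]],
    "answer":       ["Rotate signing keys from the admin console under Settings > Keys."],
    "ground_truth": ["Keys are rotated via the admin console (Settings > Keys)."],
})

result = evaluate(eval_data, metrics=[context_precision, faithfulness, answer_relevancy])
print(result)  # per-metric scores between 0 and 1

Stepping back, the lessons that mattered most: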

  1. Chunking matters more than the embedding model — spend time on your splitting strategy
  2. Hybrid search beats pure vector search — combine semantic and keyword matching
  3. Reranking is worth the latency cost — cross-encoders dramatically improve precision
  4. Cache aggressively — many queries repeat, and caching saves both latency and cost
  5. Evaluate continuously — RAG quality degrades silently as your document corpus grows

Have questions about building RAG systems? Reach out on GitHub or Twitter.
