Building a Production-Ready RAG Pipeline
How we built a retrieval-augmented generation system serving 10K+ queries/day with sub-second latency.
Retrieval-augmented generation (RAG) has become the standard pattern for grounding LLM responses in proprietary data. But moving from a Jupyter notebook prototype to a production system serving 10K+ queries per day with sub-second latency requires solving problems that most tutorials don’t cover.
In this post, I’ll walk through the architecture, optimization decisions, and hard-won lessons from building a RAG pipeline that handles real production traffic.
Our RAG pipeline processes queries through five stages:
```
Query → Embedding → Vector Search → Reranking → LLM Generation → Response
```

Each stage has a strict latency budget to meet our P99 target of 800ms:
| Stage | Budget | P99 Actual |
|---|---|---|
| Embedding | 50ms | 35ms |
| Vector Search | 100ms | 62ms |
| Reranking | 150ms | 110ms |
| LLM Generation | 500ms | 420ms |
| Orchestration | 100ms | 45ms |
The orchestration layer runs in Python using FastAPI with async I/O, allowing us to pipeline stages where possible.
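Stripped down, that endpoint looks roughly like the sketch below. Treat it as illustrative: `embed_query` and `generate_answer` are placeholder names for our embedding and LLM clients, `search_vectors` and `rerank_candidates` correspond to the snippets later in this post, and the timeouts mirror the per-stage budgets above.

```python
# Illustrative only: embed_query and generate_answer are hypothetical stand-ins
# for the embedding and LLM clients; search_vectors and rerank_candidates match
# the snippets shown later in this post. asyncio.timeout requires Python 3.11+.
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.post("/query")
async def answer_query(payload: dict) -> dict:
    query = payload["query"]
    try:
        async with asyncio.timeout(0.05):   # embedding budget: 50ms
            embedding = await embed_query(query)
        async with asyncio.timeout(0.10):   # vector search budget: 100ms
            candidates = await search_vectors(query, embedding)
        async with asyncio.timeout(0.15):   # reranking budget: 150ms (CPU-bound, run off the event loop)
            chunks = await asyncio.to_thread(rerank_candidates, query, candidates)
        async with asyncio.timeout(0.50):   # LLM generation budget: 500ms
            answer = await generate_answer(query, chunks)
    except TimeoutError:
        raise HTTPException(status_code=504, detail="stage latency budget exceeded")
    return {"answer": answer, "sources": [c["id"] for c in chunks]}
```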
The single biggest factor in RAG quality is how you chunk your documents. We tested five strategies before settling on a hybrid approach:
```python
import re

from langchain.text_splitter import RecursiveCharacterTextSplitter


def semantic_chunk(document: str) -> list[str]:
    """Split on semantic boundaries, not arbitrary character counts."""
    # First split by headers (H1, H2, H3)
    sections = re.split(r'\n#{1,3}\s+', document)

    chunks = []
    # chunk_size/chunk_overlap are measured with len() (characters) unless a
    # custom length_function such as a tokenizer is supplied.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=80,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    for section in sections:
        if len(section) > 512:
            chunks.extend(splitter.split_text(section))
        else:
            chunks.append(section)

    return chunks
```

Key findings from our A/B testing:
- 512 tokens with 80 token overlap gave the best retrieval precision
- Header-aware splitting improved answer accuracy by 23% compared to naive splitting
- Metadata enrichment (adding document title, section path, and last-modified date) helped the LLM provide more contextual responses
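Metadata enrichment is mostly bookkeeping: we store a few fields alongside each chunk and prepend a compact header to the text the LLM sees. A rough sketch of the idea (the field names here are illustrative, not our exact schema):

```python
from datetime import date


def enrich_chunk(chunk: str, title: str, section_path: str, modified: date) -> dict:
    """Attach document metadata and prepend a compact header to the chunk text."""
    header = f"[{title} > {section_path} | updated {modified.isoformat()}]"
    return {
        "content": f"{header}\n{chunk}",
        "metadata": {
            "title": title,
            "section_path": section_path,
            "last_modified": modified.isoformat(),
        },
    }
```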
We benchmarked four vector databases before choosing pgvector for our workload. The decision came down to operational simplicity — we already ran PostgreSQL for our transactional data.
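For reference, the hybrid search below assumes a table shaped roughly like this. The DDL is a sketch: the index parameters are illustrative, HNSW support needs pgvector 0.5+, and the embedding dimension of 1536 is an assumption rather than a detail from our setup.

```python
# Rough sketch of the backing table and indexes the hybrid search assumes.
# Column names match the query below; index choices and the 1536-dim vector
# are illustrative. Applied once at deploy time, e.g. await conn.execute(SCHEMA_SQL).
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB NOT NULL DEFAULT '{}',
    embedding VECTOR(1536) NOT NULL
);

-- Approximate nearest-neighbour index for the <=> (cosine distance) operator.
CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
    ON document_chunks USING hnsw (embedding vector_cosine_ops);

-- Full-text index backing the ts_rank keyword score.
CREATE INDEX IF NOT EXISTS document_chunks_content_fts_idx
    ON document_chunks USING gin (to_tsvector('english', content));
"""
```

With that in place, the hybrid search itself: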
```python
import asyncpg
from pgvector.asyncpg import register_vector  # pgvector Python package


async def search_vectors(query: str, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Hybrid search combining vector similarity with full-text search."""
    conn = await asyncpg.connect(DSN)  # DSN: Postgres connection string, configured elsewhere
    try:
        # Register the pgvector codec so asyncpg can encode the embedding parameter.
        await register_vector(conn)
        rows = await conn.fetch(
            """
            SELECT id, content, metadata,
                   embedding <=> $1::vector AS vector_score,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', $2)) AS text_score
            FROM document_chunks
            WHERE embedding <=> $1::vector < 0.7
            -- Output aliases can't be used inside ORDER BY expressions, so the
            -- blended score repeats them: 70% vector distance, 30% keyword rank.
            ORDER BY (embedding <=> $1::vector) * 0.7
                   + (1 - ts_rank(to_tsvector('english', content),
                                  plainto_tsquery('english', $2))) * 0.3
            LIMIT $3
            """,
            query_embedding,
            extract_keywords(query),  # keywords come from the raw query text, not the embedding
            top_k * 2,                # over-fetch so the reranker has candidates to prune
        )
    finally:
        await conn.close()
    return [dict(row) for row in rows]
```

The hybrid scoring (70% vector similarity, 30% Postgres full-text rank) improved recall by 18% over pure vector search, especially for queries containing specific technical terms.
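One note on `extract_keywords`: it is a tiny helper, and since `plainto_tsquery` already normalizes and stems its input, it mostly just strips filler words. A hypothetical version, not our exact implementation:

```python
import re

# Hypothetical helper: plainto_tsquery handles stemming and normalization, so
# this only removes filler words before the query reaches the full-text search.
_STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "for", "is", "how", "what"}


def extract_keywords(query: str) -> str:
    tokens = re.findall(r"[\w.\-]+", query.lower())
    return " ".join(t for t in tokens if t not in _STOPWORDS)
```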
Raw vector search returns candidates that are semantically similar but not always relevant. We added a cross-encoder reranker as a second pass:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


def rerank_candidates(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # Score each (query, chunk) pair jointly with the cross-encoder.
    pairs = [(query, c['content']) for c in candidates]
    scores = reranker.predict(pairs)

    for candidate, score in zip(candidates, scores):
        candidate['rerank_score'] = float(score)

    # Keep only the highest-scoring chunks for the LLM context.
    return sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)[:top_k]
```

This added ~110ms to our pipeline but reduced hallucination rates by 34%. The trade-off was worth it for our use case.
To hit our latency targets, we implemented a multi-level caching strategy:
```python
import hashlib

from diskcache import Cache

query_cache = Cache('/tmp/rag-cache')


def cached_search(query: str) -> str | None:
    """Cache final responses for identical queries."""
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    return query_cache.get(key)


def cache_response(query: str, response: str, ttl: int = 3600):
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    query_cache.set(key, response, expire=ttl)
```

Our cache hit rate is ~35% for internal knowledge base queries, which dramatically reduces LLM API costs and improves response times for common questions.
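Wiring the cache into the request path is then a lookup before the pipeline runs and a write after it. In the sketch below, `run_pipeline` is a placeholder for the embed, search, rerank, and generate flow described above:

```python
# Sketch of how the cache wraps the pipeline; run_pipeline is a placeholder
# for the embed -> search -> rerank -> generate flow described above.
async def handle_query(query: str) -> str:
    if (cached := cached_search(query)) is not None:
        return cached

    response = await run_pipeline(query)
    cache_response(query, response)
    return response
```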
We track RAG quality using three automated metrics:
- Context precision — are retrieved chunks actually relevant to the query?
- Faithfulness — does the generated answer stay within the retrieved context?
- Answer relevance — does the answer actually address the user’s question?
Using RAGAS for automated evaluation, we caught a 15% quality regression when we updated our embedding model — something we would have missed without continuous evaluation.
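The harness itself is a small script that replays a fixed question set through the pipeline and scores the output with RAGAS. A sketch of the evaluation call follows; metric imports and required columns vary across RAGAS versions, so treat it as illustrative rather than a drop-in script:

```python
# Illustrative RAGAS evaluation sketch; exact metric names and required
# dataset columns differ between RAGAS versions, so check the docs for yours.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

samples = {
    "question": ["How do we rotate API keys?"],
    "answer": ["Keys are rotated automatically every 90 days via the secrets service."],
    "contexts": [["API keys are rotated every 90 days by the secrets service..."]],
    "ground_truth": ["API keys rotate automatically every 90 days."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[context_precision, faithfulness, answer_relevancy],
)
print(result)  # dict of metric name -> score, tracked per release
```

A few lessons that generalize: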
- Chunking matters more than the embedding model — spend time on your splitting strategy
- Hybrid search beats pure vector search — combine semantic and keyword matching
- Reranking is worth the latency cost — cross-encoders dramatically improve precision
- Cache aggressively — many queries repeat, and caching saves both latency and cost
- Evaluate continuously — RAG quality degrades silently as your document corpus grows
Have questions about building RAG systems? Reach out on GitHub or Twitter.