Building a Production-Ready RAG Pipeline
How we built a retrieval-augmented generation system serving 10K+ queries/day with sub-second latency.
Retrieval-augmented generation (RAG) has become the standard pattern for grounding LLM responses in proprietary data. But moving from a Jupyter notebook prototype to a production system serving 10K+ queries per day with sub-second latency requires solving problems that most tutorials don’t cover.
In this post, I’ll walk through the architecture, optimization decisions, and hard-won lessons from building a RAG pipeline that handles real production traffic.
Our RAG pipeline processes queries through five stages:
```
Query → Embedding → Vector Search → Reranking → LLM Generation → Response
```

Each stage has a strict latency budget to meet our P99 target of 800ms:
| Stage | Budget | P99 Actual |
|---|---|---|
| Embedding | 50ms | 35ms |
| Vector Search | 100ms | 62ms |
| Reranking | 150ms | 110ms |
| LLM Generation | 500ms | 420ms |
| Orchestration | 100ms | 45ms |
The orchestration layer runs in Python using FastAPI with async I/O, allowing us to pipeline stages where possible.
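Stripped down, that endpoint looks roughly like the sketch below. Treat it as illustrative: `embed_query` and `generate_answer` are placeholder names for our embedding and LLM clients, `search_vectors` and `rerank_candidates` correspond to the snippets later in this post, and the timeouts mirror the per-stage budgets above.

```python
# Illustrative only: embed_query and generate_answer are hypothetical stand-ins
# for the embedding and LLM clients; search_vectors and rerank_candidates match
# the snippets shown later in this post. asyncio.timeout requires Python 3.11+.
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.post("/query")
async def answer_query(payload: dict) -> dict:
    query = payload["query"]
    try:
        async with asyncio.timeout(0.05):   # embedding budget: 50ms
            embedding = await embed_query(query)
        async with asyncio.timeout(0.10):   # vector search budget: 100ms
            candidates = await search_vectors(query, embedding)
        async with asyncio.timeout(0.15):   # reranking budget: 150ms (CPU-bound, run off the event loop)
            chunks = await asyncio.to_thread(rerank_candidates, query, candidates)
        async with asyncio.timeout(0.50):   # LLM generation budget: 500ms
            answer = await generate_answer(query, chunks)
    except TimeoutError:
        raise HTTPException(status_code=504, detail="stage latency budget exceeded")
    return {"answer": answer, "sources": [c["id"] for c in chunks]}
```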
The single biggest factor in RAG quality is how you chunk your documents. We tested five strategies before settling on a hybrid approach:
```python
import re

from langchain.text_splitter import RecursiveCharacterTextSplitter


def semantic_chunk(document: str) -> list[str]:
    """Split on semantic boundaries, not arbitrary character counts."""
    # First split by headers (H1, H2, H3)
    sections = re.split(r'\n#{1,3}\s+', document)

    chunks = []
    # chunk_size/chunk_overlap are measured with len() (characters) unless a
    # custom length_function such as a tokenizer is supplied.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=80,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    for section in sections:
        if len(section) > 512:
            chunks.extend(splitter.split_text(section))
        else:
            chunks.append(section)

    return chunks
```

Key findings from our A/B testing:
- 512 tokens with 80 token overlap gave the best retrieval precision
- Header-aware splitting improved answer accuracy by 23% compared to naive splitting
- Metadata enrichment (adding document title, section path, and last-modified date) helped the LLM provide more contextual responses
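Metadata enrichment is mostly bookkeeping: we store a few fields alongside each chunk and prepend a compact header to the text the LLM sees. A rough sketch of the idea (the field names here are illustrative, not our exact schema):

```python
from datetime import date


def enrich_chunk(chunk: str, title: str, section_path: str, modified: date) -> dict:
    """Attach document metadata and prepend a compact header to the chunk text."""
    header = f"[{title} > {section_path} | updated {modified.isoformat()}]"
    return {
        "content": f"{header}\n{chunk}",
        "metadata": {
            "title": title,
            "section_path": section_path,
            "last_modified": modified.isoformat(),
        },
    }
```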
We benchmarked four vector databases before choosing pgvector for our workload. The decision came down to operational simplicity — we already ran PostgreSQL for our transactional data.
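For reference, the hybrid search below assumes a table shaped roughly like this. The DDL is a sketch: the index parameters are illustrative, HNSW support needs pgvector 0.5+, and the embedding dimension of 1536 is an assumption rather than a detail from our setup.

```python
# Rough sketch of the backing table and indexes the hybrid search assumes.
# Column names match the query below; index choices and the 1536-dim vector
# are illustrative. Applied once at deploy time, e.g. await conn.execute(SCHEMA_SQL).
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS document_chunks (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB NOT NULL DEFAULT '{}',
    embedding VECTOR(1536) NOT NULL
);

-- Approximate nearest-neighbour index for the <=> (cosine distance) operator.
CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx
    ON document_chunks USING hnsw (embedding vector_cosine_ops);

-- Full-text index backing the ts_rank keyword score.
CREATE INDEX IF NOT EXISTS document_chunks_content_fts_idx
    ON document_chunks USING gin (to_tsvector('english', content));
"""
```

With that in place, the hybrid search itself: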
```python
import asyncpg
from pgvector.asyncpg import register_vector  # pgvector Python package


async def search_vectors(query: str, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Hybrid search combining vector similarity with full-text search."""
    conn = await asyncpg.connect(DSN)  # DSN: Postgres connection string, configured elsewhere
    try:
        # Register the pgvector codec so asyncpg can encode the embedding parameter.
        await register_vector(conn)
        rows = await conn.fetch(
            """
            SELECT id, content, metadata,
                   embedding <=> $1::vector AS vector_score,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', $2)) AS text_score
            FROM document_chunks
            WHERE embedding <=> $1::vector < 0.7
            -- Output aliases can't be used inside ORDER BY expressions, so the
            -- blended score repeats them: 70% vector distance, 30% keyword rank.
            ORDER BY (embedding <=> $1::vector) * 0.7
                   + (1 - ts_rank(to_tsvector('english', content),
                                  plainto_tsquery('english', $2))) * 0.3
            LIMIT $3
            """,
            query_embedding,
            extract_keywords(query),  # keywords come from the raw query text, not the embedding
            top_k * 2,                # over-fetch so the reranker has candidates to prune
        )
    finally:
        await conn.close()
    return [dict(row) for row in rows]
```

The hybrid scoring (70% vector similarity, 30% Postgres full-text rank) improved recall by 18% over pure vector search, especially for queries containing specific technical terms.
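One note on `extract_keywords`: it is a tiny helper, and since `plainto_tsquery` already normalizes and stems its input, it mostly just strips filler words. A hypothetical version, not our exact implementation:

```python
import re

# Hypothetical helper: plainto_tsquery handles stemming and normalization, so
# this only removes filler words before the query reaches the full-text search.
_STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "for", "is", "how", "what"}


def extract_keywords(query: str) -> str:
    tokens = re.findall(r"[\w.\-]+", query.lower())
    return " ".join(t for t in tokens if t not in _STOPWORDS)
```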
Raw vector search returns candidates that are semantically similar but not always relevant. We added a cross-encoder reranker as a second pass:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


def rerank_candidates(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    # Score each (query, chunk) pair jointly with the cross-encoder.
    pairs = [(query, c['content']) for c in candidates]
    scores = reranker.predict(pairs)

    for candidate, score in zip(candidates, scores):
        candidate['rerank_score'] = float(score)

    # Keep only the highest-scoring chunks for the LLM context.
    return sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)[:top_k]
```

This added ~110ms to our pipeline but reduced hallucination rates by 34%. The trade-off was worth it for our use case.
To hit our latency targets, we implemented a multi-level caching strategy:
```python
import hashlib

from diskcache import Cache

query_cache = Cache('/tmp/rag-cache')


def cached_search(query: str) -> str | None:
    """Cache final responses for identical queries."""
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    return query_cache.get(key)


def cache_response(query: str, response: str, ttl: int = 3600):
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    query_cache.set(key, response, expire=ttl)
```

Our cache hit rate is ~35% for internal knowledge base queries, which dramatically reduces LLM API costs and improves response times for common questions.
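Wiring the cache into the request path is then a lookup before the pipeline runs and a write after it. In the sketch below, `run_pipeline` is a placeholder for the embed, search, rerank, and generate flow described above:

```python
# Sketch of how the cache wraps the pipeline; run_pipeline is a placeholder
# for the embed -> search -> rerank -> generate flow described above.
async def handle_query(query: str) -> str:
    if (cached := cached_search(query)) is not None:
        return cached

    response = await run_pipeline(query)
    cache_response(query, response)
    return response
```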
We track RAG quality using three automated metrics:
- Context precision — are retrieved chunks actually relevant to the query?
- Faithfulness — does the generated answer stay within the retrieved context?
- Answer relevance — does the answer actually address the user’s question?
Using RAGAS for automated evaluation, we caught a 15% quality regression when we updated our embedding model — something we would have missed without continuous evaluation.
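The harness itself is a small script that replays a fixed question set through the pipeline and scores the output with RAGAS. A sketch of the evaluation call follows; metric imports and required columns vary across RAGAS versions, so treat it as illustrative rather than a drop-in script:

```python
# Illustrative RAGAS evaluation sketch; exact metric names and required
# dataset columns differ between RAGAS versions, so check the docs for yours.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

samples = {
    "question": ["How do we rotate API keys?"],
    "answer": ["Keys are rotated automatically every 90 days via the secrets service."],
    "contexts": [["API keys are rotated every 90 days by the secrets service..."]],
    "ground_truth": ["API keys rotate automatically every 90 days."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[context_precision, faithfulness, answer_relevancy],
)
print(result)  # dict of metric name -> score, tracked per release
```

A few lessons that generalize: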
- Chunking matters more than the embedding model — spend time on your splitting strategy
- Hybrid search beats pure vector search — combine semantic and keyword matching
- Reranking is worth the latency cost — cross-encoders dramatically improve precision
- Cache aggressively — many queries repeat, and caching saves both latency and cost
- Evaluate continuously — RAG quality degrades silently as your document corpus grows
Have questions about building RAG systems? Reach out on GitHub or Twitter.