Prompt Engineering in Production
Lessons learned from building production LLM applications with proper prompt management.
When you have one LLM-powered feature, prompt engineering is a notebook exercise. When you have 500+ prompts across 12 microservices, it’s an infrastructure problem.
We learned this the hard way. After a model upgrade from GPT-3.5 to GPT-4, 47 prompts broke in production. Some produced garbled output. Others silently changed behavior. We had no way to test prompts before deploying them, and no versioning to roll back.
Here’s how we built a prompt management system that prevents those failures.
We moved all prompts out of application code and into a centralized registry stored in PostgreSQL with a Git-backed version control layer:
```sql
CREATE TABLE prompts (
    id UUID PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    service VARCHAR(100) NOT NULL,
    version INTEGER NOT NULL,
    template TEXT NOT NULL,
    model VARCHAR(50) NOT NULL,
    parameters JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    created_by VARCHAR(100),
    UNIQUE(name, service, version)
);

CREATE TABLE prompt_tests (
    id UUID PRIMARY KEY,
    prompt_id UUID REFERENCES prompts(id),
    test_input TEXT NOT NULL,
    expected_output TEXT,
    evaluation_criteria JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);
```

Each prompt has a name, owning service, version, template, model specification, and parameters. The version is auto-incremented on every change.
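The auto-increment can live in the insert itself. A minimal sketch of the idea, assuming PostgreSQL and the schema above (the exact SQL and the `next_version` helper are illustrations, not the registry's actual code):

```python
from typing import Optional

# Assign the next version for a (name, service) pair at insert time.
# With no prior rows, COALESCE(MAX(version), 0) + 1 yields 1. Two
# concurrent inserts could race to the same version; the
# UNIQUE(name, service, version) constraint rejects the loser.
INSERT_SQL = """
INSERT INTO prompts (id, name, service, version, template, model, parameters, created_by)
SELECT gen_random_uuid(), $1, $2,
       COALESCE(MAX(version), 0) + 1,
       $3, $4, $5, $6
FROM prompts
WHERE name = $1 AND service = $2
RETURNING version;
"""

def next_version(current_max: Optional[int]) -> int:
    """Pure-Python mirror of the COALESCE(MAX(version), 0) + 1 logic."""
    return (current_max or 0) + 1
```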
```python
class PromptRegistry:
    def get_prompt(self, name: str, service: str,
                   version: int | None = None) -> Prompt:
        """Fetch a specific version, or the latest if none is given."""
        if version is None:
            query = """
                SELECT * FROM prompts
                WHERE name = $1 AND service = $2
                ORDER BY version DESC
                LIMIT 1
            """
            return self.db.fetch(query, name, service)
        query = """
            SELECT * FROM prompts
            WHERE name = $1 AND service = $2 AND version = $3
        """
        return self.db.fetch(query, name, service, version)

    def deploy_prompt(self, prompt: Prompt) -> Prompt:
        """Deploy only after passing all tests."""
        results = self.run_tests(prompt)
        if not results.passed:
            raise PromptTestFailed(
                f"Prompt '{prompt.name}' failed tests: {results.failures}"
            )
        return self.db.insert(prompt)
```

Every prompt in the registry must have associated test cases, each evaluated with one of four test types:
```python
import json
from dataclasses import dataclass
from enum import Enum

class TestType(Enum):
    EXACT_MATCH = "exact_match"
    SEMANTIC_SIMILARITY = "semantic_similarity"
    STRUCTURE_VALIDATION = "structure_validation"
    SAFETY_CHECK = "safety_check"

@dataclass
class PromptTest:
    prompt_id: str
    input: str
    expected: str
    test_type: TestType
    threshold: float  # 0.0 to 1.0

@dataclass
class TestResult:
    score: float
    passed: bool

def evaluate_output(actual: str, test: PromptTest) -> TestResult:
    if test.test_type == TestType.EXACT_MATCH:
        score = 1.0 if actual.strip() == test.expected.strip() else 0.0
    elif test.test_type == TestType.SEMANTIC_SIMILARITY:
        # embed() and cosine_similarity() are our embedding helpers
        score = cosine_similarity(embed(actual), embed(test.expected))
    elif test.test_type == TestType.STRUCTURE_VALIDATION:
        try:
            json.loads(actual)
            score = 1.0
        except json.JSONDecodeError:
            score = 0.0
    elif test.test_type == TestType.SAFETY_CHECK:
        # contains_pii() is our PII scanner
        score = 1.0 if not contains_pii(actual) else 0.0
    else:
        raise ValueError(f"Unknown test type: {test.test_type}")
    return TestResult(score=score, passed=score >= test.threshold)
```

Our CI pipeline runs all prompt tests on every PR. If a prompt change causes any test to fail, the PR is blocked.
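The CI gate itself is simple. A minimal sketch of the blocking logic with in-memory stand-ins (the prompt names and result shape here are illustrative, not our pipeline's actual interface):

```python
def gate(results: list[tuple[str, float, float]]) -> int:
    """Return a CI exit code: 0 if every (prompt, score, threshold)
    entry passes, 1 otherwise. A non-zero exit blocks the PR."""
    failures = [(name, s, t) for name, s, t in results if s < t]
    for name, s, t in failures:
        print(f"FAIL {name}: score {s:.2f} below threshold {t:.2f}")
    return 1 if failures else 0

# A PR touching two prompts: one passes, one regresses below threshold.
exit_code = gate([
    ("summarize_ticket", 0.93, 0.85),   # hypothetical prompt names
    ("code_review_agent", 0.71, 0.85),
])
```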
Every prompt change creates a new version. We can roll back any prompt to any previous version instantly:
```python
def rollback_prompt(name: str, service: str, target_version: int):
    """Roll back a prompt to a previous version."""
    current = registry.get_prompt(name, service)
    target = registry.get_prompt(name, service, version=target_version)

    # Deploy the old version's content under a new version number
    new_version = Prompt(
        name=name,
        service=service,
        version=current.version + 1,
        template=target.template,
        model=target.model,
        parameters=target.parameters,
        created_by="rollback-system",
    )
    return registry.deploy_prompt(new_version)
```

This means rollbacks are just another deployment — they go through the same test pipeline and audit trail.
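In other words, a rollback is a roll-forward: the version history stays append-only. A toy model of the resulting audit trail (the entries and author names are purely illustrative):

```python
# Real entries live in the `prompts` table; this models the sequence.
history = [
    {"version": 22, "created_by": "alice"},
    {"version": 23, "created_by": "bob"},  # the bad change
]

def rollback(history, target_version):
    """A rollback re-deploys old content under a new, higher version."""
    assert any(e["version"] == target_version for e in history)
    new_entry = {
        "version": max(e["version"] for e in history) + 1,
        "created_by": "rollback-system",
    }
    return history + [new_entry]

history = rollback(history, target_version=22)
# history now ends with version 24 -- nothing is ever deleted
```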
Prompts are tied to specific models, but our services reference prompts by name, not model. This lets us swap models without changing application code:
```yaml
prompts:
  code_review_agent:
    production:
      model: gpt-4o
      version: 23
    staging:
      model: gpt-4o-mini
      version: 23
    canary:
      model: claude-sonnet-4-2026
      version: 1
```

We route traffic to different model versions using a simple proxy layer:
```python
class ModelRouter:
    def __init__(self, config: dict):
        self.routes = config["prompts"]

    def get_model(self, prompt_name: str, environment: str) -> str:
        return self.routes[prompt_name][environment]["model"]

    async def execute(self, prompt_name: str, input_data: dict,
                      environment: str = "production"):
        # Pin both the model and the prompt version for this environment
        route = self.routes[prompt_name][environment]
        prompt = registry.get_prompt(
            prompt_name, service=self.service, version=route["version"]
        )
        return await self.llm_client.chat(
            model=route["model"],
            messages=format_messages(prompt.template, input_data),
            **prompt.parameters,
        )
```

We track prompt performance in production with three key metrics:
- Output quality score — running a subset of test cases against production traffic
- Token cost per prompt — tracking cost trends as models and prompts evolve
- Latency distribution — per-prompt P50, P95, P99 latencies
```python
import random

def monitor_prompt_execution(prompt_name: str, result: LLMResponse):
    # metrics is our stats client (gauge/histogram wrappers)
    metrics.gauge(
        "prompt.token_cost",
        result.cost,
        tags={"prompt": prompt_name, "model": result.model},
    )
    metrics.histogram(
        "prompt.latency_ms",
        result.latency_ms,
        tags={"prompt": prompt_name},
    )

    # Sample 1% of responses for quality evaluation
    if random.random() < 0.01:
        quality = evaluate_quality(result.output, prompt_name)
        metrics.gauge(
            "prompt.quality_score", quality, tags={"prompt": prompt_name}
        )
```

Here's what changed after we rolled out the registry:

| Metric | Before Registry | After Registry |
|---|---|---|
| Prompt-related incidents | 12/month | 1/month |
| Time to rollback | 45 minutes | 30 seconds |
| Model swap time | 2 days | 5 minutes |
| Test coverage | 0% | 94% |
| Avg prompt iteration speed | 3 days | 4 hours |
- Prompts are code — treat them with the same rigor: versioning, testing, code review
- Test before deploy — never let a prompt reach production without automated tests
- Abstract the model — your services shouldn’t care which model runs the prompt
- Monitor in production — prompt quality drifts as models update and data changes
- Start small — begin with your most critical prompts, then expand coverage incrementally
Want to learn more about MLOps for LLMs? Check out our other posts on AI Engineering.