Distributed Tracing & Observability
Implementing distributed tracing and observability in microservices.
When you have 50+ microservices, a single user request can traverse a dozen services before returning a response. When that request fails or is slow, finding the root cause without distributed tracing is like debugging with your eyes closed.
We implemented OpenTelemetry across our entire platform over six months. Here’s the practical guide I wish we had when we started.
OpenTelemetry (OTel) is the industry standard for observability. It provides:
- Vendor-neutral instrumentation — switch backends without changing code
- Unified API for traces, metrics, and logs
- Automatic context propagation — trace IDs flow through HTTP, gRPC, and message queues
- Massive ecosystem — auto-instrumentation for 40+ popular libraries and frameworks
The foundation of distributed tracing is propagating trace context across service boundaries. OTel uses W3C Trace Context headers:
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  │                                │                └─ flags (sampled)
             │  │                                └─ parent span ID
             │  └─ trace ID
             └─ version
```

Every service must extract this header from incoming requests and inject it into outgoing calls:
```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
import httpx

tracer = trace.get_tracer("order-service")

async def process_order(order_id: str, headers: dict):
    # Extract trace context from incoming request
    ctx = extract(headers)

    with tracer.start_as_current_span("process_order", context=ctx) as span:
        span.set_attribute("order.id", order_id)

        # Call payment service — inject trace context
        outgoing_headers = {}
        inject(outgoing_headers)

        async with httpx.AsyncClient() as client:
            response = await client.post(
                "http://payment-service/api/charge",
                json={"order_id": order_id, "amount": 9999},
                headers=outgoing_headers,
            )

        span.set_attribute("payment.status", response.status_code)
        return response.json()
```

Before writing manual spans, enable auto-instrumentation. It captures HTTP clients, database queries, and framework handlers for free:
```python
# At application startup
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.asyncpg import AsyncPGInstrumentor

FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
AsyncPGInstrumentor().instrument()
```

This alone gave us visibility into 80% of our service interactions with zero code changes.
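The same instrumentation can also be attached with no code changes at all, via the `opentelemetry-instrument` wrapper from the `opentelemetry-distro` package (the service name, collector endpoint, and app module below are placeholders):

```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Scan installed packages and install matching instrumentation libraries
opentelemetry-bootstrap -a install

# Run the app under the agent; flags mirror the OTEL_* environment variables
opentelemetry-instrument \
  --service_name order-service \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  uvicorn app:app
```

The wrapper approach is handy for services you can't easily modify; the in-code approach gives finer control over which instrumentations load.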
Sampling is critical — tracing every request at scale is prohibitively expensive. We use a tiered approach:
```yaml
sampling:
  default:
    type: probabilistic
    rate: 0.01          # 1% of all requests
  errors:
    type: always_on     # 100% of failed requests
  slow:
    type: always_on     # 100% of requests > 1s
  high_value:
    type: always_on     # 100% of premium customer requests
```

```python
from opentelemetry.sdk.trace.sampling import (
    Decision,
    ParentBased,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

# Default: 1% sampling
default_sampler = ParentBased(TraceIdRatioBased(0.01))

# Custom sampler: always sample errors and slow requests
class SmartSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample if it's an error span
        if attributes.get("error"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample if it's slow
        if attributes.get("duration_ms", 0) > 1000:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Otherwise fall back to probabilistic sampling
        return default_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self):
        return "SmartSampler"
```

This gives us complete visibility into problems while keeping storage costs manageable.
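The probabilistic piece works because the sampling decision is a pure function of the trace ID, so every service in the call path independently reaches the same verdict with no coordination. A simplified sketch of the idea behind `TraceIdRatioBased` (an illustration, not the actual OTel implementation):

```python
import random

SAMPLE_RATE = 0.01
# Keep a trace when the low 64 bits of its ID fall under this threshold
THRESHOLD = int(SAMPLE_RATE * 2**64)

def keeps(trace_id: int) -> bool:
    # Same trace ID -> same answer in every service, no coordination needed
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < THRESHOLD

random.seed(42)
kept = sum(keeps(random.getrandbits(128)) for _ in range(100_000))
print(kept)  # roughly 1,000 of 100,000 traces, i.e. ~1%
```

Because the decision is deterministic per trace, a sampled trace is always complete: either every service keeps its spans or none do.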
Auto-instrumentation covers infrastructure, but you need manual spans for business logic:
```python
async def reconcile_payments(order_id: str):
    with tracer.start_as_current_span("reconcile_payments") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_gateway_records") as child:
            records = await fetch_from_gateways(order_id)
            child.set_attribute("records.count", len(records))

        with tracer.start_as_current_span("compare_balances") as child:
            mismatches = find_mismatches(records)
            child.set_attribute("mismatches.count", len(mismatches))

        if mismatches:
            span.set_attribute("reconciliation.status", "failed")
            span.add_event("reconciliation_failed", {
                "mismatch_count": len(mismatches),
            })
        else:
            span.set_attribute("reconciliation.status", "success")
```

Tracing across async boundaries requires explicit context propagation:
```python
# Producer: inject trace context into message headers
def publish_event(event: dict):
    headers = {}
    inject(headers)  # OTel propagation
    event['headers'] = headers
    kafka_producer.send('orders.events', value=event)

# Consumer: extract trace context and create span
@consumer('orders.events')
def handle_event(message):
    ctx = extract(message.headers)

    with tracer.start_as_current_span(
        "handle_order_event",
        context=ctx,
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("event.type", message.value['event_type'])
        process_event(message.value)
```

With traces in place, debugging becomes dramatically faster:
```
Trace: POST /api/orders (2.3s total)
├── api-gateway: handle_request (2.3s)
│   ├── auth: validate_token (12ms)
│   ├── rate-limiter: check (3ms)
│   └── order-service: create_order (2.2s)
│       ├── db: insert_order (45ms)
│       ├── payment-service: charge (1.8s)           ← SLOW
│       │   ├── stripe: create_payment_intent (1.7s) ← ROOT CAUSE
│       │   └── db: record_payment (50ms)
│       └── notification-service: send_email (200ms)
```

The trace immediately shows that Stripe’s create_payment_intent is the bottleneck — not our code.
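Reading a tree like this is really just comparing exclusive times: the culprit is the span whose duration is not explained by its children. That mechanical check can be sketched over the same data (span names and durations taken from the trace above):

```python
# (name, duration_ms, parent) for each span in the trace above
spans = [
    ("handle_request", 2300, None),
    ("validate_token", 12, "handle_request"),
    ("check", 3, "handle_request"),
    ("create_order", 2200, "handle_request"),
    ("insert_order", 45, "create_order"),
    ("charge", 1800, "create_order"),
    ("create_payment_intent", 1700, "charge"),
    ("record_payment", 50, "charge"),
    ("send_email", 200, "create_order"),
]

def self_time(name: str, duration: int) -> int:
    # Exclusive time: span duration minus time spent in its direct children
    return duration - sum(d for _, d, parent in spans if parent == name)

hotspots = sorted(((self_time(n, d), n) for n, d, _ in spans), reverse=True)
print(hotspots[0])  # (1700, 'create_payment_intent'): the Stripe call dominates
```

Trace UIs like Jaeger and Grafana Tempo surface exactly this self-time view, which is why the root cause jumps out without reading any code.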
- Start with auto-instrumentation — get 80% visibility with zero code changes
- Sample intelligently — always sample errors, probabilistically sample the rest
- Propagate context everywhere — HTTP, gRPC, message queues, even background jobs
- Add business attributes — order IDs, user IDs, and tenant IDs make traces actionable
- Set up trace-based alerts — alert on trace patterns, not just metrics
Questions about observability? Find me on GitHub or Twitter.