System Design: Real-Time Payment Processing at Scale
A deep dive into the architecture behind processing millions of payment transactions per day with sub-second latency and 99.99% availability.
Building a payment processing system that handles millions of transactions daily while maintaining sub-second latency and 99.99% availability is one of the hardest challenges in software engineering. Money movement leaves zero room for error.
In this post, I’ll break down the architecture of a real-time payment processing system, covering everything from API design to disaster recovery.
- Throughput: 10K transactions per second peak
- Latency: P99 < 500ms end-to-end
- Availability: 99.99% (about 52 minutes of downtime per year)
- Consistency: Exactly-once processing semantics
- Compliance: PCI DSS Level 1, SOC 2 Type II
- Idempotency: Safe retries on network failures
```
Client → API Gateway → Payment Orchestrator → [Fraud Check → Ledger → Settlement]
```

The system follows a pipeline architecture where each stage is independently scalable and fault-isolated.
The gateway handles:
- TLS termination with mutual TLS for partner integrations
- Request validation and schema enforcement
- Rate limiting per API key
- Idempotency key extraction
```
# Idempotency is enforced at the gateway level
POST /v1/payments
Headers:
  Idempotency-Key: uuid-v4
  X-Request-ID: uuid-v4
```

The orchestrator is the brain of the system. It:
- Validates the idempotency key (deduplicates retries)
- Routes to the appropriate payment rail (card, ACH, wire, real-time)
- Coordinates the fraud check, ledger update, and settlement stages
- Handles compensation on failure (Saga pattern)
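The first responsibility, idempotency dedup, amounts to a check-and-record against a shared store. In this sketch an in-memory map stands in for what would be Redis or a database in production; the names are illustrative assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore records responses keyed by Idempotency-Key, so a
// retried request replays the original result instead of re-executing.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> cached response
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]string)}
}

// Execute runs fn only the first time key is seen; the second return
// value reports whether this call was a duplicate.
func (s *IdempotencyStore) Execute(key string, fn func() string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.seen[key]; ok {
		return resp, true // duplicate: replay the stored response
	}
	resp := fn()
	s.seen[key] = resp
	return resp, false
}

func main() {
	store := NewIdempotencyStore()
	charge := func() string { return "pay_def456: charged $100" }

	first, dup1 := store.Execute("idem-key-1", charge)
	second, dup2 := store.Execute("idem-key-1", charge) // network retry
	fmt.Println(first == second, dup1, dup2)            // true false true
}
```

A production version would record an in-flight marker with a TTL so that concurrent retries block or fail fast rather than double-charging, and would persist entries beyond process restarts.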
We use a state machine to track payment lifecycle:
```
PENDING → VALIDATING → FRAUD_CHECK → LEDGER_UPDATE → SETTLEMENT → COMPLETED
                ↓
             FAILED → COMPENSATING → REVERSED
```

The most critical component is the ledger. Every transaction must be recorded with double-entry bookkeeping — every debit has a corresponding credit. This isn’t just best practice; it’s a regulatory requirement.
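The lifecycle can be enforced in code with an explicit transition table, so illegal jumps (say, PENDING straight to SETTLEMENT) are rejected. A minimal sketch, using the state names from the diagram; the Go types are assumptions:

```go
package main

import "fmt"

type PaymentState string

const (
	Pending      PaymentState = "PENDING"
	Validating   PaymentState = "VALIDATING"
	FraudCheck   PaymentState = "FRAUD_CHECK"
	LedgerUpdate PaymentState = "LEDGER_UPDATE"
	Settlement   PaymentState = "SETTLEMENT"
	Completed    PaymentState = "COMPLETED"
	Failed       PaymentState = "FAILED"
	Compensating PaymentState = "COMPENSATING"
	Reversed     PaymentState = "REVERSED"
)

// transitions lists the legal next states for each state; any
// in-flight stage may fail and enter the compensation path.
var transitions = map[PaymentState][]PaymentState{
	Pending:      {Validating},
	Validating:   {FraudCheck, Failed},
	FraudCheck:   {LedgerUpdate, Failed},
	LedgerUpdate: {Settlement, Failed},
	Settlement:   {Completed, Failed},
	Failed:       {Compensating},
	Compensating: {Reversed},
}

// CanTransition reports whether moving from -> to is allowed.
func CanTransition(from, to PaymentState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Validating)) // true
	fmt.Println(CanTransition(Pending, Settlement)) // false
}
```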
A relational database can serve as the ledger, but naive balance updates invite race conditions:
```sql
-- WRONG: Race condition between read and write
BEGIN;
SELECT balance FROM accounts WHERE id = ?;  -- reads 1000
-- concurrent transaction also reads 1000
UPDATE accounts SET balance = balance - 100 WHERE id = ?;  -- sets 900
-- other transaction also sets 900, but should be 800
COMMIT;
```

The correct approach uses row-level locking or optimistic concurrency control:
```sql
-- CORRECT: Atomic update with balance check
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = ? AND balance >= 100 AND version = ?;
```

We store every ledger change as an immutable event:
```json
{
  "event_id": "evt_abc123",
  "type": "debit",
  "account_id": "acc_xyz789",
  "amount": 10000,
  "currency": "USD",
  "reference": "pay_def456",
  "timestamp": "2026-04-28T10:30:00Z",
  "balance_after": 490000
}
```

This gives us:
- Complete audit trail for compliance
- Ability to reconstruct any account state at any point in time
- Natural integration with event-driven downstream systems
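The second benefit, reconstructing account state, falls out of a simple fold over the event stream. A sketch under the event shape above (amounts in minor units, as in the JSON example; the Go types are assumptions):

```go
package main

import "fmt"

// LedgerEvent mirrors the immutable event shape above (amounts in cents).
type LedgerEvent struct {
	Type      string // "debit" or "credit"
	AccountID string
	Amount    int64
	Timestamp string // RFC 3339, so lexicographic order is chronological
}

// BalanceAt folds events up to and including cutoff into a balance,
// reconstructing the account's state at that point in time.
func BalanceAt(events []LedgerEvent, accountID, cutoff string) int64 {
	var balance int64
	for _, e := range events {
		if e.AccountID != accountID || e.Timestamp > cutoff {
			continue
		}
		switch e.Type {
		case "credit":
			balance += e.Amount
		case "debit":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	events := []LedgerEvent{
		{"credit", "acc_xyz789", 500000, "2026-04-28T09:00:00Z"},
		{"debit", "acc_xyz789", 10000, "2026-04-28T10:30:00Z"},
	}
	fmt.Println(BalanceAt(events, "acc_xyz789", "2026-04-28T09:30:00Z")) // 500000
	fmt.Println(BalanceAt(events, "acc_xyz789", "2026-04-28T11:00:00Z")) // 490000
}
```

In practice you would replay from periodic snapshots rather than from the beginning of time, but the principle is the same.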
Fraud checks run in parallel with payment validation to minimize latency. Our fraud engine:
- Checks velocity rules (transactions per minute/hour/day)
- Runs ML-based risk scoring
- Validates device fingerprint against known patterns
- Checks against watchlists and sanctions databases
```go
type FraudCheck struct {
	VelocityRules  []VelocityRule
	RiskScore      float64
	DeviceTrust    DeviceTrustLevel
	SanctionsMatch bool
}

func (fc *FraudCheck) Decision() FraudDecision {
	if fc.SanctionsMatch {
		return Reject
	}
	if fc.RiskScore > 0.85 {
		return Review
	}
	if fc.ViolatesAnyVelocityRule() {
		return Reject
	}
	return Approve
}
```

We run in three regions with active-active traffic distribution:
- US-East (primary)
- US-West (secondary)
- EU-West (tertiary)
Each region can handle 100% of traffic independently. DNS-based failover switches traffic in under 60 seconds.
Ledger events are replicated asynchronously across regions using a conflict-free replicated data type (CRDT) approach. Since ledger events are append-only and ordered by timestamp, conflicts are rare and resolvable.
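Because events are append-only and totally ordered by (timestamp, event_id), cross-region merge reduces to a deterministic union: deduplicate by ID, then sort. A sketch of that resolution rule (types are assumptions):

```go
package main

import (
	"fmt"
	"sort"
)

// Event is an append-only ledger event; the ID breaks timestamp ties
// deterministically, so every region converges on the same order.
type Event struct {
	ID        string
	Timestamp string // RFC 3339
}

// Merge unions two regional logs, deduplicates by ID, and sorts by
// (timestamp, id). Merging in either direction yields the same log.
func Merge(a, b []Event) []Event {
	byID := make(map[string]Event)
	for _, e := range a {
		byID[e.ID] = e
	}
	for _, e := range b {
		byID[e.ID] = e
	}
	merged := make([]Event, 0, len(byID))
	for _, e := range byID {
		merged = append(merged, e)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Timestamp != merged[j].Timestamp {
			return merged[i].Timestamp < merged[j].Timestamp
		}
		return merged[i].ID < merged[j].ID
	})
	return merged
}

func main() {
	east := []Event{{"evt_a", "2026-04-28T10:00:00Z"}, {"evt_b", "2026-04-28T10:01:00Z"}}
	west := []Event{{"evt_b", "2026-04-28T10:01:00Z"}, {"evt_c", "2026-04-28T10:02:00Z"}}
	for _, e := range Merge(east, west) {
		fmt.Println(e.ID) // evt_a, evt_b, evt_c in both regions
	}
}
```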
- Ledger: RPO = 0 (synchronous replication within region)
- Analytics: RPO = 5 minutes (async cross-region)
- Full region failover: RTO < 2 minutes
- Database restore from backup: RTO < 15 minutes
We track four golden signals:
| Signal | Alert Threshold |
|---|---|
| Latency | P99 > 500ms for 5 minutes |
| Traffic | >20% deviation from baseline |
| Errors | >0.1% error rate for 2 minutes |
| Saturation | CPU > 80% for 10 minutes |
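The traffic signal's threshold, for instance, is a relative-deviation check against a rolling baseline. A minimal sketch; the 20% threshold comes from the table, everything else is illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// TrafficAnomalous reports whether current throughput deviates from the
// baseline by more than threshold (0.20 for the 20% rule above).
func TrafficAnomalous(current, baseline, threshold float64) bool {
	if baseline == 0 {
		return current != 0 // any traffic against an empty baseline is anomalous
	}
	return math.Abs(current-baseline)/baseline > threshold
}

func main() {
	fmt.Println(TrafficAnomalous(10500, 10000, 0.20)) // false: +5%
	fmt.Println(TrafficAnomalous(7500, 10000, 0.20))  // true: -25%
}
```

A real baseline would be seasonally adjusted (time of day, day of week) rather than a flat number.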
Every payment transaction is traced with OpenTelemetry, giving us end-to-end visibility across all services.
- Idempotency is non-negotiable — every endpoint must handle retries safely
- Double-entry bookkeeping prevents entire classes of bugs — treat money like money
- Event sourcing makes compliance audits trivial — regulators love immutable logs
- Design for failure — assume every dependency will fail and plan accordingly
- Measure everything — you can’t improve what you can’t observe
This is part of a series on building production systems. Next up: event-driven architecture patterns for financial services.