System Design: Real-Time Payment Processing at Scale
A deep dive into the architecture behind processing millions of payment transactions per day with sub-second latency and 99.99% availability.
Building a payment processing system that handles millions of transactions daily while maintaining sub-second latency and 99.99% availability is one of the hardest challenges in software engineering. Money movement leaves zero room for error.
In this post, I’ll break down the architecture of a real-time payment processing system, covering everything from API design to disaster recovery.
- Throughput: 10K transactions per second peak
- Latency: P99 < 500ms end-to-end
- Availability: 99.99% (about 52 minutes of downtime per year)
- Consistency: Exactly-once processing semantics
- Compliance: PCI DSS Level 1, SOC 2 Type II
- Idempotency: Safe retries on network failures
```
Client → API Gateway → Payment Orchestrator → [Fraud Check → Ledger → Settlement]
```

The system follows a pipeline architecture where each stage is independently scalable and fault-isolated.
The gateway handles:
- TLS termination with mutual TLS for partner integrations
- Request validation and schema enforcement
- Rate limiting per API key
- Idempotency key extraction
```
# Idempotency is enforced at the gateway level
POST /v1/payments
Headers:
  Idempotency-Key: uuid-v4
  X-Request-ID: uuid-v4
```

The orchestrator is the brain of the system. It:
- Validates the idempotency key (deduplicates retries)
- Routes to the appropriate payment rail (card, ACH, wire, real-time)
- Coordinates the fraud check, ledger update, and settlement stages
- Handles compensation on failure (Saga pattern)
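The first responsibility, idempotency dedup, amounts to a check-and-record against a shared store. In this sketch an in-memory map stands in for what would be Redis or a database in production; the names are illustrative assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore records responses keyed by Idempotency-Key, so a
// retried request replays the original result instead of re-executing.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> cached response
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]string)}
}

// Execute runs fn only the first time key is seen; the second return
// value reports whether this call was a duplicate.
func (s *IdempotencyStore) Execute(key string, fn func() string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if resp, ok := s.seen[key]; ok {
		return resp, true // duplicate: replay the stored response
	}
	resp := fn()
	s.seen[key] = resp
	return resp, false
}

func main() {
	store := NewIdempotencyStore()
	charge := func() string { return "pay_def456: charged $100" }

	first, dup1 := store.Execute("idem-key-1", charge)
	second, dup2 := store.Execute("idem-key-1", charge) // network retry
	fmt.Println(first == second, dup1, dup2)            // true false true
}
```

A production version would record an in-flight marker with a TTL so that concurrent retries block or fail fast rather than double-charging, and would persist entries beyond process restarts.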
We use a state machine to track payment lifecycle:
```
PENDING → VALIDATING → FRAUD_CHECK → LEDGER_UPDATE → SETTLEMENT → COMPLETED
                ↓
             FAILED → COMPENSATING → REVERSED
```

The most critical component is the ledger. Every transaction must be recorded with double-entry bookkeeping — every debit has a corresponding credit. This isn’t just best practice; it’s a regulatory requirement.
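The lifecycle can be enforced in code with an explicit transition table, so illegal jumps (say, PENDING straight to SETTLEMENT) are rejected. A minimal sketch, using the state names from the diagram; the Go types are assumptions:

```go
package main

import "fmt"

type PaymentState string

const (
	Pending      PaymentState = "PENDING"
	Validating   PaymentState = "VALIDATING"
	FraudCheck   PaymentState = "FRAUD_CHECK"
	LedgerUpdate PaymentState = "LEDGER_UPDATE"
	Settlement   PaymentState = "SETTLEMENT"
	Completed    PaymentState = "COMPLETED"
	Failed       PaymentState = "FAILED"
	Compensating PaymentState = "COMPENSATING"
	Reversed     PaymentState = "REVERSED"
)

// transitions lists the legal next states for each state; any
// in-flight stage may fail and enter the compensation path.
var transitions = map[PaymentState][]PaymentState{
	Pending:      {Validating},
	Validating:   {FraudCheck, Failed},
	FraudCheck:   {LedgerUpdate, Failed},
	LedgerUpdate: {Settlement, Failed},
	Settlement:   {Completed, Failed},
	Failed:       {Compensating},
	Compensating: {Reversed},
}

// CanTransition reports whether moving from -> to is allowed.
func CanTransition(from, to PaymentState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Validating)) // true
	fmt.Println(CanTransition(Pending, Settlement)) // false
}
```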
A relational database can serve as the ledger, but naive balance updates invite race conditions:
```sql
-- WRONG: Race condition between read and write
BEGIN;
SELECT balance FROM accounts WHERE id = ?;  -- reads 1000
-- concurrent transaction also reads 1000
UPDATE accounts SET balance = balance - 100 WHERE id = ?;  -- sets 900
-- other transaction also sets 900, but should be 800
COMMIT;
```

The correct approach uses row-level locking or optimistic concurrency control:
```sql
-- CORRECT: Atomic update with balance check
UPDATE accounts
SET balance = balance - 100, version = version + 1
WHERE id = ? AND balance >= 100 AND version = ?;
```

We store every ledger change as an immutable event:
```json
{
  "event_id": "evt_abc123",
  "type": "debit",
  "account_id": "acc_xyz789",
  "amount": 10000,
  "currency": "USD",
  "reference": "pay_def456",
  "timestamp": "2026-04-28T10:30:00Z",
  "balance_after": 490000
}
```

This gives us:
- Complete audit trail for compliance
- Ability to reconstruct any account state at any point in time
- Natural integration with event-driven downstream systems
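The second benefit, reconstructing account state, falls out of a simple fold over the event stream. A sketch under the event shape above (amounts in minor units, as in the JSON example; the Go types are assumptions):

```go
package main

import "fmt"

// LedgerEvent mirrors the immutable event shape above (amounts in cents).
type LedgerEvent struct {
	Type      string // "debit" or "credit"
	AccountID string
	Amount    int64
	Timestamp string // RFC 3339, so lexicographic order is chronological
}

// BalanceAt folds events up to and including cutoff into a balance,
// reconstructing the account's state at that point in time.
func BalanceAt(events []LedgerEvent, accountID, cutoff string) int64 {
	var balance int64
	for _, e := range events {
		if e.AccountID != accountID || e.Timestamp > cutoff {
			continue
		}
		switch e.Type {
		case "credit":
			balance += e.Amount
		case "debit":
			balance -= e.Amount
		}
	}
	return balance
}

func main() {
	events := []LedgerEvent{
		{"credit", "acc_xyz789", 500000, "2026-04-28T09:00:00Z"},
		{"debit", "acc_xyz789", 10000, "2026-04-28T10:30:00Z"},
	}
	fmt.Println(BalanceAt(events, "acc_xyz789", "2026-04-28T09:30:00Z")) // 500000
	fmt.Println(BalanceAt(events, "acc_xyz789", "2026-04-28T11:00:00Z")) // 490000
}
```

In practice you would replay from periodic snapshots rather than from the beginning of time, but the principle is the same.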
Fraud checks run in parallel with payment validation to minimize latency. Our fraud engine:
- Checks velocity rules (transactions per minute/hour/day)
- Runs ML-based risk scoring
- Validates device fingerprint against known patterns
- Checks against watchlists and sanctions databases
```go
type FraudCheck struct {
	VelocityRules  []VelocityRule
	RiskScore      float64
	DeviceTrust    DeviceTrustLevel
	SanctionsMatch bool
}

func (fc *FraudCheck) Decision() FraudDecision {
	if fc.SanctionsMatch {
		return Reject
	}
	if fc.RiskScore > 0.85 {
		return Review
	}
	if fc.ViolatesAnyVelocityRule() {
		return Reject
	}
	return Approve
}
```

We run in three regions with active-active traffic distribution:
- US-East (primary)
- US-West (secondary)
- EU-West (tertiary)
Each region can handle 100% of traffic independently. DNS-based failover switches traffic in under 60 seconds.
Ledger events are replicated asynchronously across regions using a conflict-free replicated data type (CRDT) approach. Since ledger events are append-only and ordered by timestamp, conflicts are rare and resolvable.
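Because events are append-only and totally ordered by (timestamp, event_id), cross-region merge reduces to a deterministic union: deduplicate by ID, then sort. A sketch of that resolution rule (types are assumptions):

```go
package main

import (
	"fmt"
	"sort"
)

// Event is an append-only ledger event; the ID breaks timestamp ties
// deterministically, so every region converges on the same order.
type Event struct {
	ID        string
	Timestamp string // RFC 3339
}

// Merge unions two regional logs, deduplicates by ID, and sorts by
// (timestamp, id). Merging in either direction yields the same log.
func Merge(a, b []Event) []Event {
	byID := make(map[string]Event)
	for _, e := range a {
		byID[e.ID] = e
	}
	for _, e := range b {
		byID[e.ID] = e
	}
	merged := make([]Event, 0, len(byID))
	for _, e := range byID {
		merged = append(merged, e)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Timestamp != merged[j].Timestamp {
			return merged[i].Timestamp < merged[j].Timestamp
		}
		return merged[i].ID < merged[j].ID
	})
	return merged
}

func main() {
	east := []Event{{"evt_a", "2026-04-28T10:00:00Z"}, {"evt_b", "2026-04-28T10:01:00Z"}}
	west := []Event{{"evt_b", "2026-04-28T10:01:00Z"}, {"evt_c", "2026-04-28T10:02:00Z"}}
	for _, e := range Merge(east, west) {
		fmt.Println(e.ID) // evt_a, evt_b, evt_c in both regions
	}
}
```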
- Ledger: RPO = 0 (synchronous replication within region)
- Analytics: RPO = 5 minutes (async cross-region)
- Full region failover: RTO < 2 minutes
- Database restore from backup: RTO < 15 minutes
We track four golden signals:
| Signal | Alert Threshold |
|---|---|
| Latency | P99 > 500ms for 5 minutes |
| Traffic | >20% deviation from baseline |
| Errors | >0.1% error rate for 2 minutes |
| Saturation | CPU > 80% for 10 minutes |
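The traffic signal's threshold, for instance, is a relative-deviation check against a rolling baseline. A minimal sketch; the 20% threshold comes from the table, everything else is illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// TrafficAnomalous reports whether current throughput deviates from the
// baseline by more than threshold (0.20 for the 20% rule above).
func TrafficAnomalous(current, baseline, threshold float64) bool {
	if baseline == 0 {
		return current != 0 // any traffic against an empty baseline is anomalous
	}
	return math.Abs(current-baseline)/baseline > threshold
}

func main() {
	fmt.Println(TrafficAnomalous(10500, 10000, 0.20)) // false: +5%
	fmt.Println(TrafficAnomalous(7500, 10000, 0.20))  // true: -25%
}
```

A real baseline would be seasonally adjusted (time of day, day of week) rather than a flat number.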
Every payment transaction is traced with OpenTelemetry, giving us end-to-end visibility across all services.
- Idempotency is non-negotiable — every endpoint must handle retries safely
- Double-entry bookkeeping prevents entire classes of bugs — treat money like money
- Event sourcing makes compliance audits trivial — regulators love immutable logs
- Design for failure — assume every dependency will fail and plan accordingly
- Measure everything — you can’t improve what you can’t observe
This is part of a series on building production systems. Next up: event-driven architecture patterns for financial services.