CodeWithNaman Blog

Building a Real-Time Stock Trading Engine

Low-latency order matching engine handling 1M+ orders/day with deterministic execution.

FinTech, Engineering · 2 min read

A trading engine is one of the most demanding systems you can build. It must match buy and sell orders in microseconds, maintain perfect consistency, and never lose an order. We built an order matching engine in Rust that handles 1M+ orders daily with deterministic execution and P99 latency under 50 microseconds.

Here’s the architecture and the decisions that mattered.

The matching engine runs as a single-threaded event loop. This eliminates concurrency bugs and ensures deterministic ordering — the same sequence of inputs always produces the same sequence of outputs.

use std::collections::HashMap;

struct TradingEngine {
    order_books: HashMap<Symbol, OrderBook>,
    event_log: EventLog,
    sequence: u64,
}

impl TradingEngine {
    fn process_event(&mut self, event: OrderEvent) -> Vec<ExecutionReport> {
        self.sequence += 1;
        // Borrow the event here so it can still be logged below
        let reports = match &event {
            OrderEvent::NewOrder(order) => self
                .order_books
                .entry(order.symbol)
                .or_insert_with(|| OrderBook::new(order.symbol))
                .add_order(order.clone(), self.sequence),
            OrderEvent::CancelOrder(cancel) => self
                .order_books
                .get_mut(&cancel.symbol)
                .map(|book| book.cancel_order(cancel))
                .unwrap_or_default(),
        };
        // Log every event and its reports for replay
        self.event_log.append(self.sequence, &event, &reports);
        reports
    }
}

Single-threaded might sound like a limitation, but a modern CPU core can process millions of simple operations per second. Our bottleneck is never CPU — it’s network I/O.

The order book maintains buy and sell orders with price-time priority:

use std::collections::{BTreeMap, HashMap, VecDeque};

struct OrderBook {
    symbol: Symbol,
    // BTreeMap keeps keys in ascending order: the best ask is the
    // first (lowest) key, the best bid is the last (highest) key.
    bids: BTreeMap<Price, PriceLevel>,
    asks: BTreeMap<Price, PriceLevel>,
    order_index: HashMap<OrderId, (Price, Side)>,
}

struct PriceLevel {
    orders: VecDeque<Order>,
    total_volume: u64,
}
impl OrderBook {
    fn add_order(&mut self, mut order: Order, sequence: u64) -> Vec<ExecutionReport> {
        let mut reports = Vec::new();
        // Match against the opposite side of the book
        let opposite = if order.side == Side::Buy {
            &mut self.asks
        } else {
            &mut self.bids
        };
        while order.remaining_volume() > 0 {
            // Best ask = lowest key; best bid = highest key
            let best_price = match order.side {
                Side::Buy => opposite.keys().next().copied(),
                Side::Sell => opposite.keys().next_back().copied(),
            };
            let Some(best) = best_price else { break };
            // Stop once the best opposite price no longer crosses our limit
            if order.side == Side::Buy && best > order.price { break; }
            if order.side == Side::Sell && best < order.price { break; }
            // Execute against the oldest resting order (time priority)
            let level = opposite.get_mut(&best).unwrap();
            let matched = level.orders.front_mut().unwrap();
            let fill_qty = order.remaining_volume().min(matched.remaining_volume());
            matched.fill(fill_qty, sequence);
            order.fill(fill_qty, sequence); // fill the aggressor too, or this loops forever
            level.total_volume -= fill_qty;
            reports.push(ExecutionReport::fill(order.id, matched.id, best, fill_qty));
            if matched.is_filled() {
                level.orders.pop_front();
                if level.orders.is_empty() {
                    opposite.remove(&best);
                }
            }
        }
        // Whatever remains rests on the book
        if order.remaining_volume() > 0 {
            let order_id = order.id;
            self.add_to_book(order);
            reports.push(ExecutionReport::accepted(order_id));
        }
        reports
    }
}

Using BTreeMap gives us O(log n) price lookups and automatic price ordering. For the highest-performance implementations, you’d use a fixed-size array with direct indexing, but BTreeMap is fast enough for most use cases.
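As a sketch of what that array-based alternative could look like: prices are mapped to ticks, and each tick indexes directly into a fixed-size vector, giving O(1) inserts and best-price lookups. The `PriceLadder` type, tick size, and price range here are illustrative values, not the engine's actual parameters.

```rust
// Illustrative constants: a $100.00 floor in cents, one-cent ticks
const MIN_PRICE: u64 = 10_000;
const TICK_SIZE: u64 = 1;
const LADDER_SIZE: usize = 100_000;

struct PriceLadder {
    volumes: Vec<u64>,   // resting volume per tick
    best_ask_idx: usize, // lowest tick with volume, tracked incrementally
}

impl PriceLadder {
    fn new() -> Self {
        PriceLadder { volumes: vec![0; LADDER_SIZE], best_ask_idx: LADDER_SIZE }
    }

    fn tick(price: u64) -> usize {
        ((price - MIN_PRICE) / TICK_SIZE) as usize
    }

    // O(1) insert: bump the volume at the tick, update the cached best
    fn add(&mut self, price: u64, qty: u64) {
        let i = Self::tick(price);
        self.volumes[i] += qty;
        if i < self.best_ask_idx {
            self.best_ask_idx = i;
        }
    }

    // O(1) best-price lookup, versus O(log n) for BTreeMap
    fn best_ask(&self) -> Option<u64> {
        (self.best_ask_idx < LADDER_SIZE)
            .then(|| MIN_PRICE + self.best_ask_idx as u64 * TICK_SIZE)
    }
}
```

The trade-off is memory: the ladder reserves a slot for every representable price, which only pays off for dense, heavily traded symbols.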

The engine publishes real-time market data for every state change:

use std::net::UdpSocket;
use std::time::{SystemTime, UNIX_EPOCH};

struct MarketDataPublisher {
    udp_socket: UdpSocket,
    sequence: u64,
}

impl MarketDataPublisher {
    fn publish_trade(&mut self, trade: Trade) {
        self.sequence += 1;
        let msg = TradeMessage {
            msg_type: MessageType::Trade,
            sequence: self.sequence,
            // Wall-clock nanos; Instant is opaque and can't be serialized
            timestamp: SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap()
                .as_nanos() as u64,
            symbol: trade.symbol,
            price: trade.price,
            quantity: trade.quantity,
            aggressor: trade.aggressor_side,
        };
        // Serialize to binary format (no JSON — too slow)
        let bytes = msg.to_bytes();
        self.udp_socket.send_to(&bytes, "239.0.0.1:5001").unwrap();
    }
}

Market data is published via UDP multicast. Subscribers (trading algorithms, dashboards, risk systems) listen to the multicast group and receive updates in real-time.
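A subscriber could look roughly like this sketch: bind to the multicast port, join the group, and decode fixed-width binary messages. The wire layout in `parse_trade` is a made-up encoding for illustration; the post doesn't show the engine's actual `TradeMessage` byte format.

```rust
use std::convert::TryInto;
use std::net::{Ipv4Addr, UdpSocket};

struct TradeUpdate {
    sequence: u64,
    price: u64,
    quantity: u32,
}

// Hypothetical layout: [0] = msg type (1 = trade), [1..9] = sequence,
// [9..17] = price, [17..21] = quantity, all big-endian.
fn parse_trade(buf: &[u8]) -> Option<TradeUpdate> {
    if buf.len() < 21 || buf[0] != 1 {
        return None;
    }
    Some(TradeUpdate {
        sequence: u64::from_be_bytes(buf[1..9].try_into().ok()?),
        price: u64::from_be_bytes(buf[9..17].try_into().ok()?),
        quantity: u32::from_be_bytes(buf[17..21].try_into().ok()?),
    })
}

fn subscribe() -> std::io::Result<UdpSocket> {
    // Bind the multicast port, then join the group on the default interface
    let socket = UdpSocket::bind("0.0.0.0:5001")?;
    socket.join_multicast_v4(&Ipv4Addr::new(239, 0, 0, 1), &Ipv4Addr::UNSPECIFIED)?;
    Ok(socket)
}
```

Subscribers should also watch the sequence field: a gap means a dropped packet, since UDP multicast gives no delivery guarantees.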

The engine maintains an append-only event log for crash recovery:

use std::fs::File;
use std::io::Write;
use std::time::SystemTime;

struct EventLog {
    file: File,
    buffer: Vec<u8>,
}

impl EventLog {
    fn append(&mut self, sequence: u64, event: &OrderEvent, reports: &[ExecutionReport]) {
        let entry = LogEntry {
            sequence,
            timestamp: SystemTime::now(), // wall-clock time, serializable
            event: event.clone(),
            reports: reports.to_vec(),
        };
        self.buffer.extend(entry.to_bytes());
        // Flush once the buffer exceeds 64 KiB (a timer also flushes every 10ms)
        if self.buffer.len() > 65536 {
            self.flush();
        }
    }

    fn flush(&mut self) {
        self.file.write_all(&self.buffer).unwrap();
        self.file.sync_all().unwrap(); // fsync — durable before we ack
        self.buffer.clear();
    }
}

On restart, the engine replays the event log to reconstruct the order book state. A full day’s activity (~1M events) replays in under 30 seconds.
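The replay loop can be sketched as follows. The `Event` and `ReplayEngine` types here are simplified stand-ins for the real log format and `TradingEngine`; the shape is what matters: apply each logged event in sequence order, checking for gaps that would indicate a corrupt or truncated log.

```rust
use std::collections::HashSet;

enum Event {
    New(u64),    // order id
    Cancel(u64), // order id
}

struct ReplayEngine {
    sequence: u64,
    open_orders: HashSet<u64>,
}

impl ReplayEngine {
    fn new() -> Self {
        ReplayEngine { sequence: 0, open_orders: HashSet::new() }
    }

    // Stand-in for TradingEngine::process_event
    fn apply(&mut self, event: &Event) {
        self.sequence += 1;
        match event {
            Event::New(id) => { self.open_orders.insert(*id); }
            Event::Cancel(id) => { self.open_orders.remove(id); }
        }
    }
}

// Replay every logged (sequence, event) pair in order
fn replay(log: &[(u64, Event)]) -> ReplayEngine {
    let mut engine = ReplayEngine::new();
    for (seq, event) in log {
        engine.apply(event);
        assert_eq!(engine.sequence, *seq, "sequence gap in event log");
    }
    engine
}
```

Because the engine is single-threaded and deterministic, the replayed state is guaranteed to be byte-for-byte identical to the state at the time of the crash.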

Key optimizations for sub-50μs latency:

// 1. Pre-allocate memory (no allocations in the hot path)
let mut report_pool = ObjectPool::new(10000, ExecutionReport::default);

// 2. Small, fixed-layout messages. Fields are ordered largest-first
//    so #[repr(C)] adds no interior padding.
#[repr(C)]
struct OrderMessage {
    order_id: u64,
    price: u64,
    quantity: u32,
    symbol: [u8; 8],
    msg_type: u8,
    side: u8,
} // 32 bytes — fits in a single cache line

// 3. Busy-wait spinlock instead of a mutex for shared state
struct SpinLock<T> {
    inner: UnsafeCell<T>,
    locked: AtomicBool,
}

// 4. Batch network I/O (method on the engine)
fn process_batch(&mut self, messages: &[OrderMessage]) -> Vec<ExecutionReport> {
    let mut all_reports = Vec::with_capacity(messages.len() * 2);
    for msg in messages {
        all_reports.extend(self.process_message(msg));
    }
    all_reports
}
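The spinlock in item 3 is only a declaration. A minimal working version might look like the sketch below; a production lock would add an RAII guard and backoff, but the core is a compare-exchange loop with a spin hint.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct SpinLock<T> {
    locked: AtomicBool,
    inner: UnsafeCell<T>,
}

// Safe to share across threads because access is serialized by `locked`
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    fn new(value: T) -> Self {
        SpinLock { locked: AtomicBool::new(false), inner: UnsafeCell::new(value) }
    }

    fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Busy-wait until we flip `locked` from false to true
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop(); // tell the CPU we're spinning
        }
        // Exclusive access: run the critical section, then release
        let result = f(unsafe { &mut *self.inner.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}
```

Spinning only beats a mutex when critical sections are a few nanoseconds long and contention is low; under contention it burns the very CPU cycles a mutex would yield back.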

Lessons learned:

  1. Single-threaded is simpler and fast enough — don’t add concurrency until you’ve measured
  2. Deterministic execution is non-negotiable — replay must produce identical results
  3. Pre-allocate everything — heap allocation in the hot path adds unpredictable latency
  4. Binary protocols over JSON — serialization speed matters at microsecond scale
  5. Test with production traffic replay — synthetic tests don’t capture real-world patterns

Questions about trading systems? Find me on GitHub or Twitter.
