Choosing an MPSC Queue for a Centralized Trading Engine

I started this trading engine with three constraints in mind:

A centralized engine
Supports multiple strategies
Low latency

Benchmark code repo: github.com/junhjang/nonblocking-dispatch-bench

Strategies run as separate processes so they stay isolated and observable. Each venue gateway also runs on its own thread, so a slow or broken venue does not block other venues.

The Order Engine, simplified, has two layers: Orchestrator and Gateway. The real engine has more layers, but by role they group into these two. A gateway is created per (Venue, Credential). The Orchestrator is the central event loop every order passes through. Risk check, position tracking, and kill switch all sit in one place, and observability also ends there.

Strategies are isolated as processes. The Orchestrator and Gateways are threads in the same process, pinned to cores.

Following the pipeline, there are four hand-off points:

submit_q (strategy → orchestrator): N → 1, MPSC
gateway_inbox (orchestrator → gateway X): 1 → 1, SPSC. One per gateway (M total)
completion_q (gateway → orchestrator): M → 1, MPSC
return_q (orchestrator → strategy Y): 1 → 1, SPSC. One per strategy (N total)

Contention happens at #1 and #3, the two MPSC spots. These are what this benchmark measures.

1. Non-blocking candidates

I picked 4 candidates. They are the common choices for MPSC hand-off in Rust.

Candidate	Structure	lock-free?	bounded?
① std::sync::mpsc	linked list, MPSC	both	unbounded
② crossbeam-channel	array ring, MPMC	both	bounded
③ ArrayQueue	array ring, MPMC	YES	bounded
④ per-producer SPSC + mux	N SPSC rings	YES	bounded

Terms you will see often

lock-free: many threads can touch the same data structure at once without locks, safely. Done with atomic ops.
atomic op: an op that cannot be cut in the middle. The CPU handles it in one shot.
CAS (Compare-And-Swap): an atomic op that says "if memory holds X, change it to Y". If many threads try at once, only one wins; the rest see the failure.
park: the OS putting a thread to sleep. The thread uses no CPU time and waits to be woken up.
ring: a fixed-size array that wraps back to the start when it reaches the end. Marked as ↺ in the tables and figures.
both (in the table): supports a lock-free call (fails fast) and a blocking call (parks until a slot opens). YES means lock-free only.

① std::sync::mpsc

linked nodes (heap-allocated), unbounded

→

●

→

●

→

●

→

Pros: Rust standard library (no external crate), unbounded (push is never refused)
Cons (latency): allocates per block (not per push, but still a cost); pop chases pointers across the heap and hits cache misses

P calls push

↓

alloc new node from heap (per-block, amortized)

↓

atomic link to list tail (CAS)

② crossbeam-channel

bounded ring inside channel wrapper

→

channel wrapper

●

↺

+ park / select / timeout

→

Pros: supports blocking, select, and timeout. The inner ring is the same algorithm family as ③
Cons (latency): this bench uses only non-blocking push, but every call still goes through the wrapper machinery (notify, etc). You pay for features you do not use

P calls push

↓

CAS on ring tail (claim slot + write value)

↓

wrapper notify machinery (check sleeping receivers, unused but always traversed)

↓

return Ok or Err(Full)

③ crossbeam_queue::ArrayQueue

bare bounded ring (same shape, no wrapper)

→

●

↺

CAS only, no park

→

Pros (latency): one call = one CAS. No wrapper, so push/pop cost is minimal
Cons: no blocking call. The caller has to handle backpressure

P calls push

↓

CAS on ring tail (claim slot + write value)

↓

return Ok or Err(Full)

④ per-producer SPSC + mux

N bounded SPSC rings, consumer scans all

→

ring 1

→

ring 2

→

ring 3

→

C scans N

Pros (latency): producers never touch the same atomic, so push contention is zero and latency is stable
Cons (latency): the consumer has to scan N queues each loop, so pop latency grows with N

each P does atomic store to own ring tail (no CAS needed, SPSC)

↓

isolated from other producers' atomics, so push contention is zero

↓

consumer peeks N rings sequentially (round-robin)

2. Experiment setup

The bench mimics a real trading setup but does not talk to a real exchange. Each venue is abstracted as a single delay D. When an order arrives at a gateway, it is held for D ~ Uniform(5ms, 500ms) and then returned. This rolls the network and matching-engine round trip into one distribution.

To compare all 4 methods on the same input, I pre-sampled (per-strategy fire times + per-venue delays) once at the start and replayed it 4 times. Same effect as fixing the seed.

Setup: N (strategy) = 8, M (gateway) = 4, queue capacity = 8192, orders / run = 4M (= N × 500K), fire window = 10 s, D ~ U(5ms, 500ms), W sweep = 0 / 500 / 1000 / 1500 ns, repeats R = 5.

pre-sampled schedule

[ (fire_ts, strategy, venue, D), ... ]

↓ replayed identically

run 1
std::sync::mpsc

run 2
crossbeam-channel

run 3
ArrayQueue

run 4
SPSC + mux

repeated R times (R = 5), reported as median

All three thread types run sync code, busy-polling. The structure is almost the same: pop from input queue, do work, push to output queue.

Gateway has one difference. When an order arrives, it sets release_at = now + D and puts it in a min-heap. Each loop, if the heap top is at or before now, it pops and forwards.

# Strategy thread (one per strategy)
for (fire_ts, dest_venue) in pre_sampled_schedule:
    while now() < fire_ts:
        spin_hint()
    order = build(fire_ts, dest_venue)
    submit_q.try_push(order)        # MPSC: contended hand-off (benchmark target)

# Orchestrator thread (single)
loop:
    # consume from strategies
    if order := submit_q.try_pop():
        simulate_work(orch_work)                          # CPU work per order (see section 5)
        gateway_inbox[order.venue].try_push(order)        # SPSC: routed fan-out

    # consume completions from gateways
    if completion := completion_q.try_pop():              # MPSC: contended hand-off (benchmark target)
        return_q[completion.owner].try_push(completion)   # SPSC

# Gateway thread (one per venue)
heap = MinHeap()       # ordered by release_at
loop:
    # accept new orders
    while order := gateway_inbox.try_pop():
        order.release_at = now() + D
        heap.push(order)                                  # O(log N)

    # release matured orders
    while not heap.empty() and heap.peek().release_at <= now():   # O(1) peek
        ready = heap.pop()                                # O(log N)
        completion_q.try_push(ready)                      # MPSC: benchmark target

3. Latency measurement

Each order carries 8 timestamps, splitting the round trip into 7 segments.

t1strategy_send

submit_latencyMPSC bench

t2orch_recv_submit

orch_pre_dispatchinternal

t3orch_dispatch

dispatch_latencyfixed SPSC

t4gateway_recv

venue_servicevenue D

t5gateway_done

completion_latencyMPSC bench

t6orch_recv_completion

orch_post_processinternal

t7orch_return

return_latencyfixed SPSC

t8strategy_recv

round_trip = t8 − t1

rt-venue = round_trip − venue_service = (t8 − t1) − (t5 − t4)

4. Baseline results (orch_work = 0)

Start with the baseline where the orchestrator does no work (W = 0). Unit is µs.

Candidate	p50	p99	p99.9
① std::sync::mpsc	1.2	2.4	4.1
② crossbeam-channel	1.3	2.3	3.7
③ ArrayQueue	1.1	1.8	3.1
④ SPSC + mux	1.1	2.1	3.3

ArrayQueue is the lowest, but all 4 sit around 2µs at p99. Practically tied.

The reason: the orchestrator is idle. It does almost nothing per order (pop → push). So queue depth stays near 0, and every measurement is really “hand-off on an empty queue”. The real difference between the 4 methods shows up when queue depth builds up.

5. orch_work sweep results

A real orchestrator does risk check, routing, and book updates per order. To mimic that cost, I added orch_work (W) and swept it over 0 / 500 / 1000 / 1500 ns. Look at rt-venue p99 by W (unit µs). The table below shows raw rt-venue p99; the chart shows rt-venue p99 − W.

orch_work	① std::mpsc	② crossbeam	③ ArrayQueue	④ SPSC + mux
0 ns	2.4	2.3	1.8	2.1
500 ns	6.1	6.1	5.1	6.0
1000 ns	19.5	20.5	17.0	19.8
1500 ns	94.9	108.7	78.2	89.2

Why look at rt-venue − W? By definition, rt-venue already includes W (the work the orchestrator did), so subtracting W leaves only hand-off + queueing cost.

As W grows, the lines split. At W = 1500, the gap between ArrayQueue (~77 µs) and crossbeam (~107 µs) is around 30 µs.

ArrayQueue is the lowest at every W. No allocation (vs ①), no wrapper (vs ②), and the consumer reads only one ring (vs ④). When load rises, those small differences get amplified by queueing.

The tail explosion near W = 1500 is the M/M/1 knee: the orchestrator’s service time gets close to the per-order arrival interval. ArrayQueue has a lower per-op cost, so it crosses the knee later.

Run-to-run p99 spread is ~7% across 5 repeats, and both contended MPSC sites (submit_q, completion_q) returned Full = 0 up to W = 1500. In-flight count matched Little’s law, so no run was overloaded. The ranking is not luck, not backpressure noise.

6. Conclusion

On AWS c7g.4xlarge (16 vCPU Graviton3, Linux, threads pinned to cores), comparing the 4 methods showed: as orchestrator service time grows, queue depth at submit_q rises, and ArrayQueue’s lower per-op cost compounds.

Based on this, I built the OE Engine’s busy-polled contended hot path (submit_q, completion_q) with ArrayQueue.

1. Non-blocking candidates#

① std::sync::mpsc#

② crossbeam-channel#

③ crossbeam_queue::ArrayQueue#

④ per-producer SPSC + mux#

2. Experiment setup#

3. Latency measurement#

4. Baseline results (orch_work = 0)#

5. orch_work sweep results#

6. Conclusion#

1. Non-blocking candidates

① std::sync::mpsc

② crossbeam-channel

③ crossbeam_queue::ArrayQueue

④ per-producer SPSC + mux

2. Experiment setup

3. Latency measurement

4. Baseline results (orch_work = 0)

5. orch_work sweep results

6. Conclusion