I started this trading engine with three constraints in mind:

  • A centralized engine
  • Supports multiple strategies
  • Low latency
Benchmark code repo: github.com/junhjang/nonblocking-dispatch-bench

Strategies run as separate processes so they stay isolated and observable. Each venue gateway also runs on its own thread, so a slow or broken venue does not block other venues.

strategy× Norchestrator× 1gateway× Msubmit_q · N→1gateway_inbox · 1→Mcompletion_q · M→1return_q · 1→Nt1t2t3t4t5t6t7t8

The Order Engine, simplified, has two layers: Orchestrator and Gateway. The real engine has more layers, but by role they group into these two. A gateway is created per (Venue, Credential). The Orchestrator is the central event loop every order passes through. Risk check, position tracking, and kill switch all sit in one place, and observability also ends there.

Strategies are isolated as processes. The Orchestrator and Gateways are threads in the same process, pinned to cores.

Following the pipeline, there are four hand-off points:

  • submit_q (strategy → orchestrator): N → 1, MPSC
  • gateway_inbox (orchestrator → gateway X): 1 → 1, SPSC. One per gateway (M total)
  • completion_q (gateway → orchestrator): M → 1, MPSC
  • return_q (orchestrator → strategy Y): 1 → 1, SPSC. One per strategy (N total)

Contention happens at #1 and #3, the two MPSC spots. These are what this benchmark measures.

1. Non-blocking candidates

I picked 4 candidates. They are the common choices for MPSC hand-off in Rust.

CandidateStructurelock-free?bounded?
① std::sync::mpsclinked list, MPSCbothunbounded
② crossbeam-channelarray ring, MPMCbothbounded
③ ArrayQueuearray ring, MPMCYESbounded
④ per-producer SPSC + muxN SPSC ringsYESbounded
Terms you will see often
  • lock-free: many threads can touch the same data structure at once without locks, safely. Done with atomic ops.
  • atomic op: an op that cannot be cut in the middle. The CPU handles it in one shot.
  • CAS (Compare-And-Swap): an atomic op that says "if memory holds X, change it to Y". If many threads try at once, only one wins; the rest see the failure.
  • park: the OS putting a thread to sleep. The thread uses no CPU time and waits to be woken up.
  • ring: a fixed-size array that wraps back to the start when it reaches the end. Marked as in the tables and figures.
  • both (in the table): supports a lock-free call (fails fast) and a blocking call (parks until a slot opens). YES means lock-free only.

① std::sync::mpsc

linked nodes (heap-allocated), unbounded
P1
P2
P3
C
  • Pros: Rust standard library (no external crate), unbounded (push is never refused)
  • Cons (latency): allocates per block (not per push, but still a cost); pop chases pointers across the heap and hits cache misses
1
P calls push
2
alloc new node from heap (per-block, amortized)
3
atomic link to list tail (CAS)

② crossbeam-channel

bounded ring inside channel wrapper
P1
P2
P3
channel wrapper
·
·
·
+ park / select / timeout
C
  • Pros: supports blocking, select, and timeout. The inner ring is the same algorithm family as ③
  • Cons (latency): this bench uses only non-blocking push, but every call still goes through the wrapper machinery (notify, etc). You pay for features you do not use
1
P calls push
2
CAS on ring tail (claim slot + write value)
3
wrapper notify machinery (check sleeping receivers, unused but always traversed)
4
return Ok or Err(Full)

③ crossbeam_queue::ArrayQueue

bare bounded ring (same shape, no wrapper)
P1
P2
P3
·
·
·
CAS only, no park
C
  • Pros (latency): one call = one CAS. No wrapper, so push/pop cost is minimal
  • Cons: no blocking call. The caller has to handle backpressure
1
P calls push
2
CAS on ring tail (claim slot + write value)
3
return Ok or Err(Full)

④ per-producer SPSC + mux

N bounded SPSC rings, consumer scans all
P1
ring 1
P2
ring 2
P3
ring 3
C scans N
  • Pros (latency): producers never touch the same atomic, so push contention is zero and latency is stable
  • Cons (latency): the consumer has to scan N queues each loop, so pop latency grows with N
1
each P does atomic store to own ring tail (no CAS needed, SPSC)
2
isolated from other producers' atomics, so push contention is zero
3
consumer peeks N rings sequentially (round-robin)

2. Experiment setup

The bench mimics a real trading setup but does not talk to a real exchange. Each venue is abstracted as a single delay D. When an order arrives at a gateway, it is held for D ~ Uniform(5ms, 500ms) and then returned. This rolls the network and matching-engine round trip into one distribution.

To compare all 4 methods on the same input, I pre-sampled (per-strategy fire times + per-venue delays) once at the start and replayed it 4 times. Same effect as fixing the seed.

Setup: N (strategy) = 8, M (gateway) = 4, queue capacity = 8192, orders / run = 4M (= N × 500K), fire window = 10 s, D ~ U(5ms, 500ms), W sweep = 0 / 500 / 1000 / 1500 ns, repeats R = 5.

pre-sampled schedule
[ (fire_ts, strategy, venue, D), ... ]
↓ replayed identically
run 1
std::sync::mpsc
run 2
crossbeam-channel
run 3
ArrayQueue
run 4
SPSC + mux
repeated R times (R = 5), reported as median

All three thread types run sync code, busy-polling. The structure is almost the same: pop from input queue, do work, push to output queue.

Gateway has one difference. When an order arrives, it sets release_at = now + D and puts it in a min-heap. Each loop, if the heap top is at or before now, it pops and forwards.

# Strategy thread (one per strategy)
for (fire_ts, dest_venue) in pre_sampled_schedule:
    while now() < fire_ts:
        spin_hint()
    order = build(fire_ts, dest_venue)
    submit_q.try_push(order)        # MPSC: contended hand-off (benchmark target)

# Orchestrator thread (single)
loop:
    # consume from strategies
    if order := submit_q.try_pop():
        simulate_work(orch_work)                          # CPU work per order (see section 5)
        gateway_inbox[order.venue].try_push(order)        # SPSC: routed fan-out

    # consume completions from gateways
    if completion := completion_q.try_pop():              # MPSC: contended hand-off (benchmark target)
        return_q[completion.owner].try_push(completion)   # SPSC

# Gateway thread (one per venue)
heap = MinHeap()       # ordered by release_at
loop:
    # accept new orders
    while order := gateway_inbox.try_pop():
        order.release_at = now() + D
        heap.push(order)                                  # O(log N)

    # release matured orders
    while not heap.empty() and heap.peek().release_at <= now():   # O(1) peek
        ready = heap.pop()                                # O(log N)
        completion_q.try_push(ready)                      # MPSC: benchmark target

3. Latency measurement

Each order carries 8 timestamps, splitting the round trip into 7 segments.

t1strategy_send
submit_latencyMPSC bench
t2orch_recv_submit
orch_pre_dispatchinternal
t3orch_dispatch
dispatch_latencyfixed SPSC
t4gateway_recv
venue_servicevenue D
t5gateway_done
completion_latencyMPSC bench
t6orch_recv_completion
orch_post_processinternal
t7orch_return
return_latencyfixed SPSC
t8strategy_recv
round_trip = t8 − t1
rt-venue = round_trip − venue_service = (t8 − t1) − (t5 − t4)

4. Baseline results (orch_work = 0)

Start with the baseline where the orchestrator does no work (W = 0). Unit is µs.

Candidatep50p99p99.9
① std::sync::mpsc1.22.44.1
② crossbeam-channel1.32.33.7
ArrayQueue1.11.83.1
④ SPSC + mux1.12.13.3

ArrayQueue is the lowest, but all 4 sit around 2µs at p99. Practically tied.

The reason: the orchestrator is idle. It does almost nothing per order (pop → push). So queue depth stays near 0, and every measurement is really “hand-off on an empty queue”. The real difference between the 4 methods shows up when queue depth builds up.

5. orch_work sweep results

A real orchestrator does risk check, routing, and book updates per order. To mimic that cost, I added orch_work (W) and swept it over 0 / 500 / 1000 / 1500 ns. Look at rt-venue p99 by W (unit µs). The table below shows raw rt-venue p99; the chart shows rt-venue p99 − W.

orch_work① std::mpsc② crossbeam③ ArrayQueue④ SPSC + mux
0 ns2.42.31.82.1
500 ns6.16.15.16.0
1000 ns19.520.517.019.8
1500 ns94.9108.778.289.2

Why look at rt-venue − W? By definition, rt-venue already includes W (the work the orchestrator did), so subtracting W leaves only hand-off + queueing cost.

rt-venue p99 − W (handoff + queueing) vs orch_workstd::sync::mpsccrossbeam-channelArrayQueueSPSC + mux020406080100120rt-venue p99 − W (µs)050010001500orch_work W (ns)

As W grows, the lines split. At W = 1500, the gap between ArrayQueue (~77 µs) and crossbeam (~107 µs) is around 30 µs.

ArrayQueue is the lowest at every W. No allocation (vs ①), no wrapper (vs ②), and the consumer reads only one ring (vs ④). When load rises, those small differences get amplified by queueing.

The tail explosion near W = 1500 is the M/M/1 knee: the orchestrator’s service time gets close to the per-order arrival interval. ArrayQueue has a lower per-op cost, so it crosses the knee later.

Run-to-run p99 spread is ~7% across 5 repeats, and both contended MPSC sites (submit_q, completion_q) returned Full = 0 up to W = 1500. In-flight count matched Little’s law, so no run was overloaded. The ranking is not luck, not backpressure noise.

6. Conclusion

On AWS c7g.4xlarge (16 vCPU Graviton3, Linux, threads pinned to cores), comparing the 4 methods showed: as orchestrator service time grows, queue depth at submit_q rises, and ArrayQueue’s lower per-op cost compounds.

Based on this, I built the OE Engine’s busy-polled contended hot path (submit_q, completion_q) with ArrayQueue.