I started this trading engine with three constraints in mind:
- A centralized engine
- Supports multiple strategies
- Low latency
Strategies run as separate processes so they stay isolated and observable. Each venue gateway also runs on its own thread, so a slow or broken venue does not block other venues.
The Order Engine, simplified, has two layers: Orchestrator and Gateway. The real engine has more layers, but by role they group into these two. A gateway is created per (Venue, Credential). The Orchestrator is the central event loop every order passes through. Risk check, position tracking, and kill switch all sit in one place, and observability also ends there.
Strategies are isolated as processes. The Orchestrator and Gateways are threads in the same process, pinned to cores.
Following the pipeline, there are four hand-off points:
submit_q(strategy → orchestrator): N → 1, MPSCgateway_inbox(orchestrator → gateway X): 1 → 1, SPSC. One per gateway (M total)completion_q(gateway → orchestrator): M → 1, MPSCreturn_q(orchestrator → strategy Y): 1 → 1, SPSC. One per strategy (N total)
Contention happens at #1 and #3, the two MPSC spots. These are what this benchmark measures.
1. Non-blocking candidates
I picked 4 candidates. They are the common choices for MPSC hand-off in Rust.
| Candidate | Structure | lock-free? | bounded? |
|---|---|---|---|
| ① std::sync::mpsc | linked list, MPSC | both | unbounded |
| ② crossbeam-channel | array ring, MPMC | both | bounded |
| ③ ArrayQueue | array ring, MPMC | YES | bounded |
| ④ per-producer SPSC + mux | N SPSC rings | YES | bounded |
- lock-free: many threads can touch the same data structure at once without locks, safely. Done with atomic ops.
- atomic op: an op that cannot be cut in the middle. The CPU handles it in one shot.
- CAS (Compare-And-Swap): an atomic op that says "if memory holds X, change it to Y". If many threads try at once, only one wins; the rest see the failure.
- park: the OS putting a thread to sleep. The thread uses no CPU time and waits to be woken up.
- ring: a fixed-size array that wraps back to the start when it reaches the end. Marked as
↺in the tables and figures. - both (in the table): supports a lock-free call (fails fast) and a blocking call (parks until a slot opens).
YESmeans lock-free only.
① std::sync::mpsc
- Pros: Rust standard library (no external crate), unbounded (push is never refused)
- Cons (latency): allocates per block (not per push, but still a cost); pop chases pointers across the heap and hits cache misses
push② crossbeam-channel
- Pros: supports blocking, select, and timeout. The inner ring is the same algorithm family as ③
- Cons (latency): this bench uses only non-blocking push, but every call still goes through the wrapper machinery (notify, etc). You pay for features you do not use
pushErr(Full)③ crossbeam_queue::ArrayQueue
- Pros (latency): one call = one CAS. No wrapper, so push/pop cost is minimal
- Cons: no blocking call. The caller has to handle backpressure
pushErr(Full)④ per-producer SPSC + mux
- Pros (latency): producers never touch the same atomic, so push contention is zero and latency is stable
- Cons (latency): the consumer has to scan N queues each loop, so pop latency grows with N
2. Experiment setup
The bench mimics a real trading setup but does not talk to a real exchange. Each venue is abstracted as a single delay D. When an order arrives at a gateway, it is held for D ~ Uniform(5ms, 500ms) and then returned. This rolls the network and matching-engine round trip into one distribution.
To compare all 4 methods on the same input, I pre-sampled (per-strategy fire times + per-venue delays) once at the start and replayed it 4 times. Same effect as fixing the seed.
Setup: N (strategy) = 8, M (gateway) = 4, queue capacity = 8192, orders / run = 4M (= N × 500K), fire window = 10 s, D ~ U(5ms, 500ms), W sweep = 0 / 500 / 1000 / 1500 ns, repeats R = 5.
std::sync::mpsc
crossbeam-channel
ArrayQueue
SPSC + mux
All three thread types run sync code, busy-polling. The structure is almost the same: pop from input queue, do work, push to output queue.
Gateway has one difference. When an order arrives, it sets release_at = now + D and puts it in a min-heap. Each loop, if the heap top is at or before now, it pops and forwards.
# Strategy thread (one per strategy)
for (fire_ts, dest_venue) in pre_sampled_schedule:
while now() < fire_ts:
spin_hint()
order = build(fire_ts, dest_venue)
submit_q.try_push(order) # MPSC: contended hand-off (benchmark target)
# Orchestrator thread (single)
loop:
# consume from strategies
if order := submit_q.try_pop():
simulate_work(orch_work) # CPU work per order (see section 5)
gateway_inbox[order.venue].try_push(order) # SPSC: routed fan-out
# consume completions from gateways
if completion := completion_q.try_pop(): # MPSC: contended hand-off (benchmark target)
return_q[completion.owner].try_push(completion) # SPSC
# Gateway thread (one per venue)
heap = MinHeap() # ordered by release_at
loop:
# accept new orders
while order := gateway_inbox.try_pop():
order.release_at = now() + D
heap.push(order) # O(log N)
# release matured orders
while not heap.empty() and heap.peek().release_at <= now(): # O(1) peek
ready = heap.pop() # O(log N)
completion_q.try_push(ready) # MPSC: benchmark target
3. Latency measurement
Each order carries 8 timestamps, splitting the round trip into 7 segments.
4. Baseline results (orch_work = 0)
Start with the baseline where the orchestrator does no work (W = 0). Unit is µs.
| Candidate | p50 | p99 | p99.9 |
|---|---|---|---|
| ① std::sync::mpsc | 1.2 | 2.4 | 4.1 |
| ② crossbeam-channel | 1.3 | 2.3 | 3.7 |
| ③ ArrayQueue | 1.1 | 1.8 | 3.1 |
| ④ SPSC + mux | 1.1 | 2.1 | 3.3 |
ArrayQueue is the lowest, but all 4 sit around 2µs at p99. Practically tied.
The reason: the orchestrator is idle. It does almost nothing per order (pop → push). So queue depth stays near 0, and every measurement is really “hand-off on an empty queue”. The real difference between the 4 methods shows up when queue depth builds up.
5. orch_work sweep results
A real orchestrator does risk check, routing, and book updates per order. To mimic that cost, I added orch_work (W) and swept it over 0 / 500 / 1000 / 1500 ns. Look at rt-venue p99 by W (unit µs). The table below shows raw rt-venue p99; the chart shows rt-venue p99 − W.
| orch_work | ① std::mpsc | ② crossbeam | ③ ArrayQueue | ④ SPSC + mux |
|---|---|---|---|---|
| 0 ns | 2.4 | 2.3 | 1.8 | 2.1 |
| 500 ns | 6.1 | 6.1 | 5.1 | 6.0 |
| 1000 ns | 19.5 | 20.5 | 17.0 | 19.8 |
| 1500 ns | 94.9 | 108.7 | 78.2 | 89.2 |
Why look at rt-venue − W? By definition, rt-venue already includes W (the work the orchestrator did), so subtracting W leaves only hand-off + queueing cost.
As W grows, the lines split. At W = 1500, the gap between ArrayQueue (~77 µs) and crossbeam (~107 µs) is around 30 µs.
ArrayQueue is the lowest at every W. No allocation (vs ①), no wrapper (vs ②), and the consumer reads only one ring (vs ④). When load rises, those small differences get amplified by queueing.
The tail explosion near W = 1500 is the M/M/1 knee: the orchestrator’s service time gets close to the per-order arrival interval. ArrayQueue has a lower per-op cost, so it crosses the knee later.
Run-to-run p99 spread is ~7% across 5 repeats, and both contended MPSC sites (submit_q, completion_q) returned Full = 0 up to W = 1500. In-flight count matched Little’s law, so no run was overloaded. The ranking is not luck, not backpressure noise.
6. Conclusion
On AWS c7g.4xlarge (16 vCPU Graviton3, Linux, threads pinned to cores), comparing the 4 methods showed: as orchestrator service time grows, queue depth at submit_q rises, and ArrayQueue’s lower per-op cost compounds.
Based on this, I built the OE Engine’s busy-polled contended hot path (submit_q, completion_q) with ArrayQueue.