Real-time card classification pipeline

Rust + ONNX Runtime + CoreML/ANE — batched inference and stateful gatekeeper for Clash Royale hand tracking

Source: cr-perception/card-classifier/rust-inference-pipeline/src

A high-performance video inference pipeline that classifies the 5 cards in a Clash Royale player's hand in real-time. Originally prototyped in Python (PyTorch + OpenCV), then rewritten from scratch in Rust to maximize throughput through systems-level optimizations: batched ONNX inference, parallel preprocessing via Rayon, and hardware-accelerated execution on Apple's Neural Engine via CoreML.

The pipeline ships as two binaries with distinct architectures. Batch mode (batch.rs) runs all 5 card slots every frame — maximum throughput, fixed tensor shape, ideal for offline processing. Gatekeeper mode (gatekeeper.rs) adds a per-slot state machine that monitors pixel-level changes and skips inference on unchanged cards — designed for real-time deployment where compute is constrained or inference calls have cost.

Architecture: two modes

Both modes share the same core stages — decode, preprocess, ONNX batched inference, overlay — but diverge on what gets sent to the model. Batch mode always sends all 5 slots. Gatekeeper mode filters through a per-slot state machine first, sending only the slots whose cards have visually changed.

[Figure: side-by-side pipeline diagrams. Left, batch mode (batch.rs): OpenCV VideoCapture decode (BGR Mat → RgbImage, 5 fixed ROIs, 8,702 frames) → Rayon parallel preprocessing of all 5 slots → fixed [5, 3, 224, 224] batched ONNX Runtime forward on the CoreML EP / Apple Neural Engine → overlays + VideoWriter (all overlays white; 22.3 FPS, 44.9 ms/frame, 43,510 inferences, 0 skipped). Right, gatekeeper mode (gatekeeper.rs): the same stages behind a per-slot state machine (8×8 RGB fingerprint, 192 floats, <1 μs compare; unlock at >6% pixel delta, lock at ≥90% softmax) that forwards only unlocked slots as a variable [N, 3, 224, 224] batch (green = locked/cached, white = unlocked/inferred; 22,206 run, 21,304 skipped = 49%). Shared model: MobileNetV3-Small, ONNX (~2.1 MB), 97.89% val accuracy, 15 epochs.]
Figure 1 — Two pipeline architectures side by side. Left: batch mode sends all 5 slots every frame with a fixed tensor shape. Right: gatekeeper mode filters through a per-slot state machine, sending only changed slots with a variable batch size. Both share the same model, preprocessing, and output stages.

Key performance metrics (measured)

- 22.3 FPS: batch-mode throughput
- 49%: inferences skipped (gatekeeper mode)
- 97.89%: validation accuracy
- ~9 ms: per-card latency

All measurements on Apple M-series silicon with CoreML/ANE acceleration, processing 8,702 frames of recorded gameplay.

| Metric | Batch Mode | Gatekeeper Mode |
| --- | --- | --- |
| Wall time (inference) | 6 min 31 s | 12 min 27 s |
| ms / frame | 44.9 ms | 85.8 ms |
| Throughput | 22.3 FPS | 11.6 FPS |
| Inferences run | 43,510 | 22,206 |
| Inferences skipped | 0 | 21,304 (49%) |

Why the gatekeeper is slower for offline video

CoreML's execution provider compiles optimized execution plans for fixed tensor shapes. In batch mode, every frame sends a consistent [5, 3, 224, 224] tensor that CoreML compiles once and runs on the ANE. With the gatekeeper, batch sizes vary dynamically ([1,…], [2,…], [3,…], etc.), forcing CoreML to recompile or fall back to CPU for unfamiliar shapes. The dynamic dispatch overhead more than cancels the 49% reduction in inference calls.

Where the gatekeeper wins: in a live real-time stream — where you're GPU-constrained, paying per API call, or sharing compute with other workloads — skipping 49% of inference calls is a meaningful reduction. The design targets real-time deployment, not batch video processing.

Demo recordings (live results)

Each mode was run on 3 different Clash Royale matches. The overlay border colors show the pipeline state in real time:

White border — UNLOCKED: inference running this frame
Green border — LOCKED: inference skipped, cached label reused

In batch mode, all overlays are always white (every slot runs inference every frame). In gatekeeper mode, slots transition between white and green as cards change.
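The border rule is a tiny lookup. A sketch in BGR channel order (matching OpenCV Mats), not the pipeline's actual drawing code:

```rust
/// Overlay border color per slot, in OpenCV's BGR channel order.
/// `locked == true` means inference was skipped and the cached label reused.
fn border_color(locked: bool) -> (u8, u8, u8) {
    if locked {
        (0, 255, 0) // green: LOCKED, cached label reused
    } else {
        (255, 255, 255) // white: UNLOCKED, inference ran this frame
    }
}

fn main() {
    assert_eq!(border_color(true), (0, 255, 0));
    assert_eq!(border_color(false), (255, 255, 255));
}
```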

Batch mode — fixed [5, 3, 224, 224], all slots every frame

Recordings: Batch Track 1, Batch Track 2, Batch Track 3. All overlays white; 22.3 FPS, 5 inferences/frame.

Gatekeeper mode — stateful inference skipping

Recordings: Gatekeeper Track 1, Gatekeeper Track 2, Gatekeeper Track 3. Green = locked (cached); 49% of inferences skipped.

Model performance & training

Architecture: MobileNetV3-Small with a fine-tuned classifier head. Trained for 15 epochs with learning rate step-down at epoch 11. Best validation accuracy: 97.89% (epoch 11).

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
| --- | --- | --- | --- | --- |
| 1 | 0.509 | 90.85% | 0.177 | 96.32% |
| 5 | 0.100 | 97.69% | 0.484 | 94.22% |
| 11 | 0.089 | 97.94% | 0.092 | 97.89% |
| 15 | 0.086 | 98.00% | 0.091 | 97.89% |

The LR step-down at epoch 11 stabilizes validation loss and closes the train-val gap. The model is exported to ONNX format with dynamic batch axes, enabling both single and batched inference from the same file.

Systems optimizations (Python → Rust)

1. Batched inference

The Python prototype ran 5 separate forward passes per frame. The Rust version stacks all preprocessed crops into a single NCHW tensor and executes one batched forward pass, eliminating per-call dispatch overhead, redundant kernel launches, and repeated memory allocation.
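The layout can be sketched with plain `Vec<f32>` buffers; the actual pipeline builds an `ndarray` array and hands it to `ort`, but the contiguous NCHW memory layout is the same:

```rust
const C: usize = 3;
const H: usize = 224;
const W: usize = 224;
const SLOT_LEN: usize = C * H * W; // 150_528 floats per preprocessed crop

/// Stack per-slot CHW buffers into one contiguous [N, 3, 224, 224] batch.
fn assemble_batch(crops: &[Vec<f32>]) -> Vec<f32> {
    let mut batch = Vec::with_capacity(crops.len() * SLOT_LEN);
    for crop in crops {
        assert_eq!(crop.len(), SLOT_LEN, "crops must already be CHW-normalized");
        batch.extend_from_slice(crop); // one memcpy per slot, no transposition
    }
    batch
}

fn main() {
    // Five dummy crops standing in for the preprocessed card slots.
    let crops: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32; SLOT_LEN]).collect();
    let batch = assemble_batch(&crops);
    assert_eq!(batch.len(), 5 * SLOT_LEN); // flattened [5, 3, 224, 224]
}
```

One batched forward over this buffer replaces five dispatches, which is where the per-call overhead disappears.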

2. Parallel preprocessing with Rayon

Crop → resize → normalize is embarrassingly parallel — each operation reads from the shared source frame (read-only) and writes to its own contiguous buffer. Rayon's work-stealing thread pool distributes these across available cores with zero contention.
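The fan-out shape looks roughly like this; `std::thread::scope` stands in for Rayon so the sketch stays dependency-free (the real code uses `par_iter` on a work-stealing pool), and `preprocess_slot` is a stand-in for the actual crop → resize → normalize step:

```rust
/// Stand-in for crop → resize → normalize: reads the shared frame
/// read-only and writes a private buffer, so no synchronization is needed.
fn preprocess_slot(frame: &[u8], _slot: usize) -> Vec<f32> {
    frame.iter().map(|&px| px as f32 / 255.0).collect()
}

/// Fan the five slots out across threads. With Rayon this collapses to
/// `(0..5).into_par_iter().map(|s| preprocess_slot(frame, s)).collect()`.
fn preprocess_all(frame: &[u8]) -> Vec<Vec<f32>> {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..5)
            .map(|slot| s.spawn(move || preprocess_slot(frame, slot)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let frame = vec![128u8; 64]; // dummy frame bytes
    let bufs = preprocess_all(&frame);
    assert_eq!(bufs.len(), 5);
    assert!(bufs.iter().all(|b| b.len() == frame.len()));
}
```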

3. Hardware-accelerated execution (CoreML / ANE)

The ONNX Runtime session is configured with the CoreML Execution Provider, mapping supported operations to Apple's Neural Engine — a dedicated inference accelerator that runs independently of CPU and GPU, enabling ~9 ms per-card classification.

4. Gatekeeper state machine (inference pre-filter)

A per-slot state machine eliminates redundant compute. Each slot is reduced to a compact pixel fingerprint: an 8×8 grid of sampled RGB values (192 floats). Comparing two fingerprints is O(192) and takes under a microsecond, versus ~9 ms for a full inference call. Slots lock when softmax confidence exceeds 90% and unlock when the pixel delta exceeds 6%.

fingerprint = sample_8×8_grid(card_crop) → 192 floats → MAD comparison in <1 μs
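The compare-and-transition logic can be sketched as follows, using the thresholds stated above (6% unlock delta, 90% lock confidence); function names are illustrative, not the pipeline's:

```rust
const FP_LEN: usize = 8 * 8 * 3; // 192 floats per fingerprint
const UNLOCK_DELTA: f32 = 0.06;  // >6% pixel change unlocks a slot
const LOCK_CONF: f32 = 0.90;     // ≥90% softmax confidence locks it

#[derive(Clone, Copy, Debug, PartialEq)]
enum SlotState { Locked, Unlocked }

/// Mean absolute difference between two fingerprints (values in 0..=1).
fn mad(a: &[f32; FP_LEN], b: &[f32; FP_LEN]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum::<f32>() / FP_LEN as f32
}

/// Pre-inference gate: returns the next state and whether to run the model.
fn gate(state: SlotState, delta: f32) -> (SlotState, bool) {
    match state {
        SlotState::Locked if delta > UNLOCK_DELTA => (SlotState::Unlocked, true),
        SlotState::Locked => (SlotState::Locked, false), // reuse cached label
        SlotState::Unlocked => (SlotState::Unlocked, true),
    }
}

/// Post-inference transition: confident predictions re-lock the slot.
fn after_inference(confidence: f32) -> SlotState {
    if confidence >= LOCK_CONF { SlotState::Locked } else { SlotState::Unlocked }
}

fn main() {
    let a = [0.0f32; FP_LEN];
    let mut b = a;
    b[0] = 0.5;
    assert_eq!(mad(&a, &a), 0.0);
    assert!(mad(&a, &b) > 0.0);
    // A locked slot stays locked below the delta threshold…
    assert_eq!(gate(SlotState::Locked, 0.01), (SlotState::Locked, false));
    // …and unlocks (running inference) when the card visibly changes.
    assert_eq!(gate(SlotState::Locked, 0.10), (SlotState::Unlocked, true));
    assert_eq!(after_inference(0.95), SlotState::Locked);
}
```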

5. Zero-overhead abstractions

Pixel rectangles precomputed once at startup via std::array::from_fn. Batch tensor assembled from contiguous Vec<f32> in CHW layout matching ONNX Runtime's expected memory format — no transposition or copying. Direct BGR→RGB via OpenCV's cvt_color with a single copy to RgbImage.
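The ROI precomputation can be sketched like this; the coordinates are placeholders, not the pipeline's actual slot geometry:

```rust
/// A card-slot rectangle in frame pixel coordinates.
#[derive(Clone, Copy, Debug)]
struct Rect { x: u32, y: u32, w: u32, h: u32 }

// Placeholder geometry: the pipeline's real slot coordinates differ.
const ORIGIN_X: u32 = 40;
const ORIGIN_Y: u32 = 1600;
const SLOT_W: u32 = 130;
const SLOT_H: u32 = 160;

/// Build the five ROIs once at startup; `from_fn` fills a fixed-size
/// stack array, so there is no heap allocation and no per-frame work.
fn card_rois() -> [Rect; 5] {
    std::array::from_fn(|i| Rect {
        x: ORIGIN_X + i as u32 * SLOT_W,
        y: ORIGIN_Y,
        w: SLOT_W,
        h: SLOT_H,
    })
}

fn main() {
    let rois = card_rois();
    assert_eq!(rois.len(), 5);
    assert_eq!(rois[0].x, ORIGIN_X);
    assert_eq!(rois[4].x, ORIGIN_X + 4 * SLOT_W);
    assert!(rois.iter().all(|r| r.h == SLOT_H));
}
```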

6. Model format conversion

A PyTorch checkpoint cannot be loaded directly by this Rust pipeline, so the model is exported to ONNX format (Open Neural Network Exchange), an open standard that enables runtime-agnostic inference. The export preserves dynamic batch axes, allowing the same model file to handle both fixed [5,…] and variable [N,…] batch sizes.

Dependencies & build setup

| Crate | Purpose |
| --- | --- |
| opencv | Video decode/encode, frame drawing |
| ort | ONNX Runtime bindings (with CoreML EP) |
| ndarray | Tensor construction |
| rayon | Data-parallel preprocessing |
| image | Crop and resize operations |
| indicatif | Terminal progress bar |
| serde_json | Class label loading |
| anyhow | Error handling |
# Build both binaries
cargo build --release

# Run batch mode (max throughput, fixed [5,3,224,224])
cargo run --release --bin batch

# Run gatekeeper mode (stateful inference skipping)
cargo run --release --bin gatekeeper

Part of clash-royale-suite — a research-oriented framework combining high-performance systems programming and deep learning for autonomous game intelligence.