Rust + ONNX Runtime + CoreML/ANE — batched inference and stateful gatekeeper for Clash Royale hand tracking
Source: cr-perception/card-classifier/rust-inference-pipeline/src
A high-performance video inference pipeline that classifies the 5 cards in a Clash Royale player's hand in real time. Originally prototyped in Python (PyTorch + OpenCV), then rewritten from scratch in Rust to maximize throughput through systems-level optimizations: batched ONNX inference, parallel preprocessing via Rayon, and hardware-accelerated execution on Apple's Neural Engine via CoreML.
The pipeline ships as two binaries with distinct architectures. Batch mode (batch.rs) runs all 5 card slots every frame — maximum throughput, fixed tensor shape, ideal for offline processing. Gatekeeper mode (gatekeeper.rs) adds a per-slot state machine that monitors pixel-level changes and skips inference on unchanged cards — designed for real-time deployment where compute is constrained or inference calls have cost.
Both modes share the same core stages — decode, preprocess, ONNX batched inference, overlay — but diverge on what gets sent to the model. Batch mode always sends all 5 slots. Gatekeeper mode filters through a per-slot state machine first, sending only the slots whose cards have visually changed.
All measurements on Apple M-series silicon with CoreML/ANE acceleration, processing 8,702 frames of recorded gameplay.
| Metric | Batch Mode | Gatekeeper Mode |
|---|---|---|
| Wall time (inference) | 6 min 31 s | 12 min 27 s |
| ms / frame | 44.9 ms | 85.8 ms |
| Throughput | 22.3 FPS | 11.6 FPS |
| Inferences run | 43,510 | 22,206 |
| Inferences skipped | 0 | 21,304 (49%) |
CoreML's execution provider compiles optimized execution plans for fixed tensor shapes. In batch mode, every frame sends a consistent [5, 3, 224, 224] tensor that CoreML compiles once and runs on the ANE. With the gatekeeper, batch sizes vary dynamically ([1,…], [2,…], [3,…], etc.), forcing CoreML to recompile or fall back to CPU for unfamiliar shapes. The dynamic dispatch overhead more than cancels the 49% reduction in inference calls.
Where the gatekeeper wins: in a live real-time stream — where you're GPU-constrained, paying per API call, or sharing compute with other workloads — skipping 49% of inference calls is a meaningful reduction. The design targets real-time deployment, not batch video processing.
Each mode was run on 3 different Clash Royale matches. The overlay border colors show the pipeline state in real time:
In batch mode, all overlays are always white (every slot runs inference every frame). In gatekeeper mode, slots transition between white and green as cards change.
Architecture: MobileNetV3-Small with a fine-tuned classifier head. Trained for 15 epochs with learning rate step-down at epoch 11. Best validation accuracy: 97.89% (epoch 11).
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.509 | 90.85% | 0.177 | 96.32% |
| 5 | 0.100 | 97.69% | 0.484 | 94.22% |
| 11 | 0.089 | 97.94% | 0.092 | 97.89% |
| 15 | 0.086 | 98.00% | 0.091 | 97.89% |
The LR step-down at epoch 11 stabilizes validation loss and closes the train-val gap. The model is exported to ONNX format with dynamic batch axes, enabling both single and batched inference from the same file.
The Python prototype ran 5 separate forward passes per frame. The Rust version stacks all preprocessed crops into a single NCHW tensor and executes one batched forward pass, eliminating per-call dispatch overhead, redundant kernel launches, and repeated memory allocation.
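The layout math behind that single batched pass can be sketched in a few lines. This is an illustrative pure-std sketch (the actual pipeline builds the tensor with `ndarray` and hands it to ONNX Runtime; `stack_batch` and its constants are hypothetical names, not the pipeline's identifiers):

```rust
const C: usize = 3;
const H: usize = 224;
const W: usize = 224;

/// Stack N preprocessed crops (each already CHW, C*H*W floats) into one
/// contiguous [N, C, H, W] buffer — one memcpy per crop, no transposition.
fn stack_batch(crops: &[Vec<f32>]) -> Vec<f32> {
    let mut batch = Vec::with_capacity(crops.len() * C * H * W);
    for crop in crops {
        assert_eq!(crop.len(), C * H * W);
        batch.extend_from_slice(crop);
    }
    batch
}

fn main() {
    // Five dummy crops, each filled with its slot index.
    let crops: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32; C * H * W]).collect();
    let batch = stack_batch(&crops);
    assert_eq!(batch.len(), 5 * C * H * W);
    // Slot 3's first value (channel 0, pixel 0) lives at offset 3 * C*H*W.
    assert_eq!(batch[3 * C * H * W], 3.0);
    println!("batch of {} floats", batch.len());
}
```

Because each crop is already in CHW order, batching is pure concatenation — the [N, C, H, W] view falls out of the offsets with no per-element shuffling.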
Crop → resize → normalize is embarrassingly parallel — each operation reads from the shared source frame (read-only) and writes to its own contiguous buffer. Rayon's work-stealing thread pool distributes these across available cores with zero contention.
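The structure of that parallelism can be shown with scoped threads. A stand-in sketch: the real pipeline uses Rayon's work-stealing pool, and `preprocess_slot` here is a hypothetical placeholder for crop → resize → normalize; what matters is the borrow pattern — shared read-only frame in, one exclusive output buffer per slot out:

```rust
use std::thread;

// Placeholder for crop + resize + normalize on one slot.
fn preprocess_slot(frame: &[u8], slot: usize) -> Vec<f32> {
    frame.iter().skip(slot).step_by(5).map(|&b| b as f32 / 255.0).collect()
}

fn main() {
    let frame = vec![128u8; 100]; // shared, read-only source frame
    let mut outputs: Vec<Vec<f32>> = vec![Vec::new(); 5];

    thread::scope(|s| {
        // Each thread takes an exclusive &mut to its own output slot,
        // so there is nothing to lock and no contention.
        for (slot, out) in outputs.iter_mut().enumerate() {
            let frame = &frame;
            s.spawn(move || *out = preprocess_slot(frame, slot));
        }
    });

    assert!(outputs.iter().all(|o| !o.is_empty()));
    println!("preprocessed {} slots", outputs.len());
}
```

With Rayon the loop body becomes a `par_iter_mut` over the same slot buffers; the ownership story — shared `&frame`, disjoint `&mut` outputs — is identical.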
The ONNX Runtime session is configured with the CoreML Execution Provider, mapping supported operations to Apple's Neural Engine — a dedicated inference accelerator that runs independently of CPU and GPU, enabling ~9 ms per-card classification.
A per-slot state machine eliminates redundant compute. Each slot is reduced to a compact pixel fingerprint: an 8×8 grid of sampled RGB values (192 floats). Comparing two fingerprints is O(192) and takes under a microsecond, versus the ~9 ms cost of a full inference call. A slot locks when classification confidence exceeds 90% and unlocks when its pixel delta exceeds 6%.
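The change check itself is tiny. A sketch of the idea (illustrative names and tolerance; the source fixes only the 8×8 RGB grid and the 6% unlock threshold):

```rust
const FP_LEN: usize = 8 * 8 * 3; // 192 sampled RGB values per slot

/// Fraction of fingerprint entries that moved by more than `tol`.
/// O(192) per comparison — far cheaper than a ~9 ms inference call.
fn pixel_delta(a: &[f32; FP_LEN], b: &[f32; FP_LEN], tol: f32) -> f32 {
    let changed = a
        .iter()
        .zip(b.iter())
        .filter(|&(x, y)| (x - y).abs() > tol)
        .count();
    changed as f32 / FP_LEN as f32
}

fn main() {
    let old = [0.5f32; FP_LEN];
    let mut new = old;
    // Simulate a card swap that touches a quarter of the grid.
    for v in new.iter_mut().take(48) {
        *v = 0.9;
    }
    let delta = pixel_delta(&old, &new, 0.05);
    assert!(delta > 0.06); // exceeds the 6% unlock threshold → rerun inference
    println!("delta = {:.0}%", delta * 100.0);
}
```

A locked slot stays locked while `pixel_delta` stays at or below 6%; crossing the threshold unlocks it and sends that crop back into the next inference batch.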
- Pixel rectangles are precomputed once at startup via std::array::from_fn.
- The batch tensor is assembled from a contiguous Vec<f32> in CHW layout matching ONNX Runtime's expected memory format, with no transposition or extra copying.
- BGR→RGB conversion uses OpenCV's cvt_color with a single copy to RgbImage.
PyTorch's native checkpoint format can't be consumed by ONNX Runtime, so the model is exported to ONNX (Open Neural Network Exchange), an open standard that enables runtime-agnostic inference. The export preserves dynamic batch axes, allowing the same model file to handle both fixed [5,…] and variable [N,…] batch sizes.
| Crate | Purpose |
|---|---|
| opencv | Video decode/encode, frame drawing |
| ort | ONNX Runtime bindings (with CoreML EP) |
| ndarray | Tensor construction |
| rayon | Data-parallel preprocessing |
| image | Crop and resize operations |
| indicatif | Terminal progress bar |
| serde_json | Class label loading |
| anyhow | Error handling |
```sh
# Build both binaries
cargo build --release

# Run batch mode (max throughput, fixed [5,3,224,224])
cargo run --release --bin batch

# Run gatekeeper mode (stateful inference skipping)
cargo run --release --bin gatekeeper
```
Part of clash-royale-suite — a research-oriented framework combining high-performance systems programming and deep learning for autonomous game intelligence.