Rust + ONNX Runtime + CoreML/ANE — batched inference and stateful gatekeeper for Clash Royale hand tracking
Source: cr-perception/card-classifier/rust-inference-pipeline/src
A high-performance video inference pipeline that classifies the 5 cards in a Clash Royale player's hand in real time. Originally prototyped in Python (PyTorch + OpenCV), then rewritten from scratch in Rust to maximize throughput through systems-level optimizations: batched ONNX inference, parallel preprocessing via Rayon, and hardware-accelerated execution on Apple's Neural Engine via CoreML.
The pipeline ships as two binaries with distinct architectures. Batch mode (batch.rs) runs all 5 card slots every frame — maximum throughput, fixed tensor shape, ideal for offline processing. Gatekeeper mode (gatekeeper.rs) adds a per-slot state machine that monitors pixel-level changes and skips inference on unchanged cards — designed for real-time deployment where compute is constrained or inference calls have cost.
Both modes share the same core stages — decode, preprocess, ONNX batched inference, overlay — but diverge on what gets sent to the model. Batch mode always sends all 5 slots. Gatekeeper mode filters through a per-slot state machine first, sending only the slots whose cards have visually changed.
All measurements on Apple M-series silicon with CoreML/ANE acceleration, processing 8,702 frames of recorded gameplay.
| Metric | Batch Mode | Gatekeeper Mode |
|---|---|---|
| Wall time (inference) | 6 min 31 s | 12 min 27 s |
| ms / frame | 44.9 ms | 85.8 ms |
| Throughput | 22.3 FPS | 11.6 FPS |
| Inferences run | 43,510 | 22,206 |
| Inferences skipped | 0 | 21,304 (49%) |
CoreML's execution provider compiles optimized execution plans for fixed tensor shapes. In batch mode, every frame sends a consistent [5, 3, 224, 224] tensor that CoreML compiles once and runs on the ANE. With the gatekeeper, batch sizes vary dynamically ([1,…], [2,…], [3,…], etc.), forcing CoreML to recompile or fall back to CPU for unfamiliar shapes. The dynamic dispatch overhead more than cancels the 49% reduction in inference calls.
Where the gatekeeper wins: in a live real-time stream — where you're GPU-constrained, paying per API call, or sharing compute with other workloads — skipping 49% of inference calls is a meaningful reduction. The design targets real-time deployment, not batch video processing.
Each mode was run on 3 different Clash Royale matches. The overlay border colors show the pipeline state in real time:
In batch mode, all overlays are always white (every slot runs inference every frame). In gatekeeper mode, slots transition between white and green as cards change.
Architecture: MobileNetV3-Small with a fine-tuned classifier head. Trained for 15 epochs with learning rate step-down at epoch 11. Best validation accuracy: 97.89% (epoch 11).
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.509 | 90.85% | 0.177 | 96.32% |
| 5 | 0.100 | 97.69% | 0.484 | 94.22% |
| 11 | 0.089 | 97.94% | 0.092 | 97.89% |
| 15 | 0.086 | 98.00% | 0.091 | 97.89% |
The LR step-down at epoch 11 stabilizes validation loss and closes the train-val gap. The model is exported to ONNX format with dynamic batch axes, enabling both single and batched inference from the same file.
The Python prototype ran 5 separate forward passes per frame. The Rust version stacks all preprocessed crops into a single NCHW tensor and executes one batched forward pass, eliminating per-call dispatch overhead, redundant kernel launches, and repeated memory allocation.
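The layout math behind that single batched pass can be sketched in a few lines. This is an illustrative pure-std sketch (the actual pipeline builds the tensor with `ndarray` and hands it to ONNX Runtime; `stack_batch` and its constants are hypothetical names, not the pipeline's identifiers):

```rust
const C: usize = 3;
const H: usize = 224;
const W: usize = 224;

/// Stack N preprocessed crops (each already CHW, C*H*W floats) into one
/// contiguous [N, C, H, W] buffer — one memcpy per crop, no transposition.
fn stack_batch(crops: &[Vec<f32>]) -> Vec<f32> {
    let mut batch = Vec::with_capacity(crops.len() * C * H * W);
    for crop in crops {
        assert_eq!(crop.len(), C * H * W);
        batch.extend_from_slice(crop);
    }
    batch
}

fn main() {
    // Five dummy crops, each filled with its slot index.
    let crops: Vec<Vec<f32>> = (0..5).map(|i| vec![i as f32; C * H * W]).collect();
    let batch = stack_batch(&crops);
    assert_eq!(batch.len(), 5 * C * H * W);
    // Slot 3's first value (channel 0, pixel 0) lives at offset 3 * C*H*W.
    assert_eq!(batch[3 * C * H * W], 3.0);
    println!("batch of {} floats", batch.len());
}
```

Because each crop is already in CHW order, batching is pure concatenation — the [N, C, H, W] view falls out of the offsets with no per-element shuffling.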
Crop → resize → normalize is embarrassingly parallel — each operation reads from the shared source frame (read-only) and writes to its own contiguous buffer. Rayon's work-stealing thread pool distributes these across available cores with zero contention.
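The structure of that parallelism can be shown with scoped threads. A stand-in sketch: the real pipeline uses Rayon's work-stealing pool, and `preprocess_slot` here is a hypothetical placeholder for crop → resize → normalize; what matters is the borrow pattern — shared read-only frame in, one exclusive output buffer per slot out:

```rust
use std::thread;

// Placeholder for crop + resize + normalize on one slot.
fn preprocess_slot(frame: &[u8], slot: usize) -> Vec<f32> {
    frame.iter().skip(slot).step_by(5).map(|&b| b as f32 / 255.0).collect()
}

fn main() {
    let frame = vec![128u8; 100]; // shared, read-only source frame
    let mut outputs: Vec<Vec<f32>> = vec![Vec::new(); 5];

    thread::scope(|s| {
        // Each thread takes an exclusive &mut to its own output slot,
        // so there is nothing to lock and no contention.
        for (slot, out) in outputs.iter_mut().enumerate() {
            let frame = &frame;
            s.spawn(move || *out = preprocess_slot(frame, slot));
        }
    });

    assert!(outputs.iter().all(|o| !o.is_empty()));
    println!("preprocessed {} slots", outputs.len());
}
```

With Rayon the loop body becomes a `par_iter_mut` over the same slot buffers; the ownership story — shared `&frame`, disjoint `&mut` outputs — is identical.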
The ONNX Runtime session is configured with the CoreML Execution Provider, mapping supported operations to Apple's Neural Engine — a dedicated inference accelerator that runs independently of CPU and GPU, enabling ~9 ms per-card classification.
A per-slot state machine eliminates redundant compute. Each slot is reduced to a compact pixel fingerprint: an 8×8 grid of sampled RGB values (192 floats). Comparing two fingerprints is O(192) and takes under a microsecond, versus the ~9 ms cost of a full inference call. A slot locks when classification confidence exceeds 90% and unlocks when its pixel delta exceeds 6%.
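The change check itself is tiny. A sketch of the idea (illustrative names and tolerance; the source fixes only the 8×8 RGB grid and the 6% unlock threshold):

```rust
const FP_LEN: usize = 8 * 8 * 3; // 192 sampled RGB values per slot

/// Fraction of fingerprint entries that moved by more than `tol`.
/// O(192) per comparison — far cheaper than a ~9 ms inference call.
fn pixel_delta(a: &[f32; FP_LEN], b: &[f32; FP_LEN], tol: f32) -> f32 {
    let changed = a
        .iter()
        .zip(b.iter())
        .filter(|&(x, y)| (x - y).abs() > tol)
        .count();
    changed as f32 / FP_LEN as f32
}

fn main() {
    let old = [0.5f32; FP_LEN];
    let mut new = old;
    // Simulate a card swap that touches a quarter of the grid.
    for v in new.iter_mut().take(48) {
        *v = 0.9;
    }
    let delta = pixel_delta(&old, &new, 0.05);
    assert!(delta > 0.06); // exceeds the 6% unlock threshold → rerun inference
    println!("delta = {:.0}%", delta * 100.0);
}
```

A locked slot stays locked while `pixel_delta` stays at or below 6%; crossing the threshold unlocks it and sends that crop back into the next inference batch.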
- Pixel rectangles are precomputed once at startup via std::array::from_fn.
- The batch tensor is assembled from a contiguous Vec<f32> in CHW layout matching ONNX Runtime's expected memory format, with no transposition or extra copying.
- BGR→RGB conversion uses OpenCV's cvt_color with a single copy to RgbImage.
PyTorch's native checkpoint format can't be consumed by ONNX Runtime, so the model is exported to ONNX (Open Neural Network Exchange), an open standard that enables runtime-agnostic inference. The export preserves dynamic batch axes, allowing the same model file to handle both fixed [5,…] and variable [N,…] batch sizes.
| Crate | Purpose |
|---|---|
| opencv | Video decode/encode, frame drawing |
| ort | ONNX Runtime bindings (with CoreML EP) |
| ndarray | Tensor construction |
| rayon | Data-parallel preprocessing |
| image | Crop and resize operations |
| indicatif | Terminal progress bar |
| serde_json | Class label loading |
| anyhow | Error handling |
```sh
# Build both binaries
cargo build --release

# Run batch mode (max throughput, fixed [5,3,224,224])
cargo run --release --bin batch

# Run gatekeeper mode (stateful inference skipping)
cargo run --release --bin gatekeeper
```
Part of clash-royale-suite — a research-oriented framework combining high-performance systems programming and deep learning for autonomous game intelligence.