Rust — multi-threaded batch processing for large-scale CV dataset sanitization
Source: cr-data-engine/Deduper/src/main.rs
A data processing engine written in Rust that eliminates temporal redundancy from massive video-frame datasets. Raw gameplay footage produces hundreds of thousands of frames, but consecutive frames where the card hand hasn't changed are identical from a training perspective — they add storage cost and training noise without adding information. This engine identifies and removes those duplicates at 221 FPS across 440,000 frames.
The problem is simple: compare each frame's card slots to the previous frame's, and keep only the frames where something changed. The engineering challenge is doing this across 440,000 images spanning 268 directories on an external SSD, fast enough that disk I/O — not compute — is the bottleneck.
The engine has four stages: recursive directory discovery, parallel feature extraction across all CPU cores, sequential temporal comparison within each video folder, and zero-copy file output at native SSD bandwidth.
Benchmark executed on a dataset of raw gameplay footage spanning 268 directories on an external SSD, including mixed iPhone and iPad aspect ratios.
| Metric | Value |
|---|---|
| Total frames analyzed | 440,056 |
| Processing time | 33 minutes (1,990 seconds) |
| Throughput | ~221 FPS (including disk I/O) |
| Temporal duplicates dropped | 269,214 (61.2%) |
| Unique frames saved | 170,842 |
| Directories traversed | 268 |
The 221 FPS throughput includes the full pipeline: image decode from SSD, pixel-level feature extraction, temporal comparison, and file copy back to SSD. The bottleneck is disk I/O on the external SSD, not CPU — Rayon keeps all cores saturated while individual threads block on reads.
Standard Python pipelines allocate new memory for every image crop. The engine uses image::imageops::crop_imm, which creates lightweight immutable views into the original image buffer — a mathematical window, not a copy. Across 440K frames × 5 crops each, this eliminates 2.2 million unnecessary allocations and prevents RAM thrashing.
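A minimal stdlib sketch of the view-not-copy idea behind crop_imm (the `View` type here is hypothetical and much simpler than the crate's `SubImage`, but the principle is the same: store coordinates plus a borrow, never duplicate pixels):

```rust
/// A borrowed window into an RGB buffer: coordinates plus a reference.
/// No pixel data is copied — the same idea behind image::imageops::crop_imm.
#[allow(dead_code)] // w and h record the window's extent
struct View<'a> {
    buf: &'a [u8], // source pixels, 3 bytes per pixel, row-major
    width: usize,  // source image width in pixels
    x: usize, y: usize, w: usize, h: usize,
}

impl<'a> View<'a> {
    /// Read one pixel through the window without copying the buffer.
    fn pixel(&self, px: usize, py: usize) -> [u8; 3] {
        let i = ((self.y + py) * self.width + self.x + px) * 3;
        [self.buf[i], self.buf[i + 1], self.buf[i + 2]]
    }
}

fn main() {
    // 4x2 image; each pixel's R channel encodes its column index.
    let buf: Vec<u8> = (0..8).flat_map(|i| [i as u8 % 4, 0, 0]).collect();
    let crop = View { buf: &buf, width: 4, x: 2, y: 0, w: 2, h: 1 };
    assert_eq!(crop.pixel(0, 0)[0], 2); // column 2 of the source
    println!("pixel through view: {:?}", crop.pixel(1, 0));
}
```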
Python's Global Interpreter Lock (GIL) largely confines an image-processing loop to a single core. Rayon's .par_iter() distributes frames across all logical cores using a work-stealing scheduler. When one thread blocks on an SSD read, others continue processing — maintaining near-100% CPU utilization throughout the run.
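A stdlib sketch of the same fan-out using std::thread::scope (rayon's .par_iter() adds work stealing on top of this pattern; parallel_map and the doubling workload are illustrative stand-ins for the real feature extraction):

```rust
use std::thread;

/// Fan a list of "frames" across worker threads with std::thread::scope.
/// rayon collapses this whole pattern, plus work stealing, into one line:
///     frames.par_iter().map(extract_features).collect()
fn parallel_map(frames: &[u32], workers: usize) -> Vec<u64> {
    let chunk = frames.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = frames
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&f| f as u64 * 2).collect::<Vec<u64>>()))
            .collect();
        // Joining in spawn order preserves the original frame order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let frames: Vec<u32> = (0..100).collect();
    let doubled = parallel_map(&frames, 4);
    assert_eq!(doubled[99], 198);
    println!("processed {} frames", doubled.len());
}
```

Scoped threads let the workers borrow the frame list directly; rayon does the same borrow-friendly distribution but steals work between idle and busy threads, which is what keeps cores busy while some threads block on SSD reads.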
Frames are grouped by parent directory (via HashMap<PathBuf, Vec<PathBuf>>) before temporal filtering. This ensures the N-to-N−1 comparison only happens between frames from the same video. Without this, a match-ending frame from one video could be incorrectly compared to the opening frame of the next, corrupting the deduplication logic.
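The grouping step can be sketched like this (group_by_video is a hypothetical name; real paths come from the walkdir stage):

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Group frame paths by their parent directory so temporal comparison
/// never crosses a video boundary.
fn group_by_video(frames: &[PathBuf]) -> HashMap<PathBuf, Vec<PathBuf>> {
    let mut groups: HashMap<PathBuf, Vec<PathBuf>> = HashMap::new();
    for f in frames {
        if let Some(parent) = f.parent() {
            groups.entry(parent.to_path_buf()).or_default().push(f.clone());
        }
    }
    // Frames inside each video must be in temporal order for the N vs N-1 pass.
    for v in groups.values_mut() {
        v.sort();
    }
    groups
}

fn main() {
    let frames = vec![
        PathBuf::from("vid_a/frame_0002.jpg"),
        PathBuf::from("vid_a/frame_0001.jpg"),
        PathBuf::from("vid_b/frame_0001.jpg"),
    ];
    let groups = group_by_video(&frames);
    assert_eq!(groups.len(), 2);
    println!("{} videos", groups.len());
}
```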
Unique frames are written to the output directory with std::fs::copy, which delegates the byte copy to the OS (on Linux it uses fast paths such as copy_file_range). The engine never decodes or re-encodes a JPEG on the way out, so this stage runs at the bandwidth of the storage hardware. The output directory tree mirrors the input structure.
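The path mirroring can be sketched as follows (mirror_path is an illustrative helper; the real engine then hands both paths to std::fs::copy, duplicating the JPEG bytes without decoding them):

```rust
use std::path::{Path, PathBuf};

/// Map an input frame path to its mirror under the cleaned output root,
/// preserving the subfolder structure.
fn mirror_path(input_root: &Path, output_root: &Path, frame: &Path) -> Option<PathBuf> {
    // strip_prefix yields the path relative to the input root, or None
    // if the frame lives outside it.
    let rel = frame.strip_prefix(input_root).ok()?;
    Some(output_root.join(rel))
}

fn main() {
    let out = mirror_path(
        Path::new("dataset"),
        Path::new("dataset_cleaned"),
        Path::new("dataset/session_01/frame_0042.jpg"),
    )
    .unwrap();
    assert_eq!(out, Path::new("dataset_cleaned/session_01/frame_0042.jpg"));
    println!("{}", out.display());
    // The real engine follows with: std::fs::copy(frame, &out)
}
```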
The dataset contains mixed iPhone (0.46 aspect ratio) and iPad (0.75 aspect ratio) recordings. Rather than hardcoding pixel coordinates, the engine computes the input aspect ratio and selects the closest matching device layout at runtime. Card slot ROIs are defined as percentage-based bounding boxes, making the system resolution-independent.
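A sketch of the runtime layout selection; the ROI percentages below are placeholders, not the engine's actual configuration:

```rust
/// A card-slot ROI expressed as fractions of frame size (resolution-independent).
#[derive(Clone, Copy)]
struct RoiPct { x: f32, y: f32, w: f32, h: f32 }

struct Layout { aspect: f32, slots: [RoiPct; 5] }

/// Pick the device layout whose aspect ratio is closest to the input frame's.
fn pick_layout(layouts: &[Layout], width: u32, height: u32) -> &Layout {
    let ar = width as f32 / height as f32;
    layouts
        .iter()
        .min_by(|a, b| (a.aspect - ar).abs().partial_cmp(&(b.aspect - ar).abs()).unwrap())
        .expect("at least one layout")
}

/// Convert a percentage ROI to pixel coordinates for a concrete frame.
fn to_pixels(r: RoiPct, width: u32, height: u32) -> (u32, u32, u32, u32) {
    (
        (r.x * width as f32) as u32,
        (r.y * height as f32) as u32,
        (r.w * width as f32) as u32,
        (r.h * height as f32) as u32,
    )
}

fn main() {
    let slot = RoiPct { x: 0.25, y: 0.85, w: 0.10, h: 0.10 }; // placeholder values
    let layouts = [
        Layout { aspect: 0.46, slots: [slot; 5] }, // iPhone-like
        Layout { aspect: 0.75, slots: [slot; 5] }, // iPad-like
    ];
    // An 886x1920 recording (aspect ~0.46) selects the iPhone layout.
    let l = pick_layout(&layouts, 886, 1920);
    assert!((l.aspect - 0.46).abs() < 1e-6);
    println!("chosen aspect: {}", l.aspect);
    println!("slot 0 in pixels: {:?}", to_pixels(l.slots[0], 886, 1920));
}
```

Because the ROIs are fractions, the same layout serves any resolution of the matching device class; only the aspect-ratio match decides which layout applies.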
The core comparison reduces each frame to a 5-element color fingerprint: the average RGB value of each card slot. Two consecutive frames are considered duplicates if the Manhattan distance across all 5 slots falls below a threshold.
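A sketch of the fingerprint comparison; summing per-channel differences is an assumption, since the text specifies only Manhattan distance across the five slots:

```rust
/// One frame's fingerprint: average (R, G, B) of each of the 5 card slots.
type Fingerprint = [[f32; 3]; 5];

/// Manhattan (L1) distance summed over all slots and channels.
fn manhattan(a: &Fingerprint, b: &Fingerprint) -> f32 {
    a.iter()
        .zip(b)
        .flat_map(|(sa, sb)| sa.iter().zip(sb).map(|(x, y)| (x - y).abs()))
        .sum()
}

/// Consecutive frames are duplicates if the distance is under the threshold.
fn is_duplicate(prev: &Fingerprint, cur: &Fingerprint, threshold: f32) -> bool {
    manhattan(prev, cur) < threshold
}

fn main() {
    let prev: Fingerprint = [[120.0, 80.0, 60.0]; 5];
    // Compression noise: every channel drifts by 1 unit, distance 15.
    let noisy: Fingerprint = [[121.0, 81.0, 61.0]; 5];
    // A card change shifts one slot's average color strongly, distance 90.
    let mut changed = prev;
    changed[2] = [150.0, 110.0, 90.0];

    assert!(is_duplicate(&prev, &noisy, 50.0));    // dropped as redundant
    assert!(!is_duplicate(&prev, &changed, 50.0)); // kept as a new hand state
    println!("noisy: {}, changed: {}", manhattan(&prev, &noisy), manhattan(&prev, &changed));
}
```

The example also shows why the threshold sits where it does: JPEG noise perturbs every channel a little (small total distance), while a card swap moves one slot's average color a lot.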
The threshold of 50 was tuned empirically. Lower values (stricter) preserve more frames but miss duplicates caused by video compression artifacts. Higher values (looser) over-prune and risk dropping genuine card changes. At 50, the engine achieves a 61% reduction while preserving all visually distinct hand states.
This is deliberately a lightweight feature — average color, not a neural embedding or perceptual hash. The goal is to run at SSD speed, not model accuracy. Downstream, the MobileNetV3 classifier handles the actual card identification; this engine just ensures it trains on non-redundant data.
| Crate | Purpose |
|---|---|
image | JPEG decode, zero-copy cropping, pixel access |
rayon | Work-stealing parallel iteration |
walkdir | Recursive directory traversal |
```shell
# Build in release mode (critical for performance — debug builds are 10-20x slower)
cargo build --release

# Run the deduplication engine
cargo run --release
```
Configure input/output paths at the top of src/main.rs. The engine creates a {input_dir}_cleaned/ output directory automatically, preserving the full subfolder structure.
Part of clash-royale-suite — the data-centric AI core responsible for dataset sanitization, super-resolution upscaling (Real-ESRGAN), and multi-stage augmentation.