High-throughput temporal deduplication engine

Rust — multi-threaded batch processing for large-scale CV dataset sanitization

Source: cr-data-engine/Deduper/src/main.rs

A data processing engine written in Rust that eliminates temporal redundancy from massive video-frame datasets. Raw gameplay footage produces hundreds of thousands of frames, but consecutive frames where the card hand hasn't changed are identical from a training perspective — they add storage cost and training noise without adding information. This engine identifies and removes those duplicates at 221 FPS across 440,000 frames.

The problem is simple: compare each frame's card slots to the previous frame's, and keep only the frames where something changed. The engineering challenge is doing this across 440,000 images spanning 268 directories on an external SSD, fast enough that disk I/O — not compute — is the bottleneck.

Architecture pipeline

The engine has four stages: recursive directory discovery, parallel feature extraction across all CPU cores, sequential temporal comparison within each video folder, and zero-copy file output at native SSD bandwidth.

Stage 1: Recursive directory discovery. WalkDir crawls the dataset tree, groups JPEGs by parent folder, and preserves per-video temporal ordering (268 directories, 440,056 frames, mixed iPhone and iPad aspect ratios). Output: HashMap<PathBuf, Vec<PathBuf>>.

Stage 2: Parallel feature extraction (Rayon). Each frame is processed on a rayon worker thread: detect aspect ratio, select the device layout, crop the 5 card slots with crop_imm (a zero-copy view into RAM), and average the RGB of each slot. Output: Vec<FrameData>, sorted by frame_number.

Stage 3: Temporal redundancy filter (sequential). Frame N is compared to frame N−1 by Manhattan distance across all 5 card slots; if the total delta is at or below the threshold (50, RGB Manhattan distance), the frame is dropped as a duplicate. Per-folder isolation ensures frames from different videos are never compared. Output: Vec<usize> of unique frame indices.

Stage 4: Zero-copy file output. std::fs::copy issues direct OS-level syscalls with no JPEG decode/re-encode; the raw binary blob is duplicated at maximum SSD bandwidth and the input directory tree is preserved in the output. Result: 170,842 unique frames written to {dataset}_cleaned/; 61% of all frames dropped as duplicates.
Figure 1 — Four-stage pipeline. Stage 2 is embarrassingly parallel (rayon); Stage 3 is sequential per-folder to preserve temporal ordering; Stage 4 bypasses image codecs entirely.
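The record handed from Stage 2 to Stage 3 can be pictured as follows; the field names are assumptions based on the figure, not the engine's actual struct definition:

```rust
use std::path::PathBuf;

/// Illustrative shape of the per-frame record that Stage 2 emits and
/// Stage 3 consumes. Field names are assumptions, not the real struct.
#[derive(Debug)]
struct FrameData {
    frame_number: usize,       // temporal position within its video folder
    path: PathBuf,             // source JPEG, copied verbatim in Stage 4
    slot_colors: [[u8; 3]; 5], // average RGB of the 5 card slots
}

fn main() {
    // Rayon finishes frames out of order, so Stage 2 output is sorted by
    // frame_number before the sequential Stage 3 comparison pass.
    let mut frames = vec![
        FrameData { frame_number: 2, path: PathBuf::from("v/f_0002.jpg"), slot_colors: [[0; 3]; 5] },
        FrameData { frame_number: 1, path: PathBuf::from("v/f_0001.jpg"), slot_colors: [[0; 3]; 5] },
    ];
    frames.sort_by_key(|f| f.frame_number);
    assert_eq!(frames[0].frame_number, 1);
}
```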

Measured performance

440K frames processed · 61.1% duplicates dropped · 221 FPS throughput · 33 min wall time

Benchmark executed on a dataset of raw gameplay footage spanning 268 directories on an external SSD, including mixed iPhone and iPad aspect ratios.

Metric                        Value
Total frames analyzed         440,056
Processing time               33 minutes (1,990 seconds)
Throughput                    ~221 FPS (including disk I/O)
Temporal duplicates dropped   269,214 (61.1%)
Unique frames saved           170,842
Directories traversed         268

The 221 FPS throughput includes the full pipeline: image decode from SSD, pixel-level feature extraction, temporal comparison, and file copy back to SSD. The bottleneck is disk I/O on the external SSD, not CPU — Rayon keeps all cores saturated while individual threads block on reads.

Systems engineering optimizations

1. Zero-copy memory management

Standard Python pipelines allocate new memory for every image crop. The engine uses image::crop_imm, which creates lightweight immutable views into the original image buffer — a mathematical window, not a copy. Across 440K frames × 5 crops each, this eliminates 2.2 million unnecessary allocations and prevents RAM thrashing.
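The idea behind crop_imm can be sketched in plain Rust: a view struct that borrows the parent buffer and re-maps coordinates, so averaging a card slot never copies pixels. This is a conceptual illustration, not the image crate's actual implementation:

```rust
/// Conceptual sketch of what `image::crop_imm` provides: an immutable
/// window into an existing pixel buffer. No pixel data is copied; the
/// view just re-maps (x, y) coordinates into the parent buffer.
struct RgbView<'a> {
    buf: &'a [u8],  // parent buffer, 3 bytes per pixel, row-major
    stride: u32,    // parent image width in pixels
    x: u32, y: u32, // top-left corner of the window
    w: u32, h: u32, // window size
}

impl<'a> RgbView<'a> {
    /// Average RGB over the window: the per-slot color fingerprint.
    fn avg_rgb(&self) -> [u8; 3] {
        let (mut r, mut g, mut b) = (0u64, 0u64, 0u64);
        for dy in 0..self.h {
            for dx in 0..self.w {
                let i = (((self.y + dy) * self.stride + self.x + dx) * 3) as usize;
                r += self.buf[i] as u64;
                g += self.buf[i + 1] as u64;
                b += self.buf[i + 2] as u64;
            }
        }
        let n = (self.w * self.h) as u64;
        [(r / n) as u8, (g / n) as u8, (b / n) as u8]
    }
}

fn main() {
    // 2x2 solid-red image; a view over the whole buffer averages to red.
    let buf = [255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0];
    let view = RgbView { buf: &buf, stride: 2, x: 0, y: 0, w: 2, h: 2 };
    assert_eq!(view.avg_rgb(), [255, 0, 0]);
}
```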

2. Work-stealing multi-threading (Rayon)

Python's Global Interpreter Lock (GIL) restricts image processing to a single core. Rayon's .par_iter() distributes frames across all logical cores using a work-stealing scheduler. When one thread blocks on an SSD read, others continue processing — maintaining near-100% CPU utilization throughout the run.
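The parallelism pattern can be sketched with std threads alone; this hand-rolled version uses fixed chunks, while rayon's `par_iter` replaces it with a work-stealing scheduler so idle threads take work from busy ones. The sum-of-squares closure is a stand-in for per-frame feature extraction:

```rust
use std::thread;

/// Std-only sketch of fan-out over a slice: split into chunks, one thread
/// per chunk, join the results. Rayon does the same but with work stealing
/// (roughly: frames.par_iter().map(extract_features).collect()).
fn parallel_sum_of_squares(frames: &[u64], workers: usize) -> u64 {
    let chunk = frames.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = frames
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let frames: Vec<u64> = (1..=10).collect();
    assert_eq!(parallel_sum_of_squares(&frames, 4), 385);
}
```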

3. Per-folder temporal isolation

Frames are grouped by parent directory (via HashMap<PathBuf, Vec<PathBuf>>) before temporal filtering. This ensures the N-to-N−1 comparison only happens between frames from the same video. Without this, a match-ending frame from one video could be incorrectly compared to the opening frame of the next, corrupting the deduplication logic.
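The grouping step is a straightforward fold into a map keyed by parent directory; a minimal sketch (the real engine feeds this from a WalkDir traversal):

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Group frame paths by parent directory so temporal comparison never
/// crosses a video boundary. Illustrative helper, not the engine's code.
fn group_by_folder(frames: &[PathBuf]) -> HashMap<PathBuf, Vec<PathBuf>> {
    let mut groups: HashMap<PathBuf, Vec<PathBuf>> = HashMap::new();
    for frame in frames {
        let parent = frame.parent().unwrap_or(Path::new("")).to_path_buf();
        groups.entry(parent).or_default().push(frame.clone());
    }
    // Sort within each group so frames stay in temporal (filename) order.
    for paths in groups.values_mut() {
        paths.sort();
    }
    groups
}

fn main() {
    let frames = vec![
        PathBuf::from("vid_a/frame_0002.jpg"),
        PathBuf::from("vid_a/frame_0001.jpg"),
        PathBuf::from("vid_b/frame_0001.jpg"),
    ];
    let groups = group_by_folder(&frames);
    assert_eq!(groups[Path::new("vid_a")].len(), 2);
    assert_eq!(groups[Path::new("vid_b")].len(), 1);
}
```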

4. OS-level file copy

Unique frames are written to the output directory using std::fs::copy, which issues a direct syscall to duplicate the raw binary blob on disk. The engine never decodes or re-encodes a JPEG — it operates at the maximum bandwidth of the storage hardware. The output directory tree mirrors the input structure.
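Mirroring the tree amounts to swapping the input root for the `{input_dir}_cleaned` root while keeping the relative path; a sketch of that path mapping (the helper name is hypothetical, and the actual copy is a `std::fs::copy(src, dst)` after creating the destination directory):

```rust
use std::path::{Path, PathBuf};

/// Map an input frame path to its mirror location under the
/// `{input_dir}_cleaned/` tree. Illustrative helper following the
/// naming convention described above.
fn cleaned_path(input_root: &Path, frame: &Path) -> Option<PathBuf> {
    let rel = frame.strip_prefix(input_root).ok()?; // None if outside the root
    let mut root = input_root.as_os_str().to_os_string();
    root.push("_cleaned");
    Some(PathBuf::from(root).join(rel))
}

fn main() {
    let out = cleaned_path(Path::new("dataset"), Path::new("dataset/vid_a/f_0001.jpg"));
    assert_eq!(out, Some(PathBuf::from("dataset_cleaned/vid_a/f_0001.jpg")));
}
```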

5. Adaptive device layout detection

The dataset contains mixed iPhone (0.46 aspect ratio) and iPad (0.75 aspect ratio) recordings. Rather than hardcoding pixel coordinates, the engine computes the input aspect ratio and selects the closest matching device layout at runtime. Card slot ROIs are defined as percentage-based bounding boxes, making the system resolution-independent.

layout = argmin(|aspect_ratio(frame) − aspect_ratio(device)|) over {iPhone, iPad}
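The argmin over two candidates reduces to a single comparison; a minimal sketch using the aspect ratios stated above (the enum and function names are illustrative):

```rust
/// Select the device layout whose reference aspect ratio is closest to
/// the frame's. Ratios (iPhone ~0.46, iPad ~0.75) are from the text;
/// names are illustrative, not the engine's actual identifiers.
#[derive(Debug, PartialEq)]
enum Layout { IPhone, IPad }

fn select_layout(width: u32, height: u32) -> Layout {
    let ratio = width as f64 / height as f64;
    if (ratio - 0.46).abs() <= (ratio - 0.75).abs() {
        Layout::IPhone
    } else {
        Layout::IPad
    }
}

fn main() {
    assert_eq!(select_layout(886, 1920), Layout::IPhone); // ratio ~0.461
    assert_eq!(select_layout(1536, 2048), Layout::IPad);  // ratio 0.75
}
```

Because the card-slot ROIs are percentage-based, the same two layouts cover every recording resolution of each device family.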

Deduplication algorithm

The core comparison reduces each frame to a 5-element color fingerprint: the average RGB value of each card slot. Two consecutive frames are considered duplicates if the Manhattan distance across all 5 slots falls below a threshold.

duplicate(N, N−1) ⇔ Σ_{slot=1..5} ( |R_N − R_{N−1}| + |G_N − G_{N−1}| + |B_N − B_{N−1}| ) ≤ 50

The threshold of 50 was tuned empirically. Lower values (stricter) preserve more frames but fail to drop near-duplicates whose pixels differ only through video compression noise. Higher values (looser) over-prune and risk dropping genuine card changes. At 50, the engine achieves a 61% reduction while preserving all visually distinct hand states.
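The comparison above fits in a few lines of Rust; a minimal sketch with illustrative names (the fingerprint layout follows the 5-slot average-RGB description):

```rust
/// A frame's fingerprint: average RGB of each of the 5 card slots.
type Fingerprint = [[u8; 3]; 5];

/// Manhattan distance summed over all slots and channels.
fn manhattan(a: &Fingerprint, b: &Fingerprint) -> u32 {
    a.iter()
        .zip(b.iter())
        .flat_map(|(sa, sb)| sa.iter().zip(sb.iter()))
        .map(|(&x, &y)| (x as i32 - y as i32).unsigned_abs())
        .sum()
}

/// Frame N is a duplicate of N−1 if its fingerprint is within the threshold.
fn is_duplicate(prev: &Fingerprint, next: &Fingerprint, threshold: u32) -> bool {
    manhattan(prev, next) <= threshold
}

fn main() {
    let prev: Fingerprint = [[100, 100, 100]; 5];
    let mut next = prev;
    next[0] = [110, 100, 100]; // compression-noise-sized change: delta = 10
    assert!(is_duplicate(&prev, &next, 50));
    next[0] = [200, 100, 100]; // a real card change: delta = 100
    assert!(!is_duplicate(&prev, &next, 50));
}
```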

This is deliberately a lightweight feature — average color, not a neural embedding or perceptual hash. The goal is to run at SSD speed, not model accuracy. Downstream, the MobileNetV3 classifier handles the actual card identification; this engine just ensures it trains on non-redundant data.

Build & run

Crate     Purpose
image     JPEG decode, zero-copy cropping, pixel access
rayon     Work-stealing parallel iteration
walkdir   Recursive directory traversal
# Build in release mode (critical for performance — debug builds are 10-20x slower)
cargo build --release

# Run the deduplication engine
cargo run --release

Configure input/output paths at the top of src/main.rs. The engine creates a {input_dir}_cleaned/ output directory automatically, preserving the full subfolder structure.


Part of clash-royale-suite — the data-centric AI core responsible for dataset sanitization, super-resolution upscaling (Real-ESRGAN), and multi-stage augmentation.