High-throughput temporal deduplication engine

Rust — multi-threaded batch processing for large-scale CV dataset sanitization

Source: cr-data-engine/Deduper/src/main.rs

A data processing engine written in Rust that eliminates temporal redundancy from massive video-frame datasets. Raw gameplay footage produces hundreds of thousands of frames, but consecutive frames where the card hand hasn't changed are identical from a training perspective — they add storage cost and training noise without adding information. This engine identifies and removes those duplicates at 221 FPS across 440,000 frames.

The problem is simple: compare each frame's card slots to the previous frame's, and keep only the frames where something changed. The engineering challenge is doing this across 440,000 images spanning 268 directories on an external SSD, fast enough that disk I/O — not compute — is the bottleneck.

Architecture pipeline

The engine has four stages: recursive directory discovery, parallel feature extraction across all CPU cores, sequential temporal comparison within each video folder, and zero-copy file output at native SSD bandwidth.

Stage 1: Recursive directory discovery. WalkDir crawls the dataset tree, groups JPEGs by parent folder, and preserves per-video temporal ordering (268 directories, 440,056 frames, mixed iPhone and iPad aspect ratios). Output: HashMap<PathBuf, Vec<PathBuf>>.

Stage 2: Parallel feature extraction (Rayon). Each frame is processed on a rayon worker thread: detect aspect ratio, select the device layout, crop the 5 card slots with crop_imm (a zero-copy view into RAM), and average the RGB of each slot. Output: Vec<FrameData>, sorted by frame_number.

Stage 3: Temporal redundancy filter (sequential). Frame N is compared to frame N−1 by Manhattan distance across all 5 card slots; if the total delta is at or below the threshold (50, RGB Manhattan distance), the frame is dropped as a duplicate. Per-folder isolation ensures frames from different videos are never compared. Output: Vec<usize> of unique frame indices.

Stage 4: Zero-copy file output. std::fs::copy issues direct OS-level syscalls with no JPEG decode/re-encode; the raw binary blob is duplicated at maximum SSD bandwidth and the input directory tree is preserved in the output. Result: 170,842 unique frames written to {dataset}_cleaned/; 61% of all frames dropped as duplicates.
Figure 1 — Four-stage pipeline. Stage 2 is embarrassingly parallel (rayon); Stage 3 is sequential per-folder to preserve temporal ordering; Stage 4 bypasses image codecs entirely.
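The record handed from Stage 2 to Stage 3 can be pictured as follows; the field names are assumptions based on the figure, not the engine's actual struct definition:

```rust
use std::path::PathBuf;

/// Illustrative shape of the per-frame record that Stage 2 emits and
/// Stage 3 consumes. Field names are assumptions, not the real struct.
#[derive(Debug)]
struct FrameData {
    frame_number: usize,       // temporal position within its video folder
    path: PathBuf,             // source JPEG, copied verbatim in Stage 4
    slot_colors: [[u8; 3]; 5], // average RGB of the 5 card slots
}

fn main() {
    // Rayon finishes frames out of order, so Stage 2 output is sorted by
    // frame_number before the sequential Stage 3 comparison pass.
    let mut frames = vec![
        FrameData { frame_number: 2, path: PathBuf::from("v/f_0002.jpg"), slot_colors: [[0; 3]; 5] },
        FrameData { frame_number: 1, path: PathBuf::from("v/f_0001.jpg"), slot_colors: [[0; 3]; 5] },
    ];
    frames.sort_by_key(|f| f.frame_number);
    assert_eq!(frames[0].frame_number, 1);
}
```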

Measured performance

440K frames processed · 61.1% duplicates dropped · 221 FPS throughput · 33 min wall time

Benchmark executed on a dataset of raw gameplay footage spanning 268 directories on an external SSD, including mixed iPhone and iPad aspect ratios.

Metric                        Value
Total frames analyzed         440,056
Processing time               33 minutes (1,990 seconds)
Throughput                    ~221 FPS (including disk I/O)
Temporal duplicates dropped   269,214 (61.1%)
Unique frames saved           170,842
Directories traversed         268

The 221 FPS throughput includes the full pipeline: image decode from SSD, pixel-level feature extraction, temporal comparison, and file copy back to SSD. The bottleneck is disk I/O on the external SSD, not CPU — Rayon keeps all cores saturated while individual threads block on reads.

Systems engineering optimizations

1. Zero-copy memory management

Standard Python pipelines allocate new memory for every image crop. The engine uses image::crop_imm, which creates lightweight immutable views into the original image buffer — a mathematical window, not a copy. Across 440K frames × 5 crops each, this eliminates 2.2 million unnecessary allocations and prevents RAM thrashing.
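The idea behind crop_imm can be sketched in plain Rust: a view struct that borrows the parent buffer and re-maps coordinates, so averaging a card slot never copies pixels. This is a conceptual illustration, not the image crate's actual implementation:

```rust
/// Conceptual sketch of what `image::crop_imm` provides: an immutable
/// window into an existing pixel buffer. No pixel data is copied; the
/// view just re-maps (x, y) coordinates into the parent buffer.
struct RgbView<'a> {
    buf: &'a [u8],  // parent buffer, 3 bytes per pixel, row-major
    stride: u32,    // parent image width in pixels
    x: u32, y: u32, // top-left corner of the window
    w: u32, h: u32, // window size
}

impl<'a> RgbView<'a> {
    /// Average RGB over the window: the per-slot color fingerprint.
    fn avg_rgb(&self) -> [u8; 3] {
        let (mut r, mut g, mut b) = (0u64, 0u64, 0u64);
        for dy in 0..self.h {
            for dx in 0..self.w {
                let i = (((self.y + dy) * self.stride + self.x + dx) * 3) as usize;
                r += self.buf[i] as u64;
                g += self.buf[i + 1] as u64;
                b += self.buf[i + 2] as u64;
            }
        }
        let n = (self.w * self.h) as u64;
        [(r / n) as u8, (g / n) as u8, (b / n) as u8]
    }
}

fn main() {
    // 2x2 solid-red image; a view over the whole buffer averages to red.
    let buf = [255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0];
    let view = RgbView { buf: &buf, stride: 2, x: 0, y: 0, w: 2, h: 2 };
    assert_eq!(view.avg_rgb(), [255, 0, 0]);
}
```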

2. Work-stealing multi-threading (Rayon)

Python's Global Interpreter Lock (GIL) restricts image processing to a single core. Rayon's .par_iter() distributes frames across all logical cores using a work-stealing scheduler. When one thread blocks on an SSD read, others continue processing — maintaining near-100% CPU utilization throughout the run.
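The parallelism pattern can be sketched with std threads alone; this hand-rolled version uses fixed chunks, while rayon's `par_iter` replaces it with a work-stealing scheduler so idle threads take work from busy ones. The sum-of-squares closure is a stand-in for per-frame feature extraction:

```rust
use std::thread;

/// Std-only sketch of fan-out over a slice: split into chunks, one thread
/// per chunk, join the results. Rayon does the same but with work stealing
/// (roughly: frames.par_iter().map(extract_features).collect()).
fn parallel_sum_of_squares(frames: &[u64], workers: usize) -> u64 {
    let chunk = frames.len().div_ceil(workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = frames
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let frames: Vec<u64> = (1..=10).collect();
    assert_eq!(parallel_sum_of_squares(&frames, 4), 385);
}
```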

3. Per-folder temporal isolation

Frames are grouped by parent directory (via HashMap<PathBuf, Vec<PathBuf>>) before temporal filtering. This ensures the N-to-N−1 comparison only happens between frames from the same video. Without this, a match-ending frame from one video could be incorrectly compared to the opening frame of the next, corrupting the deduplication logic.
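The grouping step is a straightforward fold into a map keyed by parent directory; a minimal sketch (the real engine feeds this from a WalkDir traversal):

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Group frame paths by parent directory so temporal comparison never
/// crosses a video boundary. Illustrative helper, not the engine's code.
fn group_by_folder(frames: &[PathBuf]) -> HashMap<PathBuf, Vec<PathBuf>> {
    let mut groups: HashMap<PathBuf, Vec<PathBuf>> = HashMap::new();
    for frame in frames {
        let parent = frame.parent().unwrap_or(Path::new("")).to_path_buf();
        groups.entry(parent).or_default().push(frame.clone());
    }
    // Sort within each group so frames stay in temporal (filename) order.
    for paths in groups.values_mut() {
        paths.sort();
    }
    groups
}

fn main() {
    let frames = vec![
        PathBuf::from("vid_a/frame_0002.jpg"),
        PathBuf::from("vid_a/frame_0001.jpg"),
        PathBuf::from("vid_b/frame_0001.jpg"),
    ];
    let groups = group_by_folder(&frames);
    assert_eq!(groups[Path::new("vid_a")].len(), 2);
    assert_eq!(groups[Path::new("vid_b")].len(), 1);
}
```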

4. OS-level file copy

Unique frames are written to the output directory using std::fs::copy, which issues a direct syscall to duplicate the raw binary blob on disk. The engine never decodes or re-encodes a JPEG — it operates at the maximum bandwidth of the storage hardware. The output directory tree mirrors the input structure.
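Mirroring the tree amounts to swapping the input root for the `{input_dir}_cleaned` root while keeping the relative path; a sketch of that path mapping (the helper name is hypothetical, and the actual copy is a `std::fs::copy(src, dst)` after creating the destination directory):

```rust
use std::path::{Path, PathBuf};

/// Map an input frame path to its mirror location under the
/// `{input_dir}_cleaned/` tree. Illustrative helper following the
/// naming convention described above.
fn cleaned_path(input_root: &Path, frame: &Path) -> Option<PathBuf> {
    let rel = frame.strip_prefix(input_root).ok()?; // None if outside the root
    let mut root = input_root.as_os_str().to_os_string();
    root.push("_cleaned");
    Some(PathBuf::from(root).join(rel))
}

fn main() {
    let out = cleaned_path(Path::new("dataset"), Path::new("dataset/vid_a/f_0001.jpg"));
    assert_eq!(out, Some(PathBuf::from("dataset_cleaned/vid_a/f_0001.jpg")));
}
```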

5. Adaptive device layout detection

The dataset contains mixed iPhone (0.46 aspect ratio) and iPad (0.75 aspect ratio) recordings. Rather than hardcoding pixel coordinates, the engine computes the input aspect ratio and selects the closest matching device layout at runtime. Card slot ROIs are defined as percentage-based bounding boxes, making the system resolution-independent.

layout = argmin(|aspect_ratio(frame) − aspect_ratio(device)|) over {iPhone, iPad}
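The argmin over two candidates reduces to a single comparison; a minimal sketch using the aspect ratios stated above (the enum and function names are illustrative):

```rust
/// Select the device layout whose reference aspect ratio is closest to
/// the frame's. Ratios (iPhone ~0.46, iPad ~0.75) are from the text;
/// names are illustrative, not the engine's actual identifiers.
#[derive(Debug, PartialEq)]
enum Layout { IPhone, IPad }

fn select_layout(width: u32, height: u32) -> Layout {
    let ratio = width as f64 / height as f64;
    if (ratio - 0.46).abs() <= (ratio - 0.75).abs() {
        Layout::IPhone
    } else {
        Layout::IPad
    }
}

fn main() {
    assert_eq!(select_layout(886, 1920), Layout::IPhone); // ratio ~0.461
    assert_eq!(select_layout(1536, 2048), Layout::IPad);  // ratio 0.75
}
```

Because the card-slot ROIs are percentage-based, the same two layouts cover every recording resolution of each device family.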

Deduplication algorithm

The core comparison reduces each frame to a 5-element color fingerprint: the average RGB value of each card slot. Two consecutive frames are considered duplicates if the Manhattan distance across all 5 slots falls below a threshold.

duplicate(N, N−1) ⇔ Σ_{slot=1..5} ( |R_N − R_{N−1}| + |G_N − G_{N−1}| + |B_N − B_{N−1}| ) ≤ 50

The threshold of 50 was tuned empirically. Lower values (stricter) preserve more frames but fail to drop near-duplicates whose pixels differ only through video compression noise. Higher values (looser) over-prune and risk dropping genuine card changes. At 50, the engine achieves a 61% reduction while preserving all visually distinct hand states.
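The comparison above fits in a few lines of Rust; a minimal sketch with illustrative names (the fingerprint layout follows the 5-slot average-RGB description):

```rust
/// A frame's fingerprint: average RGB of each of the 5 card slots.
type Fingerprint = [[u8; 3]; 5];

/// Manhattan distance summed over all slots and channels.
fn manhattan(a: &Fingerprint, b: &Fingerprint) -> u32 {
    a.iter()
        .zip(b.iter())
        .flat_map(|(sa, sb)| sa.iter().zip(sb.iter()))
        .map(|(&x, &y)| (x as i32 - y as i32).unsigned_abs())
        .sum()
}

/// Frame N is a duplicate of N−1 if its fingerprint is within the threshold.
fn is_duplicate(prev: &Fingerprint, next: &Fingerprint, threshold: u32) -> bool {
    manhattan(prev, next) <= threshold
}

fn main() {
    let prev: Fingerprint = [[100, 100, 100]; 5];
    let mut next = prev;
    next[0] = [110, 100, 100]; // compression-noise-sized change: delta = 10
    assert!(is_duplicate(&prev, &next, 50));
    next[0] = [200, 100, 100]; // a real card change: delta = 100
    assert!(!is_duplicate(&prev, &next, 50));
}
```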

This is deliberately a lightweight feature — average color, not a neural embedding or perceptual hash. The goal is to run at SSD speed, not model accuracy. Downstream, the MobileNetV3 classifier handles the actual card identification; this engine just ensures it trains on non-redundant data.

Build & run

Crate     Purpose
image     JPEG decode, zero-copy cropping, pixel access
rayon     Work-stealing parallel iteration
walkdir   Recursive directory traversal
# Build in release mode (critical for performance — debug builds are 10-20x slower)
cargo build --release

# Run the deduplication engine
cargo run --release

Configure input/output paths at the top of src/main.rs. The engine creates a {input_dir}_cleaned/ output directory automatically, preserving the full subfolder structure.


Part of clash-royale-suite — the data-centric AI core responsible for dataset sanitization, super-resolution upscaling (Real-ESRGAN), and multi-stage augmentation.