
100GB processing on a basic machine

A 100GB CSV dataset was processed on a single low-spec machine. The initial approach failed due to memory exhaustion. A revised workflow stabilised memory and reduced runtime without scaling hardware.

Data size: ~100GB raw CSV
Machine: 8 cores, 16GB RAM, SSD
Goal: Clean, type-fix, and aggregate data
Constraint: No cluster, no extra memory

What failed first

The initial pipeline attempted to load the full dataset into memory for cleaning. Memory usage increased steadily until the process terminated. Logging showed no single step was responsible; the failure came from memory accumulating across steps.

Approach A (baseline)
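The write-up does not include the baseline code. Below is a minimal sketch of the load-everything pattern it describes, assuming a Python/pandas workflow with illustrative file and column names (none of these are stated in the original):

```python
import pandas as pd

# Baseline pattern: read the entire ~100GB CSV into one DataFrame, then
# clean, type-fix, and aggregate in memory. On a 16GB machine the read
# alone exhausts RAM long before the aggregation runs.
df = pd.read_csv("events.csv")                         # hypothetical input file

df = df.dropna(subset=["user_id"])                     # illustrative cleaning step
df["amount"] = df["amount"].astype("float64")          # illustrative type fix

result = df.groupby("user_id")["amount"].sum()         # illustrative aggregation
result.to_csv("aggregated.csv")
```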

Why it failed
Memory was retained across operations. Garbage collection could not keep up. The dataset size exceeded practical in-memory limits.

Approach B (fix)
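The revised code is not shown either. The evidence block notes that chunk size was controlled and output was written incrementally, so the sketch below illustrates that chunked pattern under the same pandas assumption, with the same hypothetical file and column names:

```python
import pandas as pd

CHUNK_ROWS = 1_000_000  # controlled chunk size; tune so one chunk fits comfortably in RAM

partials = []
first = True
for chunk in pd.read_csv("events.csv", chunksize=CHUNK_ROWS):   # hypothetical input file
    chunk = chunk.dropna(subset=["user_id"])                    # illustrative cleaning step
    chunk["amount"] = chunk["amount"].astype("float64")         # illustrative type fix

    # Keep only a small per-chunk aggregate in memory.
    partials.append(chunk.groupby("user_id")["amount"].sum())

    # Write cleaned rows incrementally instead of holding them all in memory.
    chunk.to_csv("cleaned.csv", mode="w" if first else "a", header=first, index=False)
    first = False

# Combine the per-chunk partial sums into the final aggregate.
result = pd.concat(partials).groupby(level=0).sum()
result.to_csv("aggregated.csv")
```

Peak memory is then bounded by one chunk plus the running partial results rather than by the dataset size, which is what lets the job fit on a 16GB machine.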

Measured results

Runtime: ~4 hours → ~55 minutes
Peak memory: ~22GB → ~3.2GB
Stability: No crashes across runs

Notes and trade-offs

Reusable checklist

Confirm the dataset size before choosing an approach.
Control the chunk size so each batch fits comfortably in memory (see the sizing sketch below).
Write output incrementally instead of holding results in memory.
Track peak memory across the run.
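For the chunk-size item, one way to control it is to estimate per-row memory from a small sample and size chunks to a target budget. A minimal sketch under the same pandas assumption, with the same hypothetical file name and an illustrative budget:

```python
import pandas as pd

TARGET_CHUNK_BYTES = 500 * 1024**2   # illustrative budget: ~500MB in memory per chunk

# Estimate per-row memory cost from a small sample, then derive a chunk size.
sample = pd.read_csv("events.csv", nrows=10_000)           # hypothetical input file
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
chunk_rows = max(1, int(TARGET_CHUNK_BYTES / bytes_per_row))
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize={chunk_rows}")
```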

Evidence block

A short record that helps review and reproduce the result.

Data size: ~100GB raw CSV
Machine: 8 cores, 16GB RAM, SSD
Goal: Clean, index, and aggregate data
Constraint: No cluster, no extra memory
Checks used: Dataset size confirmed, chunk size controlled, output written incrementally, peak memory tracked
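For the last check, peak memory can be recorded from inside the process itself. A minimal sketch, assuming a Python process on a Unix system (the resource module is not available on Windows, and ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
import resource

# Peak resident set size of the current process, sampled at the end of the run.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak memory: {peak_kb / 1024:.1f} MB (assuming Linux, where ru_maxrss is in KB)")
```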