100GB processing on a basic machine
A 100GB CSV dataset was processed on a single low-spec machine. The initial approach failed due to memory exhaustion. A revised workflow stabilised memory and reduced runtime without scaling hardware.
- Data size: ~100GB raw CSV
- Machine: 8 cores, 16GB RAM, SSD
- Goal: Clean, type-fix, and aggregate data
- Constraint: No cluster, no extra memory
What failed first
The initial pipeline attempted to load the full dataset into memory for cleaning. Memory usage climbed steadily until the process was terminated. Logging showed no single step was slow; the failure came from memory accumulating across steps.
Approach A (baseline)
- Read full CSV into memory
- Infer types automatically
- Clean and aggregate in one pass
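For concreteness, a minimal sketch of what this baseline can look like, assuming a pandas workflow; the file path and column names (`raw.csv`, `user_id`, `amount`) are hypothetical, not taken from the original pipeline.

```python
import pandas as pd

# Load the full CSV, let pandas infer types, then clean and aggregate in one pass.
# On ~100GB of input this consumes memory steadily until the process is killed.
df = pd.read_csv("raw.csv")                     # full in-memory load, inferred types

df = df.dropna(subset=["user_id"])              # illustrative cleaning step
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

result = df.groupby("user_id")["amount"].sum()
```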
Why it failed
Memory was retained across operations. Garbage collection could not keep up. The dataset size exceeded practical in-memory limits.
Approach B (fix)
- Read data in fixed-size chunks
- Apply cleaning and type fixes during read
- Write each chunk to Parquet immediately
- Aggregate from cleaned Parquet files
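A minimal sketch of this chunked workflow, assuming pandas with a Parquet engine such as pyarrow installed; the chunk size, paths, column names, and cleaning steps are illustrative and would need tuning to the real data.

```python
import glob
import os

import pandas as pd

CHUNK_ROWS = 1_000_000  # conservative chunk size; tune so a chunk stays well under RAM

# Hypothetical column names and dtypes; declaring them up front avoids per-chunk
# type inference and keeps every Parquet part's schema identical.
DTYPES = {"user_id": "Int64", "amount": "float64", "country": "string"}

os.makedirs("clean", exist_ok=True)

# Read in fixed-size chunks, clean during the read, and write each chunk out
# immediately so nothing accumulates in memory.
reader = pd.read_csv("raw.csv", dtype=DTYPES, chunksize=CHUNK_ROWS)
for i, chunk in enumerate(reader):
    chunk = chunk.dropna(subset=["user_id"])          # illustrative cleaning step
    chunk["amount"] = chunk["amount"].clip(lower=0)   # illustrative value fix
    chunk.to_parquet(f"clean/part-{i:05d}.parquet", index=False)

# Aggregate from the cleaned Parquet parts, one file at a time.
partials = []
for path in sorted(glob.glob("clean/part-*.parquet")):
    part = pd.read_parquet(path, columns=["user_id", "amount"])
    partials.append(part.groupby("user_id")["amount"].sum())
result = pd.concat(partials).groupby(level=0).sum()
```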
Measured results
- Runtime: ~4 hours → ~55 minutes
- Peak memory: ~22GB → ~3.2GB
- Stability: no crashes across runs
Notes and trade-offs
- Chunked processing increased disk I/O
- Intermediate Parquet files required cleanup
- Schema had to be fixed early
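One way to catch the schema trade-off early is to verify that every intermediate part was written with the same schema before aggregating. A small sketch, assuming pyarrow and the illustrative `clean/part-*.parquet` layout used above:

```python
import glob

import pyarrow.parquet as pq

# A schema mismatch between parts only surfaces later, during aggregation,
# where it is harder to debug; checking the footers up front is cheap.
paths = sorted(glob.glob("clean/part-*.parquet"))
reference = pq.read_schema(paths[0])
for path in paths[1:]:
    if not pq.read_schema(path).equals(reference):
        raise ValueError(f"schema mismatch in {path}")
```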
Reusable checklist
- Confirm dataset size before loading
- Choose conservative chunk size
- Clean during read
- Write output incrementally
- Track peak memory
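The first and last checklist items are cheap to automate. A sketch assuming a Unix-like system and the illustrative `raw.csv` path:

```python
import os
import resource

# Confirm dataset size before loading; path is illustrative.
size_gb = os.path.getsize("raw.csv") / 1024**3
print(f"dataset size: {size_gb:.1f} GB")

# Track peak memory at checkpoints. On Linux ru_maxrss is reported in KiB,
# on macOS in bytes; the resource module is Unix-only.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak}")
```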
Evidence block
A short record that helps review and reproduce the result.
- Data size: ~100GB raw CSV
- Machine: 8 cores, 16GB RAM, SSD
- Goal: Clean, type-fix, and aggregate data
- Constraint: No cluster, no extra memory
- Checks used: dataset size confirmed, chunk size controlled, output written incrementally, peak memory tracked