100GB processing on a basic machine
A 100GB CSV dataset was processed on a single low-spec machine. The initial approach failed due to memory exhaustion. A revised workflow stabilised memory and reduced runtime without scaling hardware.
- Data size: ~100GB raw CSV
- Machine: 8 cores, 16GB RAM, SSD
- Goal: Clean, type-fix, and aggregate data
- Constraint: No cluster, no extra memory
What failed first
The initial pipeline attempted to load the full dataset into memory for cleaning. Memory usage climbed steadily until the process was terminated. Logging showed no single step was slow; the failure came from memory accumulating across steps.
Approach A (baseline)
- Read full CSV into memory
- Infer types automatically
- Clean and aggregate in one pass
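For concreteness, a minimal sketch of what this baseline can look like, assuming a pandas workflow; the file path and column names (`raw.csv`, `user_id`, `amount`) are hypothetical, not taken from the original pipeline.

```python
import pandas as pd

# Load the full CSV, let pandas infer types, then clean and aggregate in one pass.
# On ~100GB of input this consumes memory steadily until the process is killed.
df = pd.read_csv("raw.csv")                     # full in-memory load, inferred types

df = df.dropna(subset=["user_id"])              # illustrative cleaning step
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

result = df.groupby("user_id")["amount"].sum()
```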
Why it failed
Memory was retained across operations. Garbage collection could not keep up. The dataset size exceeded practical in-memory limits.
Approach B (fix)
- Read data in fixed-size chunks
- Apply cleaning and type fixes during read
- Write each chunk to Parquet immediately
- Aggregate from cleaned Parquet files
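A minimal sketch of this chunked workflow, assuming pandas with a Parquet engine such as pyarrow installed; the chunk size, paths, column names, and cleaning steps are illustrative and would need tuning to the real data.

```python
import glob
import os

import pandas as pd

CHUNK_ROWS = 1_000_000  # conservative chunk size; tune so a chunk stays well under RAM

# Hypothetical column names and dtypes; declaring them up front avoids per-chunk
# type inference and keeps every Parquet part's schema identical.
DTYPES = {"user_id": "Int64", "amount": "float64", "country": "string"}

os.makedirs("clean", exist_ok=True)

# Read in fixed-size chunks, clean during the read, and write each chunk out
# immediately so nothing accumulates in memory.
reader = pd.read_csv("raw.csv", dtype=DTYPES, chunksize=CHUNK_ROWS)
for i, chunk in enumerate(reader):
    chunk = chunk.dropna(subset=["user_id"])          # illustrative cleaning step
    chunk["amount"] = chunk["amount"].clip(lower=0)   # illustrative value fix
    chunk.to_parquet(f"clean/part-{i:05d}.parquet", index=False)

# Aggregate from the cleaned Parquet parts, one file at a time.
partials = []
for path in sorted(glob.glob("clean/part-*.parquet")):
    part = pd.read_parquet(path, columns=["user_id", "amount"])
    partials.append(part.groupby("user_id")["amount"].sum())
result = pd.concat(partials).groupby(level=0).sum()
```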
Measured results
- Runtime: ~4 hours → ~55 minutes
- Peak memory: ~22GB → ~3.2GB
- Stability: no crashes across runs
Notes and trade-offs
- Chunked processing increased disk I/O
- Intermediate Parquet files required cleanup
- Schema had to be fixed early
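One way to catch the schema trade-off early is to verify that every intermediate part was written with the same schema before aggregating. A small sketch, assuming pyarrow and the illustrative `clean/part-*.parquet` layout used above:

```python
import glob

import pyarrow.parquet as pq

# A schema mismatch between parts only surfaces later, during aggregation,
# where it is harder to debug; checking the footers up front is cheap.
paths = sorted(glob.glob("clean/part-*.parquet"))
reference = pq.read_schema(paths[0])
for path in paths[1:]:
    if not pq.read_schema(path).equals(reference):
        raise ValueError(f"schema mismatch in {path}")
```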
Reusable checklist
- Confirm dataset size before loading
- Choose conservative chunk size
- Clean during read
- Write output incrementally
- Track peak memory
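The first and last checklist items are cheap to automate. A sketch assuming a Unix-like system and the illustrative `raw.csv` path:

```python
import os
import resource

# Confirm dataset size before loading; path is illustrative.
size_gb = os.path.getsize("raw.csv") / 1024**3
print(f"dataset size: {size_gb:.1f} GB")

# Track peak memory at checkpoints. On Linux ru_maxrss is reported in KiB,
# on macOS in bytes; the resource module is Unix-only.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak}")
```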
Evidence block
A short record that helps review and reproduce the result.
- Data size: ~100GB raw CSV
- Machine: 8 cores, 16GB RAM, SSD
- Goal: Clean, type-fix, and aggregate data
- Constraint: No cluster, no extra memory
- Checks used: dataset size confirmed, chunk size controlled, output written incrementally, peak memory tracked