CSV vs Parquet: what changes at scale
CSV works well until data grows large. Parquet promises speed and efficiency, but switching formats introduces trade-offs. This guide explains what actually changes at scale and how to choose safely.
The problem
Format decisions are often made early and forgotten. At small sizes this rarely matters. At large sizes, format choice affects memory usage, processing speed, storage cost, and failure rates.
Why CSV breaks down
- Entire rows must be read even when only a few columns are needed
- Column types are re-inferred on every read
- Compression is whole-file only, which compresses less effectively than columnar encoding and blocks selective reads
- Parallel processing is harder, because row boundaries cannot be found without scanning the file
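A minimal sketch of the first two points using pandas. The file name events.csv and the column names are placeholders, not part of this guide's data; the point is that column selection does not avoid a full scan, and types must be pinned by hand to avoid repeated inference.

```python
import pandas as pd

# "events.csv" is a hypothetical file used for illustration.
# usecols limits what ends up in the DataFrame, but the parser still has to
# scan every row, because CSV has no column layout to seek into.
df = pd.read_csv("events.csv", usecols=["user_id", "amount"])

# Types are inferred again on every read; spelling them out avoids the
# repeated inference cost, but the mapping must be maintained by hand.
df = pd.read_csv(
    "events.csv",
    usecols=["user_id", "amount"],
    dtype={"user_id": "int64", "amount": "float64"},
)
```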
Observation
CSV scales in file size, but not in processing efficiency.
What Parquet changes
- Column-based storage reduces unnecessary reads
- Data types are stored explicitly
- Compression works per column
- Partial reads become efficient
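A small sketch of column pruning and schema preservation, assuming the same hypothetical events.parquet file and pandas with a Parquet engine such as pyarrow installed:

```python
import pandas as pd

# Only the requested columns are read from disk; the other column chunks are
# skipped entirely, which is what makes partial reads cheap.
df = pd.read_parquet("events.parquet", columns=["user_id", "amount"])

# The schema travels with the file, so no type inference is needed.
print(df.dtypes)
```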
Side-by-side comparison
CSV
- Human-readable
- Easy to generate
- Slow for analytics
- High memory overhead
Parquet
- Binary format
- Requires tooling
- Fast column access
- Lower memory usage
When CSV is still fine
- One-time transfers
- Small to medium files
- Manual inspection
- Early exploration
When Parquet is the safer choice
- Repeated analytics
- Partial column reads
- Memory-constrained systems
- Long-term storage
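For repeated analytics over a large CSV, a one-off chunked conversion keeps peak memory bounded during migration. This is a sketch under assumed paths and chunk size, using pandas to stream the CSV and pyarrow to write Parquet:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CSV_PATH = "events.csv"          # hypothetical input
PARQUET_PATH = "events.parquet"  # hypothetical output

writer = None
# Stream the CSV in chunks so the whole file never sits in memory at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=500_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The first chunk fixes the schema for the whole Parquet file;
        # passing explicit dtypes to read_csv avoids per-chunk inference drift.
        writer = pq.ParquetWriter(PARQUET_PATH, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```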
Failure modes
- Writing Parquet too early, without a stable schema
- Assuming Parquet fixes poor data modelling
- Mixing incompatible readers and writers
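Before assuming every reader in a pipeline agrees with the writer, it helps to inspect what a file actually declares. A minimal check with pyarrow; the path is hypothetical:

```python
import pyarrow.parquet as pq

# Inspect the schema and file metadata that downstream readers will see.
schema = pq.read_schema("events.parquet")
print(schema)

meta = pq.read_metadata("events.parquet")
print(meta.created_by)                      # which writer produced the file
print(meta.num_rows, meta.num_row_groups)   # basic sanity figures
```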
Rule of thumb
Use CSV to move data. Use Parquet to work with data.
Decision checklist
- Is the data processed more than once?
- Are only some columns usually needed?
- Is memory usage a constraint?
- Is the schema stable?
- Is tooling available?
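The checklist can also be written down as a tiny helper for reviews or pipeline templates. This is a hypothetical sketch, not a library function; the parameter names simply mirror the questions above.

```python
def prefer_parquet(
    repeated_processing: bool,
    partial_column_reads: bool,
    memory_constrained: bool,
    schema_stable: bool,
    tooling_available: bool,
) -> bool:
    """Mirror of the checklist above: a rough yes/no, not a hard rule."""
    if not (schema_stable and tooling_available):
        # An unstable schema or missing tooling pushes back toward CSV for now.
        return False
    return repeated_processing or partial_column_reads or memory_constrained


# Example: repeated analytics over a wide table on a memory-limited worker.
print(prefer_parquet(True, True, True, True, True))  # -> True
```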