CSV vs Parquet: what changes at scale
CSV works well until data grows large. Parquet promises speed and efficiency, but switching formats introduces trade-offs. This guide explains what actually changes at scale and how to choose safely.
The problem
Format decisions are often made early and forgotten. At small sizes this rarely matters. At large sizes, format choice affects memory usage, processing speed, storage cost, and failure rates.
Why CSV breaks down
- Entire rows must be read even when only a few columns are needed
- Column types are re-inferred on every read
- Compression is whole-file only, which compresses less effectively than columnar encoding and blocks selective reads
- Parallel processing is harder, because row boundaries cannot be found without scanning the file
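A minimal sketch of the first two points using pandas. The file name events.csv and the column names are placeholders, not part of this guide's data; the point is that column selection does not avoid a full scan, and types must be pinned by hand to avoid repeated inference.

```python
import pandas as pd

# "events.csv" is a hypothetical file used for illustration.
# usecols limits what ends up in the DataFrame, but the parser still has to
# scan every row, because CSV has no column layout to seek into.
df = pd.read_csv("events.csv", usecols=["user_id", "amount"])

# Types are inferred again on every read; spelling them out avoids the
# repeated inference cost, but the mapping must be maintained by hand.
df = pd.read_csv(
    "events.csv",
    usecols=["user_id", "amount"],
    dtype={"user_id": "int64", "amount": "float64"},
)
```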
Observation
CSV scales in file size, but not in processing efficiency.
What Parquet changes
- Column-based storage reduces unnecessary reads
- Data types are stored explicitly
- Compression works per column
- Partial reads become efficient
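A small sketch of column pruning and schema preservation, assuming the same hypothetical events.parquet file and pandas with a Parquet engine such as pyarrow installed:

```python
import pandas as pd

# Only the requested columns are read from disk; the other column chunks are
# skipped entirely, which is what makes partial reads cheap.
df = pd.read_parquet("events.parquet", columns=["user_id", "amount"])

# The schema travels with the file, so no type inference is needed.
print(df.dtypes)
```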
Side-by-side comparison
CSV
- Human-readable
- Easy to generate
- Slow for analytics
- High memory overhead
Parquet
- Binary format
- Requires tooling
- Fast column access
- Lower memory usage
When CSV is still fine
- One-time transfers
- Small to medium files
- Manual inspection
- Early exploration
When Parquet is the safer choice
- Repeated analytics
- Partial column reads
- Memory-constrained systems
- Long-term storage
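For repeated analytics over a large CSV, a one-off chunked conversion keeps peak memory bounded during migration. This is a sketch under assumed paths and chunk size, using pandas to stream the CSV and pyarrow to write Parquet:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

CSV_PATH = "events.csv"          # hypothetical input
PARQUET_PATH = "events.parquet"  # hypothetical output

writer = None
# Stream the CSV in chunks so the whole file never sits in memory at once.
for chunk in pd.read_csv(CSV_PATH, chunksize=500_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The first chunk fixes the schema for the whole Parquet file;
        # passing explicit dtypes to read_csv avoids per-chunk inference drift.
        writer = pq.ParquetWriter(PARQUET_PATH, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```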
Failure modes
- Writing Parquet too early, without a stable schema
- Assuming Parquet fixes poor data modelling
- Mixing incompatible readers and writers
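Before assuming every reader in a pipeline agrees with the writer, it helps to inspect what a file actually declares. A minimal check with pyarrow; the path is hypothetical:

```python
import pyarrow.parquet as pq

# Inspect the schema and file metadata that downstream readers will see.
schema = pq.read_schema("events.parquet")
print(schema)

meta = pq.read_metadata("events.parquet")
print(meta.created_by)                      # which writer produced the file
print(meta.num_rows, meta.num_row_groups)   # basic sanity figures
```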
Rule of thumb
Use CSV to move data. Use Parquet to work with data.
Decision checklist
- Is the data processed more than once?
- Are only some columns usually needed?
- Is memory usage a constraint?
- Is the schema stable?
- Is tooling available?
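The checklist can also be written down as a tiny helper for reviews or pipeline templates. This is a hypothetical sketch, not a library function; the parameter names simply mirror the questions above.

```python
def prefer_parquet(
    repeated_processing: bool,
    partial_column_reads: bool,
    memory_constrained: bool,
    schema_stable: bool,
    tooling_available: bool,
) -> bool:
    """Mirror of the checklist above: a rough yes/no, not a hard rule."""
    if not (schema_stable and tooling_available):
        # An unstable schema or missing tooling pushes back toward CSV for now.
        return False
    return repeated_processing or partial_column_reads or memory_constrained


# Example: repeated analytics over a wide table on a memory-limited worker.
print(prefer_parquet(True, True, True, True, True))  # -> True
```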