Pipeline review checklist

Use this before running a pipeline on large data. It is designed to prevent common failures: memory growth, unstable types, slow steps, and silent output errors.

How to use
Tick items quickly. Any “No” is a reason to pause and fix first. One avoided rerun can save hours.
Tip: run a small test slice first, but keep the same steps as the full run.
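
A minimal sketch of the test-slice idea, assuming a pandas pipeline wrapped in a single run_steps() function (a placeholder name, as is the "id" column); the slice differs only in how many rows are read, never in which steps are applied.

    import pandas as pd

    def run_steps(df: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for the real pipeline steps; the test slice must go
        # through exactly the same function as the full run.
        return df.dropna(subset=["id"])

    # Test slice: read only the first 10,000 rows, then apply the same steps.
    slice_df = pd.read_csv("input.csv", nrows=10_000)
    checked = run_steps(slice_df)
    print(checked.shape, checked.dtypes.to_dict())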

Input sanity

  • File size and row/column count are known. Do not guess; large files behave differently once parsed.
  • Schema is decided (types are explicit). Avoid repeated type inference and mixed types across chunks.
  • Bad rows strategy exists. Decide: drop, quarantine, or fix. Do not crash on the first bad record (see the sketch after this list).
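
A sketch of these input checks, assuming a pandas CSV input; the path, the dtype map, and the quarantine list are illustrative. In pandas 1.4+, on_bad_lines accepts a callable (python engine only), which lets malformed rows be collected instead of crashing the run.

    import os
    import pandas as pd

    PATH = "input.csv"                                                # placeholder path
    DTYPES = {"id": "int64", "amount": "float64", "label": "string"}  # explicit schema

    # Know the size before parsing anything.
    print(f"{os.path.getsize(PATH) / 1e6:.1f} MB on disk")

    quarantined = []  # malformed rows are kept for inspection, not silently lost

    def quarantine(bad_line):
        # Called by the parser for each malformed row; returning None skips it.
        quarantined.append(bad_line)
        return None

    df = pd.read_csv(
        PATH,
        dtype=DTYPES,            # no repeated type inference across chunks
        on_bad_lines=quarantine,
        engine="python",         # a callable on_bad_lines requires the python engine
    )
    print(df.shape, f"{len(quarantined)} rows quarantined")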

Memory and compute

  • Processing is chunked or streamed. Full-load should be the exception, not the default.
  • Peak memory is monitored. Track memory during the run; growth over time means retention.
  • Chunk size is conservative. Smaller chunks are slower but stable, and stability wins (see the sketch after this list).
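
A sketch of chunked processing with a coarse in-process memory check, assuming the same pandas CSV input; the chunk size, the placeholder step, and the use of tracemalloc from the standard library are assumptions. A peak that keeps climbing chunk after chunk usually means something is being retained.

    import tracemalloc
    import pandas as pd

    CHUNK_ROWS = 100_000  # conservative; raise it only once the run is stable
    tracemalloc.start()

    for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=CHUNK_ROWS)):
        result = chunk.dropna(subset=["id"])  # placeholder step on an assumed column
        # ... process or write `result` here, then let it go out of scope

        current, peak = tracemalloc.get_traced_memory()
        print(f"chunk {i}: current {current / 1e6:.0f} MB, peak {peak / 1e6:.0f} MB")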

Output correctness

  • Output is written incrementally. Write per chunk; avoid holding large intermediate objects.
  • There is a validation step. Row counts, null rates, and basic distributions should be checked.
  • Reproducibility is possible. Record versions, parameters, and run settings. Avoid “it worked once” (see the sketch after this list).
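
A sketch of incremental, per-chunk output, assuming chunked CSV in and appended CSV out; the paths and the cleaning step are placeholders. Counting rows on the way in and out gives the validation step something concrete to compare against.

    import pandas as pd

    IN_PATH, OUT_PATH = "input.csv", "output.csv"   # placeholder paths
    rows_in = rows_out = 0

    for i, chunk in enumerate(pd.read_csv(IN_PATH, chunksize=100_000)):
        rows_in += len(chunk)
        cleaned = chunk.dropna(subset=["id"])       # placeholder cleaning step
        rows_out += len(cleaned)

        # Append per chunk; write the header only once, with the first chunk.
        cleaned.to_csv(OUT_PATH, mode="w" if i == 0 else "a",
                       header=(i == 0), index=False)

    print(f"{rows_in} rows in, {rows_out} rows out")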

A minimal validation is enough: confirm row count is plausible, confirm key columns have expected types, and confirm no major shift in null rates after cleaning.
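
A sketch of that minimal validation, assuming the output can be re-read (or that the same checks run per chunk); the row bound, the key columns, and the 5% null-rate threshold are illustrative assumptions, not fixed rules.

    import pandas as pd

    EXPECTED_MAX_ROWS = 1_000_000                          # illustrative bound from the input count
    EXPECTED_TYPES = {"id": "int64", "amount": "float64"}  # assumed key columns

    out = pd.read_csv("output.csv")

    # 1. Row count is plausible.
    assert 0 < len(out) <= EXPECTED_MAX_ROWS, f"implausible row count: {len(out)}"

    # 2. Key columns have the expected types.
    for col, typ in EXPECTED_TYPES.items():
        assert str(out[col].dtype) == typ, f"{col} is {out[col].dtype}, expected {typ}"

    # 3. No major shift in null rates after cleaning (threshold is illustrative).
    null_rates = out.isna().mean()
    print(null_rates.round(3))
    assert (null_rates < 0.05).all(), "unexpectedly high null rate"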
