Pipeline review checklist

Use this before running a pipeline on large data. It is designed to prevent common failures: memory growth, unstable types, slow steps, and silent output errors.

How to use
Tick items quickly. Any “No” is a reason to pause and fix first. One avoided rerun can save hours.
Tip: run a small test slice first, but keep the same steps as the full run.
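
A minimal sketch of the test-slice idea, assuming a pandas pipeline wrapped in a single run_steps() function (a placeholder name, as is the "id" column); the slice differs only in how many rows are read, never in which steps are applied.

    import pandas as pd

    def run_steps(df: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for the real pipeline steps; the test slice must go
        # through exactly the same function as the full run.
        return df.dropna(subset=["id"])

    # Test slice: read only the first 10,000 rows, then apply the same steps.
    slice_df = pd.read_csv("input.csv", nrows=10_000)
    checked = run_steps(slice_df)
    print(checked.shape, checked.dtypes.to_dict())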

Input sanity

  • File size and row/column count are known. Do not guess; large files behave differently once parsed.
  • Schema is decided (types are explicit). Avoid repeated type inference and mixed types across chunks.
  • Bad rows strategy exists. Decide: drop, quarantine, or fix. Do not crash on the first bad record (see the sketch after this list).
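
A sketch of these input checks, assuming a pandas CSV input; the path, the dtype map, and the quarantine list are illustrative. In pandas 1.4+, on_bad_lines accepts a callable (python engine only), which lets malformed rows be collected instead of crashing the run.

    import os
    import pandas as pd

    PATH = "input.csv"                                                # placeholder path
    DTYPES = {"id": "int64", "amount": "float64", "label": "string"}  # explicit schema

    # Know the size before parsing anything.
    print(f"{os.path.getsize(PATH) / 1e6:.1f} MB on disk")

    quarantined = []  # malformed rows are kept for inspection, not silently lost

    def quarantine(bad_line):
        # Called by the parser for each malformed row; returning None skips it.
        quarantined.append(bad_line)
        return None

    df = pd.read_csv(
        PATH,
        dtype=DTYPES,            # no repeated type inference across chunks
        on_bad_lines=quarantine,
        engine="python",         # a callable on_bad_lines requires the python engine
    )
    print(df.shape, f"{len(quarantined)} rows quarantined")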

Memory and compute

  • Processing is chunked or streamed. Full-load should be the exception, not the default.
  • Peak memory is monitored. Track memory during the run; growth over time means retention.
  • Chunk size is conservative. Smaller chunks are slower but stable, and stability wins (see the sketch after this list).
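
A sketch of chunked processing with a coarse in-process memory check, assuming the same pandas CSV input; the chunk size, the placeholder step, and the use of tracemalloc from the standard library are assumptions. A peak that keeps climbing chunk after chunk usually means something is being retained.

    import tracemalloc
    import pandas as pd

    CHUNK_ROWS = 100_000  # conservative; raise it only once the run is stable
    tracemalloc.start()

    for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=CHUNK_ROWS)):
        result = chunk.dropna(subset=["id"])  # placeholder step on an assumed column
        # ... process or write `result` here, then let it go out of scope

        current, peak = tracemalloc.get_traced_memory()
        print(f"chunk {i}: current {current / 1e6:.0f} MB, peak {peak / 1e6:.0f} MB")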

Output correctness

  • Output is written incrementally. Write per chunk; avoid holding large intermediate objects.
  • There is a validation step. Row counts, null rates, and basic distributions should be checked.
  • Reproducibility is possible. Record versions, parameters, and run settings. Avoid “it worked once” (see the sketch after this list).
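
A sketch of incremental, per-chunk output, assuming chunked CSV in and appended CSV out; the paths and the cleaning step are placeholders. Counting rows on the way in and out gives the validation step something concrete to compare against.

    import pandas as pd

    IN_PATH, OUT_PATH = "input.csv", "output.csv"   # placeholder paths
    rows_in = rows_out = 0

    for i, chunk in enumerate(pd.read_csv(IN_PATH, chunksize=100_000)):
        rows_in += len(chunk)
        cleaned = chunk.dropna(subset=["id"])       # placeholder cleaning step
        rows_out += len(cleaned)

        # Append per chunk; write the header only once, with the first chunk.
        cleaned.to_csv(OUT_PATH, mode="w" if i == 0 else "a",
                       header=(i == 0), index=False)

    print(f"{rows_in} rows in, {rows_out} rows out")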

A minimal validation is enough: confirm row count is plausible, confirm key columns have expected types, and confirm no major shift in null rates after cleaning.
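
A sketch of that minimal validation, assuming the output can be re-read (or that the same checks run per chunk); the row bound, the key columns, and the 5% null-rate threshold are illustrative assumptions, not fixed rules.

    import pandas as pd

    EXPECTED_MAX_ROWS = 1_000_000                          # illustrative bound from the input count
    EXPECTED_TYPES = {"id": "int64", "amount": "float64"}  # assumed key columns

    out = pd.read_csv("output.csv")

    # 1. Row count is plausible.
    assert 0 < len(out) <= EXPECTED_MAX_ROWS, f"implausible row count: {len(out)}"

    # 2. Key columns have the expected types.
    for col, typ in EXPECTED_TYPES.items():
        assert str(out[col].dtype) == typ, f"{col} is {out[col].dtype}, expected {typ}"

    # 3. No major shift in null rates after cleaning (threshold is illustrative).
    null_rates = out.isna().mean()
    print(null_rates.round(3))
    assert (null_rates < 0.05).all(), "unexpectedly high null rate"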
