Pipeline review checklist
Use this before running a pipeline on large data. It is designed to prevent common failures: memory growth, unstable types, slow steps, and silent output errors.
How to use
Tick items quickly. Any “No” is a reason to pause and fix first. One avoided rerun can save hours.
Tip: run a small test slice first, but keep the same steps as the full run.
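For example, a minimal sketch in Python with pandas, where `input.csv`, `SLICE_ROWS`, and `process` are placeholders for your own file and steps; only the read changes, the processing stays identical:

```python
import pandas as pd

SLICE_ROWS = 10_000  # hypothetical slice size

def process(df: pd.DataFrame) -> pd.DataFrame:
    # ... exactly the same steps the full run will use ...
    return df

# Only the read differs between the test slice and the full run.
sample = pd.read_csv("input.csv", nrows=SLICE_ROWS)
result = process(sample)
```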
Input sanity
- File size and row/column count are known. Do not guess: large files behave differently once parsed.
- Schema is decided (types are explicit). Avoid repeated type inference and mixed types across chunks.
- A bad-rows strategy exists. Decide whether to drop, quarantine, or fix; do not crash on the first bad record (see the sketch after this list).
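A minimal sketch of an explicit schema plus a quarantine strategy, assuming a CSV input read with pandas; the column names, dtypes, and quarantine handling are illustrative, not prescriptive:

```python
import pandas as pd

# Explicit types: no inference, no type drift between chunks.
DTYPES = {"user_id": "int64", "amount": "float64", "country": "string"}

bad_rows = []

def quarantine(bad_line: list[str]) -> None:
    # Called by pandas for each malformed line instead of raising;
    # returning None skips the line, and we keep the raw fields for review.
    bad_rows.append(bad_line)

df = pd.read_csv(
    "input.csv",
    dtype=DTYPES,
    on_bad_lines=quarantine,  # quarantine rather than crash on the first bad record
    engine="python",          # a callable on_bad_lines requires the python engine
)
```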
Memory and compute
- Processing is chunked or streamed. Loading the whole file at once should be the exception, not the default.
- Peak memory is monitored. Track memory during the run: growth over time means something is being retained.
- Chunk size is conservative. Smaller chunks are slower but more stable, and stability wins (see the sketch after this list).
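A minimal sketch of chunked reading with memory tracking, assuming a CSV input; `CHUNK_ROWS` is a placeholder, and psutil is an optional dependency used here only to watch resident memory:

```python
import os

import pandas as pd
import psutil

CHUNK_ROWS = 100_000  # conservative: slower, but stable
proc = psutil.Process(os.getpid())

for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=CHUNK_ROWS)):
    # ... process the chunk, write it out, then let it go ...
    rss_mib = proc.memory_info().rss / 2**20
    print(f"chunk {i}: rss={rss_mib:.0f} MiB")  # steady growth here means retention
```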
Output correctness
- Output is written incrementally. Write once per chunk and avoid holding large intermediate objects.
- There is a validation step. Check row counts, null rates, and basic distributions.
- Reproducibility is possible. Record versions, parameters, and run settings; avoid “it worked once” (see the sketch after this list).
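A minimal sketch of incremental writing plus a run-settings record, assuming CSV in and CSV out; the paths, parameters, and `dropna` cleaning step are placeholders for your own pipeline:

```python
import json

import pandas as pd

PARAMS = {"input": "input.csv", "output": "output.csv", "chunksize": 100_000}

# Record versions and settings next to the output for reproducibility.
with open("run_settings.json", "w") as f:
    json.dump({"pandas_version": pd.__version__, **PARAMS}, f, indent=2)

for i, chunk in enumerate(pd.read_csv(PARAMS["input"], chunksize=PARAMS["chunksize"])):
    cleaned = chunk.dropna()  # stand-in for the real cleaning step
    cleaned.to_csv(
        PARAMS["output"],
        mode="w" if i == 0 else "a",  # header once, then append per chunk
        header=(i == 0),
        index=False,
    )
```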
A minimal validation is enough: confirm row count is plausible, confirm key columns have expected types, and confirm no major shift in null rates after cleaning.
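That check could look like the sketch below; the expected row count, key columns, and thresholds are assumptions you would set from your own data:

```python
import pandas as pd

def validate(df: pd.DataFrame, expected_rows: int, key_dtypes: dict[str, str]) -> None:
    # Row count is plausible (within 1% of expectation here; pick your own bound).
    assert abs(len(df) - expected_rows) / expected_rows < 0.01, "row count off"
    # Key columns have the expected types.
    for col, dtype in key_dtypes.items():
        assert str(df[col].dtype) == dtype, f"{col}: {df[col].dtype} != {dtype}"
    # Null rates have not shifted badly after cleaning (under 5% per key column here).
    null_rates = df[list(key_dtypes)].isna().mean()
    assert (null_rates < 0.05).all(), f"null rates too high:\n{null_rates}"
```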