Sampling validation checklist
Sampling reduces cost and time, but it can change what the data shows: groups can vanish, distributions can shift, and downstream results can move. Use this checklist to confirm a sample is safe before using it for analysis, modelling, or pipeline testing.
How to use
Run these checks on the sample versus the full data (or a trusted reference slice). If full data is unavailable, compare multiple samples and check stability.
Tip: set a fixed random seed and record the sampling method and parameters.
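One way to follow this tip is to draw the sample from a seeded generator and return a small metadata record alongside it. The sketch below is illustrative only: `draw_sample`, the field names in `record`, and the integer stand-in rows are assumptions, not part of any particular tool.

```python
import json
import random


def draw_sample(rows, sample_size, seed):
    """Draw a reproducible simple random sample and return it with its metadata."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    sample = rng.sample(rows, sample_size)  # sampling without replacement
    record = {
        "method": "simple_random_without_replacement",
        "seed": seed,
        "sample_size": sample_size,
        "population_size": len(rows),
    }
    return sample, record


rows = list(range(1000))  # hypothetical stand-in for real rows
sample_a, record = draw_sample(rows, sample_size=100, seed=42)
sample_b, _ = draw_sample(rows, sample_size=100, seed=42)

assert sample_a == sample_b  # same seed and inputs give the same sample
print(json.dumps(record, indent=2))
```

Storing the record next to the sample (for example as a JSON file) makes it possible to regenerate the same sample later, provided the input rows and their order have not changed.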
Coverage
- Key groups are present: for important categories, confirm none disappear in the sample.
- Rare cases are retained: if rare cases matter, use stratified or weighted sampling.
- Time coverage is preserved: if data is time-based, confirm all periods are represented.
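The coverage checks above can be sketched as a set difference between the values seen in the full data and those seen in the sample. The helper name `missing_values`, the `segment` and `month` fields, and the rows are all made up for illustration.

```python
def missing_values(full_rows, sample_rows, key):
    """Return the values of `key` that occur in the full data but not in the sample."""
    full_values = {row[key] for row in full_rows}
    sample_values = {row[key] for row in sample_rows}
    return full_values - sample_values


# Hypothetical rows: a category column and a time-period column.
full = [
    {"segment": "retail", "month": "2024-01"},
    {"segment": "retail", "month": "2024-02"},
    {"segment": "wholesale", "month": "2024-02"},
    {"segment": "partner", "month": "2024-03"},
]
sample = [full[0], full[2]]

print(missing_values(full, sample, "segment"))  # {'partner'}
print(missing_values(full, sample, "month"))    # {'2024-03'}
```

An empty set for every key column and for the time-period column means the sample passes the presence checks; it says nothing yet about whether the proportions are right.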
Distributions
- Numeric columns match roughly: compare mean, median, spread, and extreme values.
- Categorical proportions are stable: large shifts often mean sampling missed structure.
- Null rates are similar: null shifts often change downstream logic and model signals.
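A minimal sketch of these three comparisons, using only the standard library. The function names and the example values are assumptions, and what counts as "roughly matching" is left to the reader because it depends on the use case.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarise the non-null values; assumes at least two of them."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def null_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def max_proportion_shift(full_values, sample_values):
    """Largest absolute difference in category share between full data and sample."""
    full_counts, sample_counts = Counter(full_values), Counter(sample_values)
    categories = set(full_counts) | set(sample_counts)
    return max(
        abs(full_counts[c] / len(full_values) - sample_counts[c] / len(sample_values))
        for c in categories
    )


# Hypothetical column values: the sample has lost the nulls and the extreme value.
full_amounts = [10, 12, None, 15, 200, 11, None, 13]
sample_amounts = [10, 15, 11, 13]
print(numeric_summary(full_amounts))
print(numeric_summary(sample_amounts))
print(null_rate(full_amounts), null_rate(sample_amounts))  # 0.25 0.0

full_segments = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
sample_segments = ["a"] * 4 + ["b"] * 1
print(round(max_proportion_shift(full_segments, sample_segments), 3))
```

In this made-up example the sample drops both nulls and the largest value, which is exactly the kind of shift the checklist is meant to surface before the sample is used downstream.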
Stability across runs
- Results do not swing across samples: generate multiple samples and check variance in outputs.
- Downstream outputs look consistent: run the same pipeline step(s): counts, joins, aggregates, or a small model.
- Sampling method is recorded: record seed, method, stratification fields, and sample size.
If results change a lot between samples, the sample is not safe yet. Increase sample size, switch sampling method, or define stratification keys.
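The stability check can be sketched by drawing one sample per seed, computing the same output on each, and looking at how much that output varies. Everything here is illustrative: `estimate_spread` is a hypothetical helper, the mean stands in for whatever downstream output matters, and the skewed population is synthetic.

```python
import random
import statistics


def estimate_spread(values, sample_size, seeds, statistic=statistics.mean):
    """Compute `statistic` on one sample per seed and report how much it varies."""
    estimates = []
    for seed in seeds:
        rng = random.Random(seed)
        estimates.append(statistic(rng.sample(values, sample_size)))
    centre = statistics.mean(estimates)
    spread = statistics.stdev(estimates)
    return {
        "estimates": estimates,
        "mean": centre,
        "stdev": spread,
        "relative_spread": spread / abs(centre) if centre else float("inf"),
    }


# Synthetic, skewed population so that small samples are visibly unstable.
population_rng = random.Random(0)
values = [population_rng.lognormvariate(0, 1) for _ in range(10_000)]

small = estimate_spread(values, sample_size=50, seeds=range(20))
large = estimate_spread(values, sample_size=2_000, seeds=range(20))

print("relative spread, n=50:  ", round(small["relative_spread"], 3))
print("relative spread, n=2000:", round(large["relative_spread"], 3))
```

Comparing the relative spread at two sample sizes shows whether increasing the size is enough to stabilise the output, or whether the sampling method itself needs to change.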
Quick decision rules
- If rare cases matter: avoid pure random sampling; use stratified or weighted sampling.
- If time matters: sample across time windows, not only across rows.
- If outputs swing: increase size or change method until results stabilise.
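The first two rules can be sketched with one stratified sampler: the stratification key is a category when rare cases matter, or a time window when time matters. This is a minimal sketch under stated assumptions; `stratified_sample`, its `min_per_stratum` parameter, and the example rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed, min_per_stratum=1):
    """Sample `fraction` of each stratum, keeping at least `min_per_stratum` rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for stratum in sorted(strata):  # sorted so the draw order is deterministic
        members = strata[stratum]
        k = max(min_per_stratum, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical data: 1% of rows belong to a rare class.
rows = [{"id": i, "label": "common"} for i in range(990)]
rows += [{"id": 990 + i, "label": "rare"} for i in range(10)]

sample = stratified_sample(rows, key="label", fraction=0.01, seed=7)
print(Counter(row["label"] for row in sample))
```

A pure 1% random draw of these rows could easily contain no rare rows at all, while the floor of one row per stratum guarantees the rare class is present. Because the floor oversamples small strata, overall averages computed from such a sample need reweighting. Passing a time-window field (for example a month column) as `key` gives a sample spread across periods rather than only across rows.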