Sampling validation checklist

Sampling reduces cost and time, but it can change behaviour. Use this checklist to confirm a sample is safe before using it for analysis, modelling, or pipeline testing.

How to use
Run these checks on the sample versus the full data (or a trusted reference slice). If full data is unavailable, compare multiple samples and check stability.
Tip: set a fixed random seed and record the sampling method and parameters.
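For example, a minimal sketch of a reproducible draw with pandas, assuming a DataFrame named df; the column names, the 10% fraction, and the seed value are illustrative:

    import pandas as pd

    # Illustrative data; replace with the real table.
    df = pd.DataFrame({
        "segment": ["a", "a", "b", "b", "c"] * 200,
        "value": range(1000),
    })

    # Record the sampling parameters alongside the sample itself.
    SAMPLE_PARAMS = {"method": "simple_random", "frac": 0.10, "seed": 42}

    # A fixed seed makes the draw reproducible across runs.
    sample = df.sample(frac=SAMPLE_PARAMS["frac"], random_state=SAMPLE_PARAMS["seed"])
    print(SAMPLE_PARAMS, "rows:", len(sample))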

Coverage

  • Key groups are present: for important categories, confirm none disappear in the sample.
  • Rare cases are retained: if rare cases matter, use stratified or weighted sampling (see the sketch after this list).
  • Time coverage is preserved: if data is time-based, confirm all periods are represented.
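A sketch of these coverage checks, assuming pandas DataFrames named full and sample; the helper names, the categorical and date columns, and the monthly granularity are illustrative:

    import pandas as pd

    def missing_groups(full: pd.DataFrame, sample: pd.DataFrame, col: str) -> set:
        """Categories present in the full data but absent from the sample."""
        return set(full[col].unique()) - set(sample[col].unique())

    def missing_periods(full: pd.DataFrame, sample: pd.DataFrame,
                        date_col: str, freq: str = "M") -> set:
        """Time periods (monthly by default) present in the full data but absent from the sample."""
        full_periods = set(pd.to_datetime(full[date_col]).dt.to_period(freq))
        sample_periods = set(pd.to_datetime(sample[date_col]).dt.to_period(freq))
        return full_periods - sample_periods

    def stratified_sample(full: pd.DataFrame, col: str, frac: float, seed: int) -> pd.DataFrame:
        """Draw the same fraction from every group so rare categories survive."""
        return full.groupby(col).sample(frac=frac, random_state=seed)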

Distributions

  • Numeric columns match roughly: compare mean, median, spread, and extreme values (comparison helpers are sketched after this list).
  • Categorical proportions are stable: large shifts often mean sampling missed structure.
  • Null rates are similar: null shifts often change downstream logic and model signals.
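One way to run these comparisons, as a sketch assuming pandas objects named full and sample; the function names and the choice of statistics are illustrative:

    import pandas as pd

    def numeric_summary(full: pd.Series, sample: pd.Series) -> pd.DataFrame:
        """Key summary statistics for a numeric column, side by side."""
        stats = ["mean", "50%", "std", "min", "max"]
        return pd.DataFrame({"full": full.describe().loc[stats],
                             "sample": sample.describe().loc[stats]})

    def proportion_shift(full: pd.Series, sample: pd.Series) -> pd.Series:
        """Absolute change in category share between sample and full data."""
        return (sample.value_counts(normalize=True)
                      .sub(full.value_counts(normalize=True), fill_value=0)
                      .abs()
                      .sort_values(ascending=False))

    def null_rate_shift(full: pd.DataFrame, sample: pd.DataFrame) -> pd.Series:
        """Absolute difference in per-column null rates."""
        return (sample.isna().mean() - full.isna().mean()).abs()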

Stability across runs

  • Results do not swing across samples: generate multiple samples and check variance in outputs (see the sketch after this list).
  • Downstream outputs look consistent: run the same pipeline steps, such as counts, joins, aggregates, or a small model.
  • Sampling method is recorded: note the seed, method, stratification fields, and sample size.
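A sketch of a stability check across seeds, assuming the hypothetical df from earlier with a numeric value column; the seed list, the 10% fraction, and the mean aggregate are stand-ins for the real pipeline step:

    import pandas as pd

    def aggregate_spread(full: pd.DataFrame, frac: float, seeds: list) -> float:
        """Re-run the same downstream aggregate on several samples and
        return the coefficient of variation as a rough swing indicator."""
        # The mean of a "value" column stands in for any downstream output.
        results = pd.Series(
            [full.sample(frac=frac, random_state=s)["value"].mean() for s in seeds],
            index=seeds,
        )
        return float(results.std() / results.mean())

    # Hypothetical call: aggregate_spread(df, frac=0.10, seeds=[1, 2, 3, 4, 5])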

If results change a lot between samples, the sample is not safe yet. Increase sample size, switch sampling method, or define stratification keys.

Quick decision rules

  • If rare cases matter: avoid pure random sampling; use stratified or weighted sampling.
  • If time matters: sample across time windows, not only across rows (see the sketch after this list).
  • If outputs swing: increase size or change method until results stabilise.
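For the time-based rule, a sketch of sampling across windows rather than rows, assuming a date column; the monthly window and the function name are illustrative:

    import pandas as pd

    def sample_by_window(full: pd.DataFrame, date_col: str, frac: float, seed: int) -> pd.DataFrame:
        """Draw the same fraction from every calendar month so no period is dropped."""
        window = pd.to_datetime(full[date_col]).dt.to_period("M")
        return full.groupby(window).sample(frac=frac, random_state=seed)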
