Case Note

Schema drift: why pipelines break silently

Schema drift happens when an upstream input changes shape or meaning. The pipeline may still run, but its outputs are quietly wrong. This note covers the usual failure path, the checks that catch it, and a simple fix pattern.

Quick summary

  • Trigger: Upstream adds, removes, renames, or re-types fields
  • Common failure: Silent nulls, wrong parsing, wrong joins, wrong aggregates
  • First signal: Row-count shifts, new null rates, new category values, type warnings
  • Fix pattern: Schema contract + validation checks + controlled change

What failed first

The first break is often not a crash. A type change (for example, int to string) may still parse, but it changes sorting, grouping, and join keys. A new column may shift positional parsing. A renamed field may simply arrive as null. Typical first signals (a sketch of the join-key case follows the list):

  • Join key stops matching (more unmatched rows)
  • Null rate increases in one or more fields
  • New category values appear (or old ones disappear)
  • Aggregates change without a business reason
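
A minimal sketch of the join-key failure in plain Python; the ids are hypothetical. Yesterday the feed parsed ids as ints, today the same ids arrive as strings, and everything still runs:

    # Yesterday's feed parsed ids as ints; today's feed delivers strings.
    old_keys = {1, 2, 10}
    new_keys = {"1", "2", "10"}

    # Lookups silently miss: int 1 != str "1".
    print(1 in new_keys)        # False -> every join probe misses
    print(old_keys & new_keys)  # set() -> zero matching join keys

    # Ordering changes too: lexicographic sort puts "10" before "2".
    print(sorted([1, 2, 10]))        # [1, 2, 10]
    print(sorted(["1", "2", "10"]))  # ['1', '10', '2']

No exception is raised at any point; the only visible symptom is the unmatched-row count.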

Root causes

  • No schema contract (expected fields and types are not written down)
  • Loose parsing rules (best-effort type inference on each run; see the sketch after this list)
  • Joining on text without normalisation
  • Backfills that mix old and new formats in the same partition
  • Multiple producers writing the same dataset
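
A minimal sketch of the loose-parsing problem, assuming pandas; the column names and the single bad row are hypothetical. The same column comes back as a different type depending on the day's data, and pinning the dtype turns the drift into a loud failure:

    import io
    import pandas as pd

    clean   = "user_id,amount\n1,5.0\n2,7.5\n"
    drifted = "user_id,amount\n1,5.0\nN/A,7.5\n"  # one bad row upstream

    # Best-effort inference: the same column re-types itself between runs.
    print(pd.read_csv(io.StringIO(clean))["user_id"].dtype)    # int64
    print(pd.read_csv(io.StringIO(drifted))["user_id"].dtype)  # float64 ("N/A" -> NaN)

    # Pinning the dtype fails fast instead of silently re-typing.
    try:
        pd.read_csv(io.StringIO(drifted), dtype={"user_id": "int64"})
    except ValueError as exc:
        print(f"caught drift at parse time: {exc}")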

Fix pattern that works

  1. Freeze an expected schema (fields, types, allowed nulls, allowed ranges)
  2. Validate before heavy work (fail fast or route to a quarantine table; sketched after this list)
  3. Version changes (new schema version, clear cut-over date)
  4. Backfill safely (do not mix formats inside one partition)
  5. Keep a rollback path (previous output or a known-good snapshot)
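
A minimal sketch of steps 1 and 2, assuming pandas; the contract fields, types, and the validate helper are hypothetical names chosen for illustration:

    import pandas as pd

    # Step 1: frozen contract -- field -> (expected dtype, nulls allowed).
    CONTRACT = {
        "user_id": ("int64", False),
        "amount":  ("float64", False),
        "country": ("object", True),
    }

    def validate(df: pd.DataFrame) -> list[str]:
        """Return contract violations; an empty list means the batch passes."""
        problems = []
        missing = set(CONTRACT) - set(df.columns)
        extra = set(df.columns) - set(CONTRACT)
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        if extra:
            problems.append(f"unexpected fields: {sorted(extra)}")
        for field, (dtype, nullable) in CONTRACT.items():
            if field not in df.columns:
                continue
            if str(df[field].dtype) != dtype:
                problems.append(f"{field}: expected {dtype}, got {df[field].dtype}")
            if not nullable and df[field].isna().any():
                problems.append(f"{field}: nulls not allowed")
        return problems

    # Step 2: validate before heavy work -- fail fast or quarantine.
    batch = pd.DataFrame({"user_id": [1, 2], "amount": [5.0, 7.5]})
    violations = validate(batch)
    if violations:
        # Alternative: write the batch to a quarantine table and alert.
        raise ValueError("; ".join(violations))

Here the batch is missing the country field, so the run stops before any join, with a message naming the exact violation.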

Checks that catch schema drift

  • Schema check: expected field set and expected types
  • Row count check: day-over-day change within a bound
  • Null-rate check on key fields
  • Join quality check: match rate and duplicate key rate
  • Value checks: allowed ranges and allowed category sets
  • Sample diff: a small stable sample compared across runs

Checks are cheap. Run them before expensive joins and wide shuffles.
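
A minimal sketch of four of these checks, assuming pandas; the thresholds and field names are illustrative, not recommendations:

    import pandas as pd

    def row_count_ok(today: int, yesterday: int, max_change: float = 0.2) -> bool:
        """Day-over-day row-count change stays within a bound."""
        if yesterday == 0:
            return today == 0
        return abs(today - yesterday) / yesterday <= max_change

    def null_rate_ok(df: pd.DataFrame, field: str, max_rate: float = 0.01) -> bool:
        """Null rate on a key field stays under a threshold."""
        return df[field].isna().mean() <= max_rate

    def join_quality(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
        """Match rate and duplicate-key rate, measured before the real join."""
        return {
            "match_rate": left[key].isin(right[key]).mean(),
            "duplicate_key_rate": right[key].duplicated().mean(),
        }

    def categories_ok(df: pd.DataFrame, field: str, allowed: set) -> bool:
        """No category values outside the allowed set."""
        return set(df[field].dropna().unique()) <= allowed

Each check reads a single column or a pair of key columns, so all of them can run before any join or shuffle.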

Evidence to record

This is the minimum record needed to reproduce and review the change; a sketch of one record shape follows the list.

  • Input location and partition used
  • Schema before and after (fields and types)
  • Which checks fired (and by how much)
  • Impact on output (rows, nulls, join match rate)
  • Fix applied (contract, cast rules, version change)
  • Rollback plan (what to restore if needed)
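
A minimal sketch of one evidence record as JSON; every field name and value here is hypothetical:

    import json

    evidence = {
        "input_partition": "events/dt=<run-date>",    # illustrative location
        "schema_before": {"user_id": "int64"},
        "schema_after": {"user_id": "string"},
        "checks_fired": {"null_rate.user_id": 0.31},  # check -> observed value
        "output_impact": {"rows": -1200, "join_match_rate": 0.62},
        "fix_applied": "cast user_id to int64 at ingest; contract v2",
        "rollback": "restore previous output snapshot",
    }
    print(json.dumps(evidence, indent=2))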

Next steps