Case Note

Schema drift: why pipelines break silently

Schema drift happens when an upstream input changes shape or meaning. The pipeline may still run, but its outputs are quietly wrong. This note covers the usual failure path, the checks that catch it, and a simple fix pattern.

Quick summary

  • Trigger: Upstream adds, removes, renames, or re-types fields
  • Common failure: Silent nulls, wrong parsing, wrong joins, wrong aggregates
  • First signal: Row-count shifts, new null rates, new category values, type warnings
  • Fix pattern: Schema contract + validation checks + controlled change

What failed first

The first break is often not a crash. A type change (for example, int to string) may still parse, but it changes sorting, grouping, and join keys. A new column may shift positional parsing. A renamed field may simply arrive as null. Typical first signals (a sketch of the join-key case follows the list):

  • Join key stops matching (more unmatched rows)
  • Null rate increases in one or more fields
  • New category values appear (or old ones disappear)
  • Aggregates change without a business reason
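
A minimal sketch of the join-key failure in plain Python; the ids are hypothetical. Yesterday the feed parsed ids as ints, today the same ids arrive as strings, and everything still runs:

    # Yesterday's feed parsed ids as ints; today's feed delivers strings.
    old_keys = {1, 2, 10}
    new_keys = {"1", "2", "10"}

    # Lookups silently miss: int 1 != str "1".
    print(1 in new_keys)        # False -> every join probe misses
    print(old_keys & new_keys)  # set() -> zero matching join keys

    # Ordering changes too: lexicographic sort puts "10" before "2".
    print(sorted([1, 2, 10]))        # [1, 2, 10]
    print(sorted(["1", "2", "10"]))  # ['1', '10', '2']

No exception is raised at any point; the only visible symptom is the unmatched-row count.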

Root causes

  • No schema contract (expected fields and types are not written down)
  • Loose parsing rules (best-effort type inference on each run; see the sketch after this list)
  • Joining on text without normalisation
  • Backfills that mix old and new formats in the same partition
  • Multiple producers writing the same dataset
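
A minimal sketch of the loose-parsing problem, assuming pandas; the column names and the single bad row are hypothetical. The same column comes back as a different type depending on the day's data, and pinning the dtype turns the drift into a loud failure:

    import io
    import pandas as pd

    clean   = "user_id,amount\n1,5.0\n2,7.5\n"
    drifted = "user_id,amount\n1,5.0\nN/A,7.5\n"  # one bad row upstream

    # Best-effort inference: the same column re-types itself between runs.
    print(pd.read_csv(io.StringIO(clean))["user_id"].dtype)    # int64
    print(pd.read_csv(io.StringIO(drifted))["user_id"].dtype)  # float64 ("N/A" -> NaN)

    # Pinning the dtype fails fast instead of silently re-typing.
    try:
        pd.read_csv(io.StringIO(drifted), dtype={"user_id": "int64"})
    except ValueError as exc:
        print(f"caught drift at parse time: {exc}")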

Fix pattern that works

  1. Freeze an expected schema (fields, types, allowed nulls, allowed ranges)
  2. Validate before heavy work (fail fast or route to a quarantine table; sketched after this list)
  3. Version changes (new schema version, clear cut-over date)
  4. Backfill safely (do not mix formats inside one partition)
  5. Keep a rollback path (previous output or a known-good snapshot)
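
A minimal sketch of steps 1 and 2, assuming pandas; the contract fields, types, and the validate helper are hypothetical names chosen for illustration:

    import pandas as pd

    # Step 1: frozen contract -- field -> (expected dtype, nulls allowed).
    CONTRACT = {
        "user_id": ("int64", False),
        "amount":  ("float64", False),
        "country": ("object", True),
    }

    def validate(df: pd.DataFrame) -> list[str]:
        """Return contract violations; an empty list means the batch passes."""
        problems = []
        missing = set(CONTRACT) - set(df.columns)
        extra = set(df.columns) - set(CONTRACT)
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        if extra:
            problems.append(f"unexpected fields: {sorted(extra)}")
        for field, (dtype, nullable) in CONTRACT.items():
            if field not in df.columns:
                continue
            if str(df[field].dtype) != dtype:
                problems.append(f"{field}: expected {dtype}, got {df[field].dtype}")
            if not nullable and df[field].isna().any():
                problems.append(f"{field}: nulls not allowed")
        return problems

    # Step 2: validate before heavy work -- fail fast or quarantine.
    batch = pd.DataFrame({"user_id": [1, 2], "amount": [5.0, 7.5]})
    violations = validate(batch)
    if violations:
        # Alternative: write the batch to a quarantine table and alert.
        raise ValueError("; ".join(violations))

Here the batch is missing the country field, so the run stops before any join, with a message naming the exact violation.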

Checks that catch schema drift

  • Schema check: expected field set and expected types
  • Row count check: day-over-day change within a bound
  • Null-rate check on key fields
  • Join quality check: match rate and duplicate key rate
  • Value checks: allowed ranges and allowed category sets
  • Sample diff: a small stable sample compared across runs

Checks are cheap. Run them before expensive joins and wide shuffles.
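
A minimal sketch of four of these checks, assuming pandas; the thresholds and field names are illustrative, not recommendations:

    import pandas as pd

    def row_count_ok(today: int, yesterday: int, max_change: float = 0.2) -> bool:
        """Day-over-day row-count change stays within a bound."""
        if yesterday == 0:
            return today == 0
        return abs(today - yesterday) / yesterday <= max_change

    def null_rate_ok(df: pd.DataFrame, field: str, max_rate: float = 0.01) -> bool:
        """Null rate on a key field stays under a threshold."""
        return df[field].isna().mean() <= max_rate

    def join_quality(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
        """Match rate and duplicate-key rate, measured before the real join."""
        return {
            "match_rate": left[key].isin(right[key]).mean(),
            "duplicate_key_rate": right[key].duplicated().mean(),
        }

    def categories_ok(df: pd.DataFrame, field: str, allowed: set) -> bool:
        """No category values outside the allowed set."""
        return set(df[field].dropna().unique()) <= allowed

Each check reads a single column or a pair of key columns, so all of them can run before any join or shuffle.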

Evidence to record

This is the minimum record needed to reproduce and review the change; a sketch of one record shape follows the list.

  • Input location and partition used
  • Schema before and after (fields and types)
  • Which checks fired (and by how much)
  • Impact on output (rows, nulls, join match rate)
  • Fix applied (contract, cast rules, version change)
  • Rollback plan (what to restore if needed)
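
A minimal sketch of one evidence record as JSON; every field name and value here is hypothetical:

    import json

    evidence = {
        "input_partition": "events/dt=<run-date>",    # illustrative location
        "schema_before": {"user_id": "int64"},
        "schema_after": {"user_id": "string"},
        "checks_fired": {"null_rate.user_id": 0.31},  # check -> observed value
        "output_impact": {"rows": -1200, "join_match_rate": 0.62},
        "fix_applied": "cast user_id to int64 at ingest; contract v2",
        "rollback": "restore previous output snapshot",
    }
    print(json.dumps(evidence, indent=2))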

Next steps