RAG evaluation checklist
A quick checklist for testing a retrieval-augmented generation (RAG) system. The aim is to reduce wrong answers by checking sources, retrieval, grounding, and failure handling.
1) Data sources
- Sources are listed and owned (no scraped content of unknown origin)
- Each source has a freshness rule (how often it updates)
- Access rules are clear (private vs public)
- Text is cleaned (headers, footers, repeated boilerplate removed)
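A minimal sketch of the cleaning step, assuming pages arrive as plain-text strings and that lines repeated across most pages (headers, footers, boilerplate) can be detected by frequency. The function name and the 0.8 threshold are illustrative assumptions, not values from a specific tool.

```python
from collections import Counter

def strip_repeated_boilerplate(pages, threshold=0.8):
    """Drop lines that appear on a large fraction of pages (headers, footers).

    pages: list of page texts. threshold: fraction of pages a line must
    appear on before it is treated as boilerplate (an assumed default).
    """
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    boilerplate = {line for line, count in line_counts.items() if count >= cutoff}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```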
2) Index and chunks
- Chunk size is chosen deliberately (not left at a library default)
- Chunks keep meaning (no mid-table or mid-sentence splits)
- Each chunk stores: source, section, date, and a stable id
- Duplicates are reduced (near-duplicate text does not flood retrieval)
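One way to carry the per-chunk metadata and reduce near-duplicates, sketched with a simple dataclass and word-shingle similarity. The field names, shingle size, and 0.9 cutoff are assumptions for illustration, not a prescribed schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # where the text came from
    section: str   # heading or section title
    date: str      # last-updated date of the source
    chunk_id: str  # stable id derived from source + content

def make_chunk(text, source, section, date):
    # Stable id: hashing source + text means re-indexing the same content
    # produces the same id.
    chunk_id = hashlib.sha256(f"{source}|{text}".encode()).hexdigest()[:16]
    return Chunk(text=text, source=source, section=section, date=date, chunk_id=chunk_id)

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def drop_near_duplicates(chunks, cutoff=0.9):
    """Keep the first chunk of each near-duplicate group (Jaccard on word shingles)."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        s = shingles(chunk.text)
        is_dup = any(
            len(s & other) / max(1, len(s | other)) >= cutoff
            for other in kept_shingles
        )
        if not is_dup:
            kept.append(chunk)
            kept_shingles.append(s)
    return kept
```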
3) Retrieval checks
- Top-k results include the right source for known queries
- Empty retrieval is handled (return no answer or ask the user for more detail)
- Recall risk is tracked (how often the right text is missing; see the check after this list)
- Query rewriting is measured (does it help or hurt retrieval)
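A hedged sketch of the recall check, assuming a `retrieve(query, k)` function that returns objects with a `source` attribute and a small set of known query / expected-source pairs; both are placeholders for whatever the system actually exposes. Running it before and after index or prompt changes makes recall regressions visible.

```python
def recall_at_k(known_queries, retrieve, k=5):
    """known_queries: list of (query, expected_source) pairs.

    Returns the fraction of queries whose expected source appears in the
    top-k results, plus the misses so they can be inspected by hand.
    """
    hits, misses = 0, []
    for query, expected_source in known_queries:
        results = retrieve(query, k=k)
        if not results:
            # Empty retrieval: record it instead of letting the model answer unaided.
            misses.append((query, "EMPTY_RETRIEVAL"))
            continue
        if any(r.source == expected_source for r in results):
            hits += 1
        else:
            misses.append((query, expected_source))
    return hits / max(1, len(known_queries)), misses
```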
4) Grounding checks
- Answers cite the retrieved text or quote small spans (see the grounding check after this list)
- When evidence is weak, the system says it is unsure
- Confident-sounding text without evidence is blocked
- Output format rules are enforced (tables, bullet lists, JSON, etc)
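A minimal grounding check, assuming the answer marks quoted spans with double quotes and that retrieved chunks are plain strings; any span that appears in no chunk is flagged as unsupported. The quoting convention and the 10-character minimum are assumptions, not a standard.

```python
import re

def unsupported_quotes(answer, retrieved_chunks):
    """Return quoted spans in the answer that appear in no retrieved chunk.

    Comparison is a simple case-insensitive substring check; very short
    quotes are ignored to cut down on noise.
    """
    spans = re.findall(r'"([^"]{10,})"', answer)
    return [
        span for span in spans
        if not any(span.lower() in chunk.lower() for chunk in retrieved_chunks)
    ]
```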
5) Safety and refusal rules
- Restricted topics trigger refusal or safe redirect
- Personal data is not exposed in responses
- Prompt injection is handled (ignore unsafe instructions in retrieved text)
- Tool calls have allow-lists and hard limits
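One way to enforce the allow-list and hard limits on tool calls, sketched as a small wrapper; the tool names and the per-request cap of five calls are illustrative assumptions.

```python
ALLOWED_TOOLS = {"search_docs", "lookup_order"}  # illustrative allow-list
MAX_TOOL_CALLS = 5                               # assumed hard cap per request

class ToolPolicyError(Exception):
    pass

def run_tool_calls(calls, tools):
    """calls: list of (tool_name, args_dict) requested by the model.
    tools: dict mapping allowed tool names to callables.

    Rejects tools outside the allow-list and stops after the hard cap.
    """
    results = []
    for i, (name, args) in enumerate(calls):
        if i >= MAX_TOOL_CALLS:
            raise ToolPolicyError(f"more than {MAX_TOOL_CALLS} tool calls in one request")
        if name not in ALLOWED_TOOLS:
            raise ToolPolicyError(f"tool {name!r} is not on the allow-list")
        results.append(tools[name](**args))
    return results
```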
6) Cost, latency, and fallbacks
- A latency budget exists and is measured (p50/p95)
- Token use is capped (max input and output tokens)
- Caching is used where safe
- Fallback exists (smaller model, keyword search, human hand-off)
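A sketch covering the latency and fallback items together: time each request, report p50/p95, and fall back when the primary path fails. The budget value and the `primary_answer` / `keyword_fallback` callables are placeholders for the real system.

```python
import statistics
import time

LATENCY_BUDGET_S = 3.0  # assumed per-request budget; adjust to your SLO
latencies = []

def answer_with_fallback(query, primary_answer, keyword_fallback):
    """Call the primary RAG path; on error, use the keyword fallback instead."""
    start = time.monotonic()
    try:
        result = primary_answer(query)
    except Exception:
        result = keyword_fallback(query)
    elapsed = time.monotonic() - start
    latencies.append(elapsed)
    if elapsed > LATENCY_BUDGET_S:
        # Over budget: keep the answer but flag it for the latency report.
        print(f"over budget: {elapsed:.2f}s for {query!r}")
    return result

def latency_report():
    """Return p50/p95 over recorded latencies (needs at least two samples)."""
    if len(latencies) < 2:
        return {}
    cuts = statistics.quantiles(latencies, n=20)  # 19 cut points
    return {"p50": cuts[9], "p95": cuts[18], "count": len(latencies)}
```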
7) Evaluation set
- A test set of 20–50 real questions exists
- Questions include hard cases (ambiguous, rare, long context)
- Each question has an expected source or a valid “cannot answer”
- Results are tracked over time (before/after changes)
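A minimal harness for the evaluation set, assuming questions live in a JSONL file with `question`, `expected_source`, and optional `cannot_answer` fields, and that the system exposes an `answer(question)` call returning the answer text plus the sources it used; all of these names are assumptions. Results are appended with a timestamp so runs can be compared before and after changes.

```python
import json
import time

def run_eval(questions_path, answer, results_path="eval_results.jsonl"):
    """questions_path: JSONL with question / expected_source / cannot_answer fields.
    answer: callable returning (text, sources_used). Both shapes are assumed.
    Appends one result line per question and returns the pass rate.
    """
    with open(questions_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]

    passed = 0
    with open(results_path, "a") as out:
        for case in cases:
            text, sources_used = answer(case["question"])
            if case.get("cannot_answer"):
                # A valid result is an explicit refusal or no sources at all.
                ok = "cannot answer" in text.lower() or not sources_used
            else:
                ok = case["expected_source"] in sources_used
            passed += ok
            out.write(json.dumps({
                "ts": time.time(),
                "question": case["question"],
                "ok": ok,
            }) + "\n")
    return passed / max(1, len(cases))
```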