RAG evaluation checklist
A quick checklist for testing a retrieval-augmented generation (RAG) system. The aim is to reduce wrong answers by checking sources, retrieval, grounding, and failure handling.
1) Data sources
- Sources are listed and owned (no scraped content of unknown origin)
- Each source has a freshness rule (how often it updates)
- Access rules are clear (private vs public)
- Text is cleaned (headers, footers, repeated boilerplate removed)
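A minimal sketch of the cleaning step, assuming pages arrive as plain-text strings and that lines repeated across most pages (headers, footers, boilerplate) can be detected by frequency. The function name and the 0.8 threshold are illustrative assumptions, not values from a specific tool.

```python
from collections import Counter

def strip_repeated_boilerplate(pages, threshold=0.8):
    """Drop lines that appear on a large fraction of pages (headers, footers).

    pages: list of page texts. threshold: fraction of pages a line must
    appear on before it is treated as boilerplate (an assumed default).
    """
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    boilerplate = {line for line, count in line_counts.items() if count >= cutoff}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned
```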
2) Index and chunks
- Chunk size is chosen deliberately (not left at a library default)
- Chunks keep meaning (no mid-table or mid-sentence splits)
- Each chunk stores: source, section, date, and a stable id
- Duplicates are reduced (near-duplicate text does not flood retrieval)
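One way to carry the per-chunk metadata and reduce near-duplicates, sketched with a simple dataclass and word-shingle similarity. The field names, shingle size, and 0.9 cutoff are assumptions for illustration, not a prescribed schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # where the text came from
    section: str   # heading or section title
    date: str      # last-updated date of the source
    chunk_id: str  # stable id derived from source + content

def make_chunk(text, source, section, date):
    # Stable id: hashing source + text means re-indexing the same content
    # produces the same id.
    chunk_id = hashlib.sha256(f"{source}|{text}".encode()).hexdigest()[:16]
    return Chunk(text=text, source=source, section=section, date=date, chunk_id=chunk_id)

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def drop_near_duplicates(chunks, cutoff=0.9):
    """Keep the first chunk of each near-duplicate group (Jaccard on word shingles)."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        s = shingles(chunk.text)
        is_dup = any(
            len(s & other) / max(1, len(s | other)) >= cutoff
            for other in kept_shingles
        )
        if not is_dup:
            kept.append(chunk)
            kept_shingles.append(s)
    return kept
```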
3) Retrieval checks
- Top-k results include the right source for known queries
- Empty retrieval is handled (return no answer or ask the user for more detail)
- Recall risk is tracked (how often the right text is missing; see the check after this list)
- Query rewriting is measured (does it help or hurt retrieval)
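A hedged sketch of the recall check, assuming a `retrieve(query, k)` function that returns objects with a `source` attribute and a small set of known query / expected-source pairs; both are placeholders for whatever the system actually exposes. Running it before and after index or prompt changes makes recall regressions visible.

```python
def recall_at_k(known_queries, retrieve, k=5):
    """known_queries: list of (query, expected_source) pairs.

    Returns the fraction of queries whose expected source appears in the
    top-k results, plus the misses so they can be inspected by hand.
    """
    hits, misses = 0, []
    for query, expected_source in known_queries:
        results = retrieve(query, k=k)
        if not results:
            # Empty retrieval: record it instead of letting the model answer unaided.
            misses.append((query, "EMPTY_RETRIEVAL"))
            continue
        if any(r.source == expected_source for r in results):
            hits += 1
        else:
            misses.append((query, expected_source))
    return hits / max(1, len(known_queries)), misses
```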
4) Grounding checks
- Answers cite the retrieved text or quote small spans (see the grounding check after this list)
- When evidence is weak, the system says it is unsure
- Confident-sounding text without evidence is blocked
- Output format rules are enforced (tables, bullet lists, JSON, etc)
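A minimal grounding check, assuming the answer marks quoted spans with double quotes and that retrieved chunks are plain strings; any span that appears in no chunk is flagged as unsupported. The quoting convention and the 10-character minimum are assumptions, not a standard.

```python
import re

def unsupported_quotes(answer, retrieved_chunks):
    """Return quoted spans in the answer that appear in no retrieved chunk.

    Comparison is a simple case-insensitive substring check; very short
    quotes are ignored to cut down on noise.
    """
    spans = re.findall(r'"([^"]{10,})"', answer)
    return [
        span for span in spans
        if not any(span.lower() in chunk.lower() for chunk in retrieved_chunks)
    ]
```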
5) Safety and refusal rules
- Restricted topics trigger refusal or safe redirect
- Personal data is not exposed in responses
- Prompt injection is handled (ignore unsafe instructions in retrieved text)
- Tool calls have allow-lists and hard limits
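One way to enforce the allow-list and hard limits on tool calls, sketched as a small wrapper; the tool names and the per-request cap of five calls are illustrative assumptions.

```python
ALLOWED_TOOLS = {"search_docs", "lookup_order"}  # illustrative allow-list
MAX_TOOL_CALLS = 5                               # assumed hard cap per request

class ToolPolicyError(Exception):
    pass

def run_tool_calls(calls, tools):
    """calls: list of (tool_name, args_dict) requested by the model.
    tools: dict mapping allowed tool names to callables.

    Rejects tools outside the allow-list and stops after the hard cap.
    """
    results = []
    for i, (name, args) in enumerate(calls):
        if i >= MAX_TOOL_CALLS:
            raise ToolPolicyError(f"more than {MAX_TOOL_CALLS} tool calls in one request")
        if name not in ALLOWED_TOOLS:
            raise ToolPolicyError(f"tool {name!r} is not on the allow-list")
        results.append(tools[name](**args))
    return results
```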
6) Cost, latency, and fallbacks
- A latency budget exists and is measured (p50/p95)
- Token use is capped (max input and output tokens)
- Caching is used where safe
- Fallback exists (smaller model, keyword search, human hand-off)
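A sketch covering the latency and fallback items together: time each request, report p50/p95, and fall back when the primary path fails. The budget value and the `primary_answer` / `keyword_fallback` callables are placeholders for the real system.

```python
import statistics
import time

LATENCY_BUDGET_S = 3.0  # assumed per-request budget; adjust to your SLO
latencies = []

def answer_with_fallback(query, primary_answer, keyword_fallback):
    """Call the primary RAG path; on error, use the keyword fallback instead."""
    start = time.monotonic()
    try:
        result = primary_answer(query)
    except Exception:
        result = keyword_fallback(query)
    elapsed = time.monotonic() - start
    latencies.append(elapsed)
    if elapsed > LATENCY_BUDGET_S:
        # Over budget: keep the answer but flag it for the latency report.
        print(f"over budget: {elapsed:.2f}s for {query!r}")
    return result

def latency_report():
    """Return p50/p95 over recorded latencies (needs at least two samples)."""
    if len(latencies) < 2:
        return {}
    cuts = statistics.quantiles(latencies, n=20)  # 19 cut points
    return {"p50": cuts[9], "p95": cuts[18], "count": len(latencies)}
```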
7) Evaluation set
- A test set of 20–50 real questions exists
- Questions include hard cases (ambiguous, rare, long context)
- Each question has an expected source or a valid “cannot answer”
- Results are tracked over time (before/after changes)
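A minimal harness for the evaluation set, assuming questions live in a JSONL file with `question`, `expected_source`, and optional `cannot_answer` fields, and that the system exposes an `answer(question)` call returning the answer text plus the sources it used; all of these names are assumptions. Results are appended with a timestamp so runs can be compared before and after changes.

```python
import json
import time

def run_eval(questions_path, answer, results_path="eval_results.jsonl"):
    """questions_path: JSONL with question / expected_source / cannot_answer fields.
    answer: callable returning (text, sources_used). Both shapes are assumed.
    Appends one result line per question and returns the pass rate.
    """
    with open(questions_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]

    passed = 0
    with open(results_path, "a") as out:
        for case in cases:
            text, sources_used = answer(case["question"])
            if case.get("cannot_answer"):
                # A valid result is an explicit refusal or no sources at all.
                ok = "cannot answer" in text.lower() or not sources_used
            else:
                ok = case["expected_source"] in sources_used
            passed += ok
            out.write(json.dumps({
                "ts": time.time(),
                "question": case["question"],
                "ok": ok,
            }) + "\n")
    return passed / max(1, len(cases))
```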