A RAG evaluation checklist is different from a model benchmark. You are testing the whole app path: user question, retrieval, ranking, context packing, generation, citations, refusal behavior, and edge cases your users will find on day one.
I learned this the annoying way. A demo can look solid with ten friendly questions, then break when a user asks about a renamed feature or two similar docs with different dates.
RAG makes hallucinations less likely, but it doesn't remove them. It gives the model something to read. Your job is to prove the app retrieved the right thing, used it correctly, and admitted when the answer wasn't in the docs.
Start with the user-visible failure modes
Most teams start by asking, "Is the answer good?" That's too vague for a release gate.
For RAG apps, split quality into failures a user can feel:
- The app retrieves the wrong document.
- The app retrieves the right document but ignores it.
- The app answers from memory instead of context.
- The citation points to a page that doesn't support the claim.
- The answer is technically true but too incomplete to be useful.
- The app refuses questions it should answer.
- The app answers questions it should refuse.
That list is more useful than a single quality score. It tells you where to debug.
If retrieval failed, changing the prompt won't fix much. If retrieval worked but the answer drifted, the generator or citation policy needs work. If the app behaves well on normal questions but fails on changed docs, you need regression tests.
This is why RAG evaluation belongs at the app level. If you only test the base model, you miss the parts your team owns.
Build a test set that looks like real usage
A good RAG test set starts with real questions or realistic product tasks. Don't let the test set become a collection of clean FAQ prompts that nobody would type.
I like to group cases by intent:
| Test group | Example case | What it catches |
|---|---|---|
| Direct lookup | "What is the refund window for annual plans?" | Basic retrieval and answer accuracy |
| Multi-hop lookup | "Can a workspace admin export invoices after downgrading?" | Context stitching across docs |
| Conflicting docs | Old policy page versus current policy page | Freshness, ranking, and citation choice |
| Missing answer | A feature that isn't documented | Refusal and honesty |
| Ambiguous wording | "Can I share this with my team?" | Clarifying behavior and over-answering |
| Citation stress | Claim that needs a specific paragraph | Source support, not just source presence |
Start small. Fifty carefully chosen cases beat five hundred synthetic ones you never inspect.
Each test case should include the user question, expected behavior, acceptable source documents, and a short note. For factual questions, write a reference answer. For questions that should be refused, write the refusal rule.
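To make those fields concrete, here is a minimal sketch of a test case record in Python. The class name, field names, and the three `expected_behavior` values are my own assumptions for illustration, not a format any tool requires.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RagTestCase:
    """One reviewed test case. All names here are illustrative assumptions."""
    case_id: str
    group: str                      # e.g. "direct_lookup", "missing_answer"
    question: str
    expected_behavior: str          # assumed values: "answer", "refuse", "clarify"
    acceptable_sources: list[str] = field(default_factory=list)
    reference_answer: Optional[str] = None
    refusal_rule: Optional[str] = None
    note: str = ""

    def problems(self) -> list[str]:
        """Return what is missing before this case is ready for review."""
        issues = []
        if self.expected_behavior == "answer":
            if not self.reference_answer:
                issues.append("factual case needs a reference answer")
            if not self.acceptable_sources:
                issues.append("factual case needs at least one acceptable source")
        elif self.expected_behavior == "refuse":
            if not self.refusal_rule:
                issues.append("refusal case needs a refusal rule")
        elif self.expected_behavior != "clarify":
            issues.append(f"unknown expected_behavior: {self.expected_behavior}")
        return issues


# A missing-answer case that is not finished yet: no refusal rule written.
draft = RagTestCase(
    case_id="missing-001",
    group="missing_answer",
    question="Does the app support a feature that is not documented?",
    expected_behavior="refuse",
)
print(draft.problems())
```

Running `problems()` over the whole set before each review pass is a cheap way to catch cases that were added in a hurry.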
Synthetic test generation can help fill gaps, and tools like LlamaIndex can generate questions from your data. Still, humans should review the final set because synthetic questions often sound too polite.
Check retrieval before you check generation
RAG failures often hide inside the answer. The model may write a convincing response even when retrieval was poor.
So test retrieval separately.
For each query, log the top retrieved chunks, document IDs, scores, metadata filters, and final context. Then grade retrieval with a few plain checks:
- Did at least one expected source appear in the top results?
- Was the best source ranked high enough to survive context trimming?
- Did stale or duplicate content outrank the current doc?
- Did metadata filters remove useful documents?
- Did chunking split the answer away from its heading or date?
LangSmith's RAG evaluation guide separates retrieval relevance from answer groundedness, which is the right mental model. Retrieval relevance asks whether the docs match the question. Groundedness asks whether the answer matches the retrieved docs.
Don't blend those too early. A single score can tell you the app got worse, but it won't tell you whether the retriever, ranker, chunker, or prompt caused the regression.
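The retrieval checks above can be sketched without any framework. This is a rough example that assumes you already log the ranked document IDs per query. The function names, the `stale_ids` set, and the `context_limit` parameter are hypothetical, and the chunking and metadata-filter checks still need a human looking at the logs.

```python
def first_rank(retrieved_ids, wanted_ids):
    """Return the 1-based rank of the first retrieved ID in wanted_ids, or None."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in wanted_ids:
            return rank
    return None


def grade_retrieval(retrieved_ids, acceptable_sources, stale_ids=frozenset(),
                    context_limit=3):
    """Grade one query's ranked results with plain, separate checks.

    context_limit is an assumed number of top results that survive
    context trimming in your app.
    """
    expected_rank = first_rank(retrieved_ids, acceptable_sources)
    stale_rank = first_rank(retrieved_ids, stale_ids)
    return {
        "hit": expected_rank is not None,
        "rank": expected_rank,
        "survives_trimming": expected_rank is not None
                             and expected_rank <= context_limit,
        "stale_outranks_current": stale_rank is not None
                                  and (expected_rank is None
                                       or stale_rank < expected_rank),
        "duplicate_results": len(retrieved_ids) - len(set(retrieved_ids)),
    }


# Hypothetical document IDs for a refund question.
report = grade_retrieval(
    retrieved_ids=["refunds-2022", "refunds-current", "faq", "pricing"],
    acceptable_sources={"refunds-current"},
    stale_ids={"refunds-2022"},
)
print(report)
```

Here the right document was found and survives trimming, but an old page outranks it. That is a ranking or corpus problem, and a blended score would have hidden it.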
Grade answers against the retrieved context
After retrieval passes, evaluate the answer itself. I care about four checks before I care about style.
| Answer check | Pass condition | Common failure |
|---|---|---|
| Correctness | The answer matches the reference answer or policy | Looks right but misses an exception |
| Grounding | Every factual claim is supported by retrieved context | Model adds a detail from memory |
| Relevance | The answer addresses the actual user question | Generic summary instead of answer |
| Usefulness | The answer gives enough detail for the user to act | Accurate but too thin |
Ragas names related metrics like faithfulness, answer relevancy, context precision, and context recall. LlamaIndex documents similar evaluators for correctness, faithfulness, context relevancy, and answer relevancy. The names vary, but the split is practical: retrieve the right context, then prove the answer stays inside it.
Use LLM-as-judge carefully. It helps triage many cases after every prompt or retrieval change. But don't treat judge output as truth.
Keep a human-reviewed gold set for the cases that matter most. Use judges for broad regression signals. When a judge and a human disagree, update the rubric or mark the case as hard instead of pretending the score is exact.
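One way to keep the judge honest is to measure how often it agrees with the gold set and list the cases where it doesn't. The sketch below assumes both sides produce a simple pass/fail label per case ID. The function name and data shapes are mine, not part of any judge library.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare pass/fail labels on the case IDs both sides graded.

    Returns (agreement_rate, disagreeing_case_ids). The rate is None
    when there is no overlap to compare.
    """
    shared = sorted(set(human_labels) & set(judge_labels))
    if not shared:
        return None, []
    disagreements = [cid for cid in shared
                     if human_labels[cid] != judge_labels[cid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Hypothetical labels: True means the answer passed the grounding check.
human = {"refund-001": True, "seats-004": False, "export-002": True, "sso-007": True}
judge = {"refund-001": True, "seats-004": True, "export-002": True, "sso-007": True}

rate, to_review = judge_agreement(human, judge)
print(rate, to_review)
```

The disagreement list is the useful part. Each entry is either a rubric gap or a genuinely hard case, and both are worth writing down.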
Treat citations as evidence, not decoration
A RAG answer with citations can still be wrong. Sometimes the citation points to the right page but the claim isn't supported there. Sometimes the answer cites a general docs page when it needed a pricing table, changelog, or policy paragraph.
Your checklist should grade citations on support:
- Does each citation open to a real source?
- Does the cited source contain the exact claim?
- Is the cited source current?
- Does the answer avoid uncited factual claims?
- Are multiple citations used only when they add support?
For app developers, this matters more than it sounds. Users trust answers with links. Bad citations make wrong answers more persuasive.
I prefer citation tests that ask for narrow claims. For example, don't only ask "How does billing work?" Ask "What happens to unused seats when I downgrade mid-cycle?" That forces the app to cite the exact billing rule, not a nearby docs page.
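A first-pass citation check can be automated if the app returns, for each claim, the source it cites and the passage it relied on. That output shape is an assumption, and so are the names below. Verbatim quote matching is only a cheap proxy for support, so a human or a judge still has to confirm that the quoted passage backs the claim.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def check_citations(claims, corpus, stale_ids=frozenset()):
    """Return (claim_text, reason) pairs for every citation problem found.

    claims: list of dicts with "text", "source_id", and "quote" (assumed shape).
    corpus: dict mapping source_id to the source's full text.
    """
    failures = []
    for claim in claims:
        source_id = claim.get("source_id")
        if not source_id:
            failures.append((claim["text"], "uncited claim"))
            continue
        if source_id not in corpus:
            failures.append((claim["text"], "citation does not resolve"))
            continue
        if source_id in stale_ids:
            failures.append((claim["text"], "cited source is stale"))
        quote = claim.get("quote") or ""
        if not quote or normalize(quote) not in normalize(corpus[source_id]):
            failures.append((claim["text"], "quote not found in cited source"))
    return failures


# Placeholder corpus text for illustration only, not a real policy.
corpus = {
    "billing/downgrades": "Unused seats are credited to the next invoice.",
    "billing/overview": "Billing runs on a monthly cycle.",
}
claims = [
    {"text": "Unused seats are credited.", "source_id": "billing/downgrades",
     "quote": "unused seats are  credited"},
    {"text": "Credits expire after a year.", "source_id": "billing/overview",
     "quote": "credits expire after a year"},
    {"text": "Seats can be re-added later.", "source_id": None},
]
print(check_citations(claims, corpus))
```

The second claim is the case that matters most: it cites a real, nearby billing page that never says what the answer claims.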
Add regression checks to your release flow
RAG apps change even when the model doesn't. Your docs change. Embeddings change. Chunking changes. Ranking weights change. A product manager edits one paragraph and suddenly the answer to a support question gets worse.
That is why evaluation needs to run as a regression suite. Rerun it whenever any of these change:
- System prompts or answer format instructions.
- Embedding model or retrieval parameters.
- Chunk size, overlap, or document parser.
- Reranking logic.
- Citation formatting.
- The source corpus itself.
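Changes like these are easy to miss, so one option is to fingerprint everything that affects answers and rerun the suite when the fingerprint moves. This is a minimal sketch using only the standard library. The config keys and values are hypothetical placeholders, and `corpus_hash` assumes you already compute some hash of the ingested documents.

```python
import hashlib
import json
from typing import Optional


def config_fingerprint(config: dict) -> str:
    """Hash a JSON-serializable config so that key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_eval_run(current_config: dict,
                   last_evaluated_fingerprint: Optional[str]) -> bool:
    """True when the config differs from the one the suite last ran against."""
    return config_fingerprint(current_config) != last_evaluated_fingerprint


# Hypothetical settings; replace with whatever your app actually uses.
rag_config = {
    "system_prompt": "Answer only from the provided context.",
    "embedding_model": "example-embed-v1",
    "top_k": 8,
    "chunk_size": 512,
    "chunk_overlap": 64,
    "reranker": "none",
    "citation_format": "inline",
    "corpus_hash": "abc123",
}

last_run = config_fingerprint(rag_config)
changed = dict(rag_config, chunk_size=256)
print(needs_eval_run(rag_config, last_run), needs_eval_run(changed, last_run))
```

Storing the fingerprint next to each eval run also tells you later which configuration a given score belongs to.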
For model-specific upgrade work, keep that evaluation separate from app-level RAG testing. If you are swapping models, use a model upgrade checklist like the GPT-5.6 evaluation checklist. For RAG, the question is narrower: did this app still retrieve and cite the right evidence after the change?
If your app uses the Vercel AI SDK, tests can isolate model behavior with controlled outputs instead of calling a live model for every unit test. That won't replace RAG evals, but it helps keep UI and tool-call behavior stable. Teams planning SDK migrations can keep compatibility work in the AI SDK 7 upgrade checklist while this checklist stays focused on whether users can trust the answers.
Use a simple scoring rubric
A useful rubric is boring. Boring rubrics are easier to apply twice.
For each test case, score these fields from 0 to 2:
| Field | 0 | 1 | 2 |
|---|---|---|---|
| Retrieval | Expected source missing | Source present but ranked low or mixed with junk | Right source appears high |
| Grounding | Unsupported claims | Mostly supported with one weak claim | Fully supported by context |
| Citation | Missing or wrong source | Source is related but not exact | Citation supports the claim |
| Answer | Wrong or unsafe | Partly useful | Correct and usable |
| Refusal | Answers when it shouldn't, or refuses a valid question | Unclear boundary | Correct boundary behavior |
Then define release gates:
- No critical test case can score 0 on retrieval or grounding.
- Citation score must be 2 for legal, billing, security, and medical-adjacent content.
- Overall pass rate must stay above your current baseline.
- Any new failure in a previously passing critical case blocks release.
The exact numbers matter less than consistency. If you change the rubric every week, the trend line becomes theater.
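The gates above are simple enough to encode directly. The sketch below makes a few assumptions that the rubric itself leaves open: a case "passes" when no field scores 0, the pass-rate gate blocks when the rate drops below the baseline, and each result carries a `critical` flag and a set of content `tags`. All names are illustrative.

```python
CITATION_STRICT_TAGS = {"legal", "billing", "security", "medical"}


def case_passes(scores):
    """Assumed definition: a case passes when no rubric field scores 0."""
    return min(scores.values()) >= 1


def release_blockers(results, baseline_pass_rate, previously_passing):
    """Return a list of human-readable reasons to block the release."""
    if not results:
        return ["no eval results to gate on"]
    blockers = []
    for r in results:
        cid, scores = r["case_id"], r["scores"]
        if r["critical"] and (scores["retrieval"] == 0 or scores["grounding"] == 0):
            blockers.append(f"{cid}: critical case scored 0 on retrieval or grounding")
        if r["tags"] & CITATION_STRICT_TAGS and scores["citation"] < 2:
            blockers.append(f"{cid}: citation score below 2 on strict content")
        if r["critical"] and cid in previously_passing and not case_passes(scores):
            blockers.append(f"{cid}: new failure in a previously passing critical case")
    pass_rate = sum(case_passes(r["scores"]) for r in results) / len(results)
    if pass_rate < baseline_pass_rate:
        blockers.append(f"pass rate {pass_rate:.2f} is below baseline {baseline_pass_rate:.2f}")
    return blockers


# Hypothetical scored results using the 0 to 2 rubric.
results = [
    {"case_id": "billing-001", "critical": True, "tags": {"billing"},
     "scores": {"retrieval": 2, "grounding": 2, "citation": 1, "answer": 2, "refusal": 2}},
    {"case_id": "faq-002", "critical": False, "tags": set(),
     "scores": {"retrieval": 2, "grounding": 2, "citation": 2, "answer": 2, "refusal": 2}},
]
for reason in release_blockers(results, baseline_pass_rate=0.9,
                               previously_passing={"billing-001"}):
    print(reason)
```

Returning reasons instead of a single boolean keeps the gate debuggable: the person who gets blocked can see which case and which rule did it.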
Inspect failures like product bugs
When a case fails, label the failure before changing the prompt.
Use a small taxonomy:
- Corpus issue: the source doc is missing, stale, duplicated, or unclear.
- Parsing issue: tables, PDFs, headings, or code blocks were ingested badly.
- Chunking issue: the answer was split from the context that explains it.
- Retrieval issue: the right chunk exists but wasn't found.
- Ranking issue: the right chunk was found but buried.
- Generation issue: the model ignored or stretched the context.
- Citation issue: the answer was right but the source link was weak.
- Product issue: the UI hid uncertainty or made citations hard to check.
This turns evals into engineering work instead of vague debate. Save every failed case as a regression test before fixing it, or the same bug comes back under a new prompt, model, or ingestion run.
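If the labels live in code rather than in free-text comments, counting them is trivial. Here is a small sketch with the taxonomy as an enum. The helper names and case IDs are hypothetical, and the labels themselves still come from a person reading the failure.

```python
from collections import Counter
from enum import Enum


class FailureLabel(Enum):
    CORPUS = "corpus"
    PARSING = "parsing"
    CHUNKING = "chunking"
    RETRIEVAL = "retrieval"
    RANKING = "ranking"
    GENERATION = "generation"
    CITATION = "citation"
    PRODUCT = "product"


def summarize_failures(labeled_failures):
    """Count failures per label, most common first."""
    counts = Counter(label for _case_id, label in labeled_failures)
    return counts.most_common()


def missing_regression_cases(labeled_failures, existing_suite):
    """Return failed case IDs that are not yet saved in the regression suite."""
    failed_ids = {case_id for case_id, _label in labeled_failures}
    return sorted(failed_ids - set(existing_suite))


# Hypothetical failures from one eval run.
failures = [
    ("refund-001", FailureLabel.CHUNKING),
    ("seats-004", FailureLabel.CHUNKING),
    ("export-002", FailureLabel.RANKING),
]
print(summarize_failures(failures))
print(missing_regression_cases(failures, existing_suite={"refund-001"}))
```

Two chunking failures in one run point at the ingestion pipeline, not the prompt, which is exactly the kind of signal a single quality score hides.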
The pre-launch RAG evaluation checklist
Before real users see the app, I want this checklist filled in:
| Area | Check |
|---|---|
| Test set | At least 50 reviewed cases across direct lookup, multi-hop, missing answers, stale docs, and citation stress |
| Retrieval logs | Top chunks, scores, metadata, and final model context are stored for every run |
| Retrieval quality | Expected sources appear high enough to survive context limits |
| Answer quality | Correctness, grounding, relevance, and usefulness are scored separately |
| Citations | Each factual claim has a source that supports it directly |
| Refusals | The app refuses questions outside the corpus without inventing an answer |
| Regression | The suite runs before prompt, retrieval, model, SDK, or corpus changes ship |
| Failure taxonomy | Failures are labeled so the team knows where to fix them |
| Human review | Critical cases have human-reviewed expected answers and source mappings |
| Release gate | The team has written thresholds for blocking release |
This is cheaper than discovering, through users, that your support bot quotes old pricing or your internal assistant cites a policy that doesn't say what the answer claims.
Final take
RAG evaluation is mostly about discipline. Keep retrieval, grounding, citations, and regressions separate long enough to see which part broke.
A small, reviewed test set with clear failure labels will beat a giant spreadsheet of vague scores. Start there. Then add automation where it saves time without hiding the mistakes.



