Karya Semi

RAG Evaluation Checklist for AI Apps Before Users See Them

A practical RAG evaluation checklist for app developers: test retrieval, citations, answer grounding, regressions, and release gates before shipping AI features.

Dian Rijal Asyrof / June 30, 2026 / 7 min read

A RAG evaluation checklist is different from a model benchmark. You are testing the whole app path: user question, retrieval, ranking, context packing, generation, citations, refusal behavior, and edge cases your users will find on day one.

I learned this the annoying way. A demo can look solid with ten friendly questions, then break when a user asks about a renamed feature or two similar docs with different dates.

RAG makes hallucinations less likely, but it doesn't remove them. It gives the model something to read. Your job is to prove the app retrieved the right thing, used it correctly, and admitted when the answer wasn't in the docs.

Start with the user-visible failure modes

Most teams start by asking, "Is the answer good?" That's too vague for a release gate.

For RAG apps, split quality into failures a user can feel:

  • The app retrieves the wrong document.
  • The app retrieves the right document but ignores it.
  • The app answers from memory instead of context.
  • The citation points to a page that doesn't support the claim.
  • The answer is technically true but too incomplete to be useful.
  • The app refuses questions it should answer.
  • The app answers questions it should refuse.

That list is more useful than a single quality score. It tells you where to debug.

If retrieval failed, changing the prompt won't fix much. If retrieval worked but the answer drifted, the generator or citation policy needs work. If the app behaves well on normal questions but fails on changed docs, you need regression tests.

This is why RAG evaluation belongs at the app level. If you only test the base model, you miss the parts your team owns.

Build a test set that looks like real usage

A good RAG test set starts with real questions or realistic product tasks. Don't let the test set become a collection of clean FAQ prompts that nobody would type.

I like to group cases by intent:

| Test group | Example case | What it catches |
|---|---|---|
| Direct lookup | "What is the refund window for annual plans?" | Basic retrieval and answer accuracy |
| Multi-hop lookup | "Can a workspace admin export invoices after downgrading?" | Context stitching across docs |
| Conflicting docs | Old policy page versus current policy page | Freshness, ranking, and citation choice |
| Missing answer | A feature that isn't documented | Refusal and honesty |
| Ambiguous wording | "Can I share this with my team?" | Clarifying behavior and over-answering |
| Citation stress | Claim that needs a specific paragraph | Source support, not just source presence |

Start small. Fifty carefully chosen cases beat five hundred synthetic ones you never inspect.

Each test case should include the user question, expected behavior, acceptable source documents, and a short note. For factual questions, write a reference answer. For questions that should be refused, write the refusal rule.
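As a rough sketch of what one case record could look like, here is a small Python dataclass. The class name, field names, and the document ID are all my own placeholders, not part of any library, and the reference answer is left as a placeholder on purpose.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RagTestCase:
    """One evaluation case. All field names here are illustrative."""
    question: str
    expected_behavior: str                     # "answer", "refuse", or "clarify"
    acceptable_sources: list[str] = field(default_factory=list)
    reference_answer: Optional[str] = None     # for factual questions
    refusal_rule: Optional[str] = None         # for questions that should be refused
    note: str = ""

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the case is well formed."""
        problems = []
        if self.expected_behavior == "answer":
            if not self.reference_answer:
                problems.append("factual case needs a reference answer")
            if not self.acceptable_sources:
                problems.append("factual case needs at least one acceptable source")
        elif self.expected_behavior == "refuse":
            if not self.refusal_rule:
                problems.append("refusal case needs a refusal rule")
        elif self.expected_behavior != "clarify":
            problems.append(f"unknown expected_behavior: {self.expected_behavior}")
        return problems


refund_case = RagTestCase(
    question="What is the refund window for annual plans?",
    expected_behavior="answer",
    acceptable_sources=["billing/refund-policy"],  # hypothetical document ID
    reference_answer="<reference answer copied from the current policy doc>",
    note="Direct lookup",
)

undocumented_case = RagTestCase(
    question="Does the product support a feature that isn't documented?",
    expected_behavior="refuse",
    note="Missing answer; refusal rule still to be written",
)

print(refund_case.validate())
print(undocumented_case.validate())
```

Running `validate()` over the whole set before each eval run is a cheap way to catch cases that were added in a hurry without a reference answer or refusal rule.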

Synthetic test generation can help fill gaps, and tools like LlamaIndex can generate questions from your data. Still, humans should review the final set because synthetic questions often sound too polite.

Check retrieval before you check generation

RAG failures often hide inside the answer. The model may write a convincing response even when retrieval was poor.

So test retrieval separately.

For each query, log the top retrieved chunks, document IDs, scores, metadata filters, and final context. Then grade retrieval with a few plain checks:

  • Did at least one expected source appear in the top results?
  • Was the best source ranked high enough to survive context trimming?
  • Did stale or duplicate content outrank the current doc?
  • Did metadata filters remove useful documents?
  • Did chunking split the answer away from its heading or date?
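
The first three checks in that list can be approximated from logged document IDs alone. This is a minimal sketch under a few assumptions: `retrieved_ids` is already in ranked order, `context_limit` is a simple count of chunks that fit in the prompt (real trimming is usually token-based), and you maintain your own list of stale document IDs. The function names are mine.

```python
def first_rank_of(retrieved_ids, wanted_ids):
    """1-based rank of the first retrieved ID that is in wanted_ids, else None."""
    wanted = set(wanted_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in wanted:
            return rank
    return None


def grade_retrieval(retrieved_ids, expected_ids, context_limit, stale_ids=()):
    """Plain retrieval checks for a single query, computed from logged IDs."""
    rank = first_rank_of(retrieved_ids, expected_ids)
    stale_rank = first_rank_of(retrieved_ids, stale_ids)
    return {
        # Did at least one expected source appear at all?
        "hit": rank is not None,
        "rank": rank,
        # Would the best expected source survive a cut at context_limit chunks?
        "survives_trimming": rank is not None and rank <= context_limit,
        # Did a known-stale document rank above the expected one?
        "stale_outranks_current": stale_rank is not None
        and (rank is None or stale_rank < rank),
    }


# Hypothetical log entry: the old pricing page outranks the current one.
report = grade_retrieval(
    retrieved_ids=["pricing-2023", "pricing-2025", "faq"],
    expected_ids=["pricing-2025"],
    context_limit=1,
    stale_ids=["pricing-2023"],
)
print(report)
```

Metadata-filter and chunking problems are harder to detect automatically; for those I would still read the logged chunks by hand.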

LangSmith's RAG evaluation guide separates retrieval relevance from answer groundedness, which is the right mental model. Retrieval relevance asks whether the docs match the question. Groundedness asks whether the answer matches the retrieved docs.

Don't blend those too early. A single score can tell you the app got worse, but it won't tell you whether the retriever, ranker, chunker, or prompt caused the regression.

Grade answers against the retrieved context

After retrieval passes, evaluate the answer itself. I care about four checks before I care about style.

| Answer check | Pass condition | Common failure |
|---|---|---|
| Correctness | The answer matches the reference answer or policy | Looks right but misses an exception |
| Grounding | Every factual claim is supported by retrieved context | Model adds a detail from memory |
| Relevance | The answer addresses the actual user question | Generic summary instead of answer |
| Usefulness | The answer gives enough detail for the user to act | Accurate but too thin |

Ragas names related metrics like faithfulness, answer relevancy, context precision, and context recall. LlamaIndex documents similar evaluators for correctness, faithfulness, context relevancy, and answer relevancy. The names vary, but the split is practical: retrieve the right context, then prove the answer stays inside it.

Use LLM-as-judge carefully. It helps triage many cases after every prompt or retrieval change. But don't treat judge output as truth.

Keep a human-reviewed gold set for the cases that matter most. Use judges for broad regression signals. When a judge and a human disagree, update the rubric or mark the case as hard instead of pretending the score is exact.
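
To find those disagreements in the first place, it helps to compare judge verdicts with human verdicts on the gold set. A minimal sketch, assuming both sets of labels are stored as simple pass/fail booleans keyed by case ID (the case IDs below are made up):

```python
def judge_agreement(human_labels, judge_labels):
    """Compare judge verdicts to human verdicts on the gold set.

    Both arguments map case ID -> bool (True means "pass").
    Returns (agreement rate, sorted list of case IDs where they disagree).
    """
    shared = sorted(set(human_labels) & set(judge_labels))
    if not shared:
        raise ValueError("no overlapping cases to compare")
    disagreements = [c for c in shared if human_labels[c] != judge_labels[c]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


human = {"refund-window": True, "export-invoices": False,
         "share-with-team": True, "unused-seats": True}
judge = {"refund-window": True, "export-invoices": True,
         "share-with-team": True, "unused-seats": True}

rate, to_review = judge_agreement(human, judge)
print(rate, to_review)
```

The disagreement list is the useful part. Each entry is either a rubric that needs tightening or a case worth marking as hard.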

Treat citations as evidence, not decoration

A RAG answer with citations can still be wrong. Sometimes the citation points to the right page but the claim isn't supported there. Sometimes the answer cites a general docs page when it needed a pricing table, changelog, or policy paragraph.

Your checklist should grade citations on support:

  • Does each citation open to a real source?
  • Does the cited source contain the exact claim?
  • Is the cited source current?
  • Does the answer avoid uncited factual claims?
  • Are multiple citations used only when they add support?
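
Some of these checks can be roughed out in code. The sketch below assumes an upstream step (not shown, and hypothetical) has already split the answer into claims, each with a cited source ID and a supporting quote. Verbatim quote matching is only a crude proxy for "the source contains the claim", so treat a pass here as a reason to skip human review less often, not as proof. The sample policy text is invented for the example.

```python
def normalize(text):
    """Lowercase and collapse whitespace so small formatting differences don't matter."""
    return " ".join(text.lower().split())


def check_citations(claims, sources):
    """Return (claim text, problem) pairs.

    claims:  list of {"text", "source_id", "quote"} dicts
    sources: dict of source ID -> {"text", "current"}
    """
    findings = []
    for claim in claims:
        source_id = claim.get("source_id")
        if source_id is None:
            findings.append((claim["text"], "uncited claim"))
            continue
        source = sources.get(source_id)
        if source is None:
            findings.append((claim["text"], "citation does not resolve"))
            continue
        if normalize(claim["quote"]) not in normalize(source["text"]):
            findings.append((claim["text"], "quote not found in source"))
        if not source["current"]:
            findings.append((claim["text"], "source is stale"))
    return findings


# Invented example data, not real policy text.
sources = {
    "billing/downgrade": {"text": "Unused seats are   credited at the next renewal.",
                          "current": True},
    "billing/old": {"text": "Unused seats are forfeited.", "current": False},
}
claims = [
    {"text": "Unused seats are credited.", "source_id": "billing/downgrade",
     "quote": "unused seats are credited at the next renewal"},
    {"text": "Credits expire after a year.", "source_id": None, "quote": None},
    {"text": "Seats are forfeited.", "source_id": "billing/old",
     "quote": "Unused seats are forfeited."},
]

findings = check_citations(claims, sources)
print(findings)
```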

For app developers, this matters more than it sounds. Users trust answers with links. Bad citations make wrong answers more persuasive.

I prefer citation tests that ask for narrow claims. For example, don't only ask "How does billing work?" Ask "What happens to unused seats when I downgrade mid-cycle?" That forces the app to cite the exact billing rule, not a nearby docs page.

Add regression checks to your release flow

RAG apps change even when the model doesn't. Your docs change. Embeddings change. Chunking changes. Ranking weights change. A product manager edits one paragraph and suddenly the answer to a support question gets worse.

That is why evaluation needs to run as a regression suite. Rerun it whenever any of these change:

  • System prompts or answer format instructions.
  • Embedding model or retrieval parameters.
  • Chunk size, overlap, or document parser.
  • Reranking logic.
  • Citation formatting.
  • The source corpus itself.
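
One way to make sure the suite actually reruns when an item in that list changes is to fingerprint the pipeline configuration and compare it with the fingerprint from the last evaluated run. This is a sketch, not a prescribed setup: the config keys and values are hypothetical, and `corpus_hash` is assumed to come from your own ingestion job.

```python
import hashlib
import json
from typing import Optional


def pipeline_fingerprint(config: dict) -> str:
    """Stable hash of everything that can change RAG behavior.

    Keys are sorted so that reordering the config does not change the hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_eval_run(current_config: dict, last_evaluated: Optional[str]) -> bool:
    """True when the config differs from the last evaluated one (or none exists)."""
    return pipeline_fingerprint(current_config) != last_evaluated


# Hypothetical config; use whatever fields your pipeline really has.
config = {
    "system_prompt_version": "v3",
    "embedding_model": "example-embedding-model",
    "retrieval": {"top_k": 8},
    "chunking": {"size": 512, "overlap": 64, "parser": "markdown"},
    "reranker": "example-reranker",
    "citation_format": "inline-links",
    "corpus_hash": "abc123",
}

last_evaluated = pipeline_fingerprint(config)
changed = {**config, "chunking": {"size": 256, "overlap": 64, "parser": "markdown"}}

print(needs_eval_run(config, last_evaluated))
print(needs_eval_run(changed, last_evaluated))
```

A CI step can then refuse to deploy when `needs_eval_run` is true and no fresh eval report exists for the new fingerprint.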

For model-specific upgrade work, keep that evaluation separate from app-level RAG testing. If you are swapping models, use a model upgrade checklist like the GPT-5.6 evaluation checklist. For RAG, the question is narrower: did this app still retrieve and cite the right evidence after the change?

If your app uses the Vercel AI SDK, tests can isolate model behavior with controlled outputs instead of calling a live model for every unit test. That won't replace RAG evals, but it helps keep UI and tool-call behavior stable. Teams planning SDK migrations can keep compatibility work in the AI SDK 7 upgrade checklist while this checklist stays focused on whether users can trust the answers.

Use a simple scoring rubric

A useful rubric is boring. Boring rubrics are easier to apply twice.

For each test case, score these fields from 0 to 2:

| Field | 0 | 1 | 2 |
|---|---|---|---|
| Retrieval | Expected source missing | Source present but low or mixed with junk | Right source appears high |
| Grounding | Unsupported claims | Mostly supported with one weak claim | Fully supported by context |
| Citation | Missing or wrong source | Source is related but not exact | Citation supports the claim |
| Answer | Wrong or unsafe | Partly useful | Correct and usable |
| Refusal | Answers when it shouldn't, or refuses a valid question | Unclear boundary | Correct boundary behavior |

Then define release gates:

  • No critical test case can score 0 on retrieval or grounding.
  • Citation score must be 2 for legal, billing, security, and medical-like content.
  • Overall pass rate must stay above your current baseline.
  • Any new failure in a previously passing critical case blocks release.
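
These four gates translate fairly directly into a function that returns a list of blockers. In this sketch the result shape, the tag names, and the `passed` flag are my assumptions; what counts as "passed" for a case is left to your team. I also treat a pass rate equal to the baseline as acceptable, which is one reading of "stay above".

```python
SENSITIVE_TAGS = {"legal", "billing", "security", "medical"}


def release_blockers(results, baseline_pass_rate, previously_passing):
    """Return human-readable reasons to block the release (empty list = ship).

    results: list of dicts with "id", "critical", "tags", "scores", "passed"
    previously_passing: set of case IDs that passed in the last released run
    """
    if not results:
        return ["no evaluation results"]
    blockers = []
    for r in results:
        scores = r["scores"]
        if r["critical"] and (scores["retrieval"] == 0 or scores["grounding"] == 0):
            blockers.append(f"{r['id']}: critical case scored 0 on retrieval or grounding")
        if SENSITIVE_TAGS & set(r["tags"]) and scores["citation"] < 2:
            blockers.append(f"{r['id']}: citation score below 2 on sensitive content")
        if r["critical"] and r["id"] in previously_passing and not r["passed"]:
            blockers.append(f"{r['id']}: previously passing critical case now fails")
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if pass_rate < baseline_pass_rate:
        blockers.append(
            f"pass rate {pass_rate:.2f} is below baseline {baseline_pass_rate:.2f}"
        )
    return blockers


# Made-up results for two cases.
results = [
    {"id": "refund-window", "critical": True, "tags": ["billing"],
     "scores": {"retrieval": 2, "grounding": 2, "citation": 1}, "passed": False},
    {"id": "share-with-team", "critical": False, "tags": [],
     "scores": {"retrieval": 2, "grounding": 2, "citation": 2}, "passed": True},
]

blockers = release_blockers(results, baseline_pass_rate=0.5,
                            previously_passing={"refund-window"})
for line in blockers:
    print(line)
```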

The exact numbers matter less than consistency. If you change the rubric every week, the trend line becomes theater.

Inspect failures like product bugs

When a case fails, label the failure before changing the prompt.

Use a small taxonomy:

  • Corpus issue: the source doc is missing, stale, duplicated, or unclear.
  • Parsing issue: tables, PDFs, headings, or code blocks were ingested badly.
  • Chunking issue: the answer was split from the context that explains it.
  • Retrieval issue: the right chunk exists but wasn't found.
  • Ranking issue: the right chunk was found but buried.
  • Generation issue: the model ignored or stretched the context.
  • Citation issue: the answer was right but the source link was weak.
  • Product issue: the UI hid uncertainty or made citations hard to check.

This turns evals into engineering work instead of vague debate. Save every failed case as a regression test before fixing it, or the same bug comes back under a new prompt, model, or ingestion run.
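
If the labels live in code, counting them per run is trivial, and the counts show where the team should spend time. A small sketch using the taxonomy above; the enum and function names are mine, and the labeled failures are invented.

```python
from collections import Counter
from enum import Enum


class FailureLabel(Enum):
    CORPUS = "corpus"
    PARSING = "parsing"
    CHUNKING = "chunking"
    RETRIEVAL = "retrieval"
    RANKING = "ranking"
    GENERATION = "generation"
    CITATION = "citation"
    PRODUCT = "product"


def summarize_failures(labeled_failures):
    """Count failures per label, most common first.

    labeled_failures: iterable of (case ID, FailureLabel) pairs
    """
    counts = Counter(label for _, label in labeled_failures)
    return [(label.value, n) for label, n in counts.most_common()]


# Invented labels from one eval run.
failures = [
    ("refund-window", FailureLabel.RANKING),
    ("export-invoices", FailureLabel.CHUNKING),
    ("unused-seats", FailureLabel.RANKING),
    ("old-pricing", FailureLabel.RANKING),
    ("sso-setup", FailureLabel.CHUNKING),
    ("api-limits", FailureLabel.CORPUS),
]

summary = summarize_failures(failures)
print(summary)
```

Using an enum rather than free-text labels keeps people from inventing a ninth category mid-review, which is usually how a taxonomy stops being useful.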

The pre-launch RAG evaluation checklist

Before real users see the app, I want this checklist filled in:

| Area | Check |
|---|---|
| Test set | At least 50 reviewed cases across direct lookup, multi-hop, missing answers, stale docs, and citation stress |
| Retrieval logs | Top chunks, scores, metadata, and final model context are stored for every run |
| Retrieval quality | Expected sources appear high enough to survive context limits |
| Answer quality | Correctness, grounding, relevance, and usefulness are scored separately |
| Citations | Each factual claim has a source that supports it directly |
| Refusals | The app refuses questions outside the corpus without inventing an answer |
| Regression | The suite runs before prompt, retrieval, model, SDK, or corpus changes ship |
| Failure taxonomy | Failures are labeled so the team knows where to fix them |
| Human review | Critical cases have human-reviewed expected answers and source mappings |
| Release gate | The team has written thresholds for blocking release |

This is cheaper than discovering, through users, that your support bot quotes old pricing or your internal assistant cites a policy that doesn't say what the answer claims.

Final take

RAG evaluation is mostly about discipline. Keep retrieval, grounding, citations, and regressions separate long enough to see which part broke.

A small, reviewed test set with clear failure labels will beat a giant spreadsheet of vague scores. Start there. Then add automation where it saves time without hiding the mistakes.

Sources

  • OpenAI Evals guide
  • OpenAI Cookbook: getting started with OpenAI Evals
  • LangSmith: evaluate a RAG application
  • Ragas metrics documentation
  • LlamaIndex evaluating guide
  • Vercel AI SDK testing documentation

Dian Rijal Asyrof

Writes about useful AI tools, programming practice, and the craft of building reliable software.
