Agent Evaluation Framework

You would not hire an employee without an interview. Do not deploy an agent without an exam.

Evaluation is the discipline that decides whether an agent is good enough to ship and whether the next change made it better or worse. Skip it and you are guessing. Do it well and you can change models, prompts, and tools without holding your breath.

This page extends the Golden Q&A Set concept introduced in the Hallucination Prevention Protocol into a full evaluation discipline for agents.

The two questions evaluation must answer

  1. Is this agent good enough to ship? (absolute quality)
  2. Did the latest change make it better or worse? (regression detection)

A good eval suite answers both. A weak eval suite — vibes-checking three examples in a chat window — answers neither.

1. Build the golden set

The golden set is a collection of representative inputs paired with known-good outputs.

| Property | What good looks like |
| --- | --- |
| Size | 50 minimum, 100–300 typical, 1000+ for high-stakes agents. |
| Coverage | Every documented use case + every known failure mode. |
| Representativeness | Drawn from real production traffic where possible. |
| Stability | Inputs and expected outputs are version-controlled. |
| Authority | Each item has a named subject-matter expert who signed off. |

How to seed it:

  • Start with the top 20 questions or tasks the agent will get.
  • Add the top 10 things the agent must refuse to do.
  • Add the top 10 ambiguous cases where you want a clarifying question, not an answer.
  • Add every bug your pilot users find — see "Lock in regressions" below.
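
One way to keep the seeding targets honest is to store each golden item as a small typed record and count what is still missing. The sketch below is illustrative only: the `GoldenItem` fields, the category labels, and the `seeding_gaps` helper are assumptions, not part of any established schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    item_id: str
    category: str      # hypothetical labels: "task", "must_refuse", "clarify", "regression"
    input_text: str
    expected: str      # known-good output, or the expected behavior for a refusal
    approved_by: str   # the named subject-matter expert who signed off


# Seed targets from the list above; regression items accumulate over time.
SEED_TARGETS = {"task": 20, "must_refuse": 10, "clarify": 10}


def seeding_gaps(items: list[GoldenItem]) -> dict[str, int]:
    """Return how many items each category still needs to reach its seed target."""
    counts = Counter(item.category for item in items)
    return {
        category: target - counts[category]
        for category, target in SEED_TARGETS.items()
        if counts[category] < target
    }
```

Keeping the records in a version-controlled file (JSON or YAML, for example) would cover the stability property in the table above.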

2. Define a scoring rubric

Pick a rubric that matches the work. Common shapes:

| Rubric | Use when | Mechanic |
| --- | --- | --- |
| Pass / fail | The right answer is unambiguous (extracted SQL, structured field, classification). | Exact or normalized match. Easy to automate. |
| Pass / fail with reason | Same as above, but you want failure-mode tracking. | Tag each failure (hallucinated, wrong tool, over-edited, etc.). |
| Rubric scoring | The output is prose. | 1–5 across dimensions (faithfulness, completeness, tone, format). |
| LLM-as-judge | Volume is high, prose-heavy. | A separate prompted model scores against the rubric. Calibrate against human scores on a sample, or you are scoring noise. |
| Tool-trace check | Agent behavior matters more than text. | Did the agent call the right tools, in the right order, with the right args? |
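
The two most mechanical rubrics, normalized match and tool-trace check, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the normalization rules (trim, collapse whitespace, lowercase) are one possible choice, and the `ToolCall` record is a hypothetical stand-in for whatever trace format your agent framework emits.

```python
import re
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Trim, collapse whitespace, and lowercase so trivial formatting differences do not fail an item."""
    # Assumption: case does not matter for this output type. Drop .lower() if it does.
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_match(actual: str, expected: str) -> bool:
    """Pass / fail rubric: normalized string equality."""
    return normalize(actual) == normalize(expected)


@dataclass
class ToolCall:
    name: str
    args: dict


def passes_tool_trace(actual: list[ToolCall], expected: list[ToolCall]) -> bool:
    """Tool-trace rubric: same tools, in the same order, with the same args."""
    return actual == expected
```

A strict equality check on the trace is deliberately unforgiving. If some calls may legitimately appear in any order, the comparison would need to be relaxed for those items.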

Document the rubric. Two reviewers using the same rubric on the same output should agree most of the time. If they don't, the rubric is broken.
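
Reviewer agreement can be measured rather than guessed. The sketch below computes raw percent agreement and, as one common chance-corrected alternative, Cohen's kappa. Neither measure is prescribed by this page, and the function names are illustrative.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two reviewers gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Probability that both reviewers pick the same label by chance.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)
```

The same two functions can compare an LLM judge against human scores on a sample, which is the calibration step the table above calls for.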

3. Set the ship gate

The ship gate is the threshold the agent must hit on the golden set before it can be released.

| Risk class | Typical threshold |
| --- | --- |
| Internal drafting, summarization | 90% pass |
| Customer-facing answers | 95% pass |
| Data extraction for audits, regulated workflows | 99%+ pass, plus zero hallucinations |
| Anything that mutates a system of record | 99%+ pass, plus 100% on the "must refuse" subset |

The threshold goes in writing. Below it, the change does not ship — including the very first release.
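
Putting the threshold "in writing" can mean putting it in code. The following is a sketch, assuming the typical thresholds from the table above; the risk-class keys, the `Gate` record, and the `may_ship` signature are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    min_pass_rate: float
    max_hallucinations: Optional[int] = None   # None means not enforced
    must_refuse_rate: Optional[float] = None   # None means not enforced


# Keys are illustrative; values mirror the typical thresholds in the table.
GATES = {
    "internal": Gate(0.90),
    "customer_facing": Gate(0.95),
    "regulated": Gate(0.99, max_hallucinations=0),
    "mutates_record": Gate(0.99, must_refuse_rate=1.0),
}


def may_ship(risk_class: str, passed: int, total: int, hallucinations: int = 0,
             refuse_passed: int = 0, refuse_total: int = 0) -> bool:
    """Return True only if every condition of the risk class's gate is met."""
    gate = GATES[risk_class]
    if total == 0 or passed / total < gate.min_pass_rate:
        return False
    if gate.max_hallucinations is not None and hallucinations > gate.max_hallucinations:
        return False
    if gate.must_refuse_rate is not None:
        # An empty "must refuse" subset cannot satisfy the gate.
        if refuse_total == 0 or refuse_passed / refuse_total < gate.must_refuse_rate:
            return False
    return True
```

For example, `may_ship("customer_facing", 94, 100)` returns `False`: 94% is below the 95% gate, so the change does not ship.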

4. Run evals as a gate, not a one-time event

Wire the eval suite into CI, the same pipeline every other code change goes through.

Every change runs the eval:

  • Prompt edits.
  • Model upgrades (GPT-4 → GPT-5, Claude version bumps, switching providers).
  • Tool changes (a new MCP server, a renamed parameter, a swapped retrieval store).
  • Data refreshes (the underlying knowledge base changed).
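
A gate in a pipeline usually comes down to an exit code. The sketch below assumes the agent can be called as a plain function from input text to output text and that items are scored by exact match after trimming; `run_suite` and `ci_exit_code` are illustrative names, not an existing tool.

```python
from typing import Callable


def run_suite(agent: Callable[[str], str], golden: list[tuple[str, str]]) -> float:
    """Run every (input, expected) pair through the agent and return the pass rate."""
    if not golden:
        raise ValueError("golden set is empty")
    passed = sum(agent(question).strip() == expected.strip() for question, expected in golden)
    return passed / len(golden)


def ci_exit_code(pass_rate: float, threshold: float) -> int:
    """0 lets the pipeline continue; a non-zero code blocks the change."""
    return 0 if pass_rate >= threshold else 1
```

A pipeline step would then end with something like `sys.exit(ci_exit_code(rate, threshold))`, so a prompt edit, model upgrade, tool change, or data refresh that drops below the gate fails the build like any other broken test.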

5. Lock in regressions

Every production bug becomes a permanent eval item.

When a user reports a problem:

  1. Reproduce it.
  2. Add the failing input and the correct expected output to the golden set.
  3. Verify the new item fails on the current agent.
  4. Patch the agent.
  5. Verify the new item passes — and that nothing that previously passed now fails.

This is how the suite gets sharper over time instead of going stale.
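
Steps 3 and 5 can be checked mechanically by comparing per-item results before and after the patch. This is a minimal sketch, assuming each run is stored as a mapping from item ID to pass/fail; the function names and the example IDs are hypothetical.

```python
def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """IDs that passed on the previous version but fail (or are missing) now."""
    return sorted(
        item_id for item_id, ok in before.items()
        if ok and not after.get(item_id, False)
    )


def fix_is_verified(new_item_id: str, before: dict[str, bool], after: dict[str, bool]) -> bool:
    """The new item failed before the patch, passes after it, and nothing regressed."""
    return (
        before.get(new_item_id) is False
        and after.get(new_item_id) is True
        and not regressions(before, after)
    )
```

Requiring the new item to fail on the unpatched agent guards against adding a test that would have passed anyway and therefore proves nothing about the fix.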

6. A/B prompts and online evaluation

Offline eval on the golden set tells you the agent is good enough to ship. Online evaluation tells you it is actually working in production.

| Technique | What it does |
| --- | --- |
| Side-by-side prompts | Run prompt A and prompt B against the same input; humans pick the winner on a sample. |
| Shadow mode | New version runs in parallel without serving the user; outputs are compared. |
| Canary | New version serves a small percentage of traffic; key metrics are watched. |
| Production sampling | Random sample of real conversations is rated weekly against the rubric. |
| User feedback | Thumbs up / down with a free-text reason; feed downvotes back into the golden set. |
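
For the canary row, one common approach (an assumption here, not something this page prescribes) is to assign users to the new version by hashing a stable ID, so the same user always sees the same version and the canary share can be widened without reshuffling anyone.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary based on a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket in 0..99
    return bucket < percent
```

Because each user's bucket is fixed, anyone in a 5% canary is still in it when the rollout grows to 20%, which keeps their experience consistent while the key metrics are watched.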

7. What to track

| Metric | What it tells you |
| --- | --- |
| Eval pass rate (overall + by category) | Is the agent good enough? |
| Pass-rate trend | Are we improving or drifting? |
| Failure mode breakdown | Where does it break: hallucination, wrong tool, format, tone? |
| Time per eval item | Cost of running the suite; flag slow items. |
| Human–judge agreement (when using LLM-as-judge) | Is the auto-score trustworthy? |
| Human-in-the-loop (HITL) approval rate (production) | The honest measure of trust. |
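
The first and third metrics fall out of the per-item results with a couple of counters. The sketch below assumes each result carries a category and, for failures, an optional failure-mode tag; the `Result` record and the tag names are illustrative.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    category: str
    passed: bool
    failure_mode: Optional[str] = None   # e.g. "hallucination", "wrong_tool", "format", "tone"


def pass_rate_by_category(results: list[Result]) -> dict[str, float]:
    """Pass rate for each category that appears in the results."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.category] += 1
        passes[r.category] += r.passed   # True counts as 1
    return {category: passes[category] / totals[category] for category in totals}


def failure_breakdown(results: list[Result]) -> Counter:
    """Count failures by tag; failures without a tag are reported as 'untagged'."""
    return Counter(r.failure_mode or "untagged" for r in results if not r.passed)
```

Storing these two outputs per run is enough to plot the pass-rate trend and to see whether a change shifted failures from one mode to another.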

8. Common evaluation mistakes

  • Vibes-only. Three chat examples is not an eval; it's an anecdote.
  • Static set. A golden set that never gets new items rots.
  • No "must refuse" cases. You will only learn about over-eager behavior in production.
  • LLM-as-judge without calibration. You are now grading noise.
  • Evaluating only happy paths. Edge cases live in production whether you test them or not.
  • No rollback. If the eval drops, you need a one-toggle path back to the previous version.

Evaluation launch checklist

  • Golden set built and version-controlled.
  • Rubric documented; two reviewers agree most of the time.
  • Ship-gate threshold written down and approved.
  • Eval runs in CI on every prompt, model, or tool change.
  • Failure modes are categorized and tracked.
  • Production bugs are converted to permanent eval items.
  • Online evaluation (sampling, feedback, or shadow) is wired in.
  • Rollback path is verified.
