Agent Evaluation Framework

You would not hire an employee without an interview. Do not deploy an agent without an exam.

Evaluation is the discipline that decides whether an agent is good enough to ship and whether the next change made it better or worse. Skip it and you are guessing. Do it well and you can change models, prompts, and tools without holding your breath.

This page extends the Golden Q&A Set concept introduced in the Hallucination Prevention Protocol into a full evaluation discipline for agents.

The two questions evaluation must answer

Is this agent good enough to ship? (absolute quality)
Did the latest change make it better or worse? (regression detection)

A good eval suite answers both. A weak eval suite — vibes-checking three examples in a chat window — answers neither.

1. Build the golden set

The golden set is a collection of representative inputs paired with known-good outputs.

Property	What good looks like
Size	50 minimum, 100–300 typical, 1000+ for high-stakes agents.
Coverage	Every documented use case + every known failure mode.
Representativeness	Drawn from real production traffic where possible.
Stability	Inputs and expected outputs are version-controlled.
Authority	Each item has a named subject-matter expert who signed off.

How to seed it:

Start with the top 20 questions or tasks the agent will get.
Add the top 10 things the agent must refuse to do.
Add the top 10 ambiguous cases where you want a clarifying question, not an answer.
Add every bug your pilot users find — see "Lock in regressions" below.
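
One way to keep the golden set version-controlled is a plain JSON Lines file with one item per line. The sketch below is illustrative only: the `GoldenItem` fields, the category labels, and the sample rows are assumptions made for this example, not a format this framework prescribes.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    item_id: str
    category: str      # hypothetical labels: "answer", "must_refuse", "clarify", "regression"
    input_text: str
    expected: str
    approved_by: str   # the named subject-matter expert who signed off
    source: str = "seed"  # e.g. "seed", "production", "pilot_bug"


def load_golden_set(jsonl_text: str) -> list[GoldenItem]:
    """Parse one JSON object per non-empty line and reject duplicate ids."""
    items = [GoldenItem(**json.loads(line))
             for line in jsonl_text.splitlines() if line.strip()]
    ids = [item.item_id for item in items]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate item_id in golden set")
    return items


def coverage(items: list[GoldenItem]) -> dict[str, int]:
    """Count items per category to spot gaps (for example, no refusal cases)."""
    return dict(Counter(item.category for item in items))


# Made-up sample rows, for illustration only.
SAMPLE = """
{"item_id": "q-001", "category": "answer", "input_text": "What is the refund window?", "expected": "30 days", "approved_by": "j.doe"}
{"item_id": "r-001", "category": "must_refuse", "input_text": "Delete all customer records.", "expected": "REFUSE", "approved_by": "j.doe"}
{"item_id": "c-001", "category": "clarify", "input_text": "Update the account.", "expected": "ASK_CLARIFYING_QUESTION", "approved_by": "a.lee", "source": "pilot_bug"}
"""

golden = load_golden_set(SAMPLE)
print(coverage(golden))
```

Because each item is a single line of text, a diff in code review shows exactly which inputs or expected outputs changed and who approved them.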

2. Define a scoring rubric

Pick a rubric that matches the work. Common shapes:

Rubric	Use when	Mechanic
Pass / fail	The right answer is unambiguous (extracted SQL, structured field, classification).	Exact or normalized match. Easy to automate.
Pass / fail with reason	Same as above, but you want failure-mode tracking.	Tag each failure (`hallucinated`, `wrong tool`, `over-edited`, etc.).
Rubric scoring	The output is prose.	1–5 across dimensions (faithfulness, completeness, tone, format).
LLM-as-judge	Volume is high, prose-heavy.	A separate prompted model scores against the rubric. Calibrate against human scores on a sample, or you are scoring noise.
Tool-trace check	Agent behavior matters more than text.	Did the agent call the right tools, in the right order, with the right args?
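
The two most automatable rubrics in the table, normalized match and the tool-trace check, might look like the sketch below. The normalization rules, the trace shape (a list of `(tool_name, args)` pairs), and the tool names in the usage example are assumptions; the failure tag `wrong tool` comes from the table above, while `wrong order` and `wrong args` are added here for illustration.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace, drop trailing ';' or '.', and lowercase."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed.rstrip(";.").strip().lower()


def score_exact(actual: str, expected: str) -> bool:
    """Pass / fail on a normalized match."""
    return normalize(actual) == normalize(expected)


def check_tool_trace(actual, expected):
    """Compare two traces, each a list of (tool_name, args_dict) pairs.

    Returns (passed, failure_tag); failure_tag is None on a pass.
    """
    actual_names = [name for name, _ in actual]
    expected_names = [name for name, _ in expected]
    if sorted(actual_names) != sorted(expected_names):
        return False, "wrong tool"
    if actual_names != expected_names:
        return False, "wrong order"
    for (name, got), (_, want) in zip(actual, expected):
        if got != want:
            return False, f"wrong args: {name}"
    return True, None


# Hypothetical tools, for illustration only.
expected_trace = [
    ("search_orders", {"customer_id": "42"}),
    ("issue_refund", {"order_id": "A1"}),
]
print(score_exact("SELECT  id FROM users;", "select id from users"))
print(check_tool_trace(list(reversed(expected_trace)), expected_trace))
```

Returning a tag alongside the verdict gives the "pass / fail with reason" rubric for free: the tags can be counted later to see where the agent breaks.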

Document the rubric. Two reviewers using the same rubric on the same output should agree most of the time. If they don't, the rubric is broken.
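
"Agree most of the time" can be measured. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) for two reviewers who labeled the same outputs. The labels are made up, and what kappa value counts as acceptable is a judgment call this page does not specify.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two reviewers gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: both reviewers pick the same label independently.
    p_e = sum(counts_a[label] * counts_b[label]
              for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels from two reviewers on the same eight outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
reviewer_2 = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

print(percent_agreement(reviewer_1, reviewer_2))
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

The same two functions can compare an LLM judge's labels against human labels on a calibration sample.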

3. Set the ship gate

The ship gate is the threshold the agent must hit on the golden set before it can be released.

Risk class	Typical threshold
Internal drafting, summarization	90% pass
Customer-facing answers	95% pass
Data extraction for audits, regulated workflows	99%+ pass, plus zero hallucinations
Anything that mutates a system of record	99%+ pass, plus 100% on the "must refuse" subset

The threshold goes in writing. Below it, the change does not ship — including the very first release.
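
Putting the threshold "in writing" can mean putting it in code. The following sketch encodes the table above as data and returns a ship / no-ship decision with reasons. The risk-class keys, the result-record shape, and the reading of "99%+" as a pass rate of at least 0.99 are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    min_pass_rate: float
    min_refuse_pass_rate: Optional[float] = None  # applies to "must_refuse" items
    max_hallucinations: Optional[int] = None


# Typical thresholds from the table above; the key names are hypothetical.
GATES = {
    "internal": Gate(0.90),
    "customer_facing": Gate(0.95),
    "regulated": Gate(0.99, max_hallucinations=0),
    "system_of_record": Gate(0.99, min_refuse_pass_rate=1.0),
}


def ship_decision(risk_class, results):
    """Return (ok, reasons). Each result is a dict with keys
    "category", "passed", and optionally "failure_tag"."""
    gate = GATES[risk_class]
    if not results:
        return False, ["no eval results"]
    reasons = []
    rate = sum(1 for r in results if r["passed"]) / len(results)
    if rate < gate.min_pass_rate:
        reasons.append(f"pass rate {rate:.1%} is below {gate.min_pass_rate:.0%}")
    if gate.max_hallucinations is not None:
        count = sum(1 for r in results if r.get("failure_tag") == "hallucinated")
        if count > gate.max_hallucinations:
            reasons.append(f"{count} hallucination(s), limit is {gate.max_hallucinations}")
    if gate.min_refuse_pass_rate is not None:
        refuse = [r for r in results if r["category"] == "must_refuse"]
        if not refuse:
            reasons.append("golden set has no must_refuse items")
        elif sum(1 for r in refuse if r["passed"]) / len(refuse) < gate.min_refuse_pass_rate:
            reasons.append("must_refuse subset is below its required pass rate")
    return not reasons, reasons


# Made-up run: 99 passes and one hallucination.
run = [{"category": "answer", "passed": True, "failure_tag": None}] * 99
run = run + [{"category": "answer", "passed": False, "failure_tag": "hallucinated"}]
print(ship_decision("customer_facing", run))
print(ship_decision("regulated", run))
```

The same run clears the customer-facing gate but not the regulated one, because a single hallucination is disqualifying there regardless of the pass rate.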

4. Run evals as a gate, not a one-time event

Wire the eval suite into the same CI pipeline every other code change goes through.

Each of the following changes triggers an eval run:

Prompt edits.
Model upgrades (GPT-4 → GPT-5, Claude version bumps, switching providers).
Tool changes (a new MCP server, a renamed parameter, a swapped retrieval store).
Data refreshes (the underlying knowledge base changed).
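
Prompt, model, tool, and data changes do not always show up as ordinary code diffs, so one option is to fingerprint everything that defines the agent's behavior and require a fresh eval run whenever the fingerprint changes. This is a sketch under assumed inputs: the function names, the four fields, and the sample values are hypothetical.

```python
import hashlib
import json


def agent_fingerprint(prompt: str, model: str, tool_schemas: dict, kb_version: str) -> str:
    """Stable SHA-256 hash over everything that defines agent behavior."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "tools": tool_schemas, "kb_version": kb_version},
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def eval_required(current: str, last_evaluated: str) -> bool:
    """A fresh eval run is needed whenever the fingerprint has changed."""
    return current != last_evaluated


# Hypothetical configuration values, for illustration only.
baseline = agent_fingerprint(
    prompt="You are a support agent.",
    model="model-a",
    tool_schemas={"search_orders": {"customer_id": "string"}},
    kb_version="2024-01",
)
renamed_param = agent_fingerprint(
    prompt="You are a support agent.",
    model="model-a",
    tool_schemas={"search_orders": {"customer": "string"}},  # parameter renamed
    kb_version="2024-01",
)
print(eval_required(renamed_param, baseline))
```

Storing the fingerprint next to each eval result also records exactly which configuration a given pass rate belongs to.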

5. Lock in regressions

Every production bug becomes a permanent eval item.

When a user reports a problem:

Reproduce it.
Add the failing input and the correct expected output to the golden set.
Verify the new item fails on the current agent.
Patch the agent.
Verify the new item passes — and that nothing that previously passed now fails.

This is how the suite gets sharper over time instead of going stale.
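
The verification steps above can be sketched as a comparison of two eval runs, one before the patch and one after. The run format (a mapping from item id to pass / fail) and the item ids are assumptions made for this example.

```python
def diff_runs(before: dict, after: dict) -> dict:
    """Compare two runs, each mapping item_id -> passed (bool)."""
    return {
        # Passed before, but fails or is missing after the change.
        "newly_failing": sorted(i for i, ok in before.items()
                                if ok and not after.get(i, False)),
        # Passes after, but failed or was missing before the change.
        "newly_passing": sorted(i for i, ok in after.items()
                                if ok and not before.get(i, False)),
    }


def fix_is_verified(before: dict, after: dict, new_item_id: str) -> bool:
    """The new item must fail before the patch, pass after it,
    and nothing that previously passed may now fail."""
    return (
        before.get(new_item_id) is False
        and after.get(new_item_id) is True
        and not diff_runs(before, after)["newly_failing"]
    )


# Made-up runs: "bug-017" is the newly added regression item.
before_patch = {"q-001": True, "q-002": True, "bug-017": False}
after_patch = {"q-001": True, "q-002": True, "bug-017": True}
side_effect = {"q-001": True, "q-002": False, "bug-017": True}

print(fix_is_verified(before_patch, after_patch, "bug-017"))
print(diff_runs(before_patch, side_effect))
```

Requiring the new item to fail before the patch guards against adding a test that would have passed anyway and therefore proves nothing.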

6. A/B prompts and online evaluation

Offline eval on the golden set tells you the agent is good enough to ship. Online evaluation tells you it is actually working in production.

Technique	What it does
Side-by-side prompts	Run prompt A and prompt B against the same input; humans pick the winner on a sample.
Shadow mode	New version runs in parallel without serving the user; outputs are compared.
Canary	New version serves a small percentage of traffic; key metrics are watched.
Production sampling	Random sample of real conversations is rated weekly against the rubric.
User feedback	Thumbs up / down with a free-text reason; feed downvotes back into the golden set.
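
Canary routing and production sampling both come down to choosing a subset of traffic in a reproducible way. The sketch below assumes string user and conversation ids; the salt, the percentage, and the function names are hypothetical.

```python
import hashlib
import random


def in_canary(user_id: str, percent: float, salt: str = "canary-v1") -> bool:
    """Deterministically place roughly `percent`% of users in the canary.

    Hashing the id keeps a given user on the same version across requests.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100


def sample_for_review(conversation_ids, k: int, seed: int):
    """Pick up to k conversations for rubric rating; a fixed seed makes
    the sample reproducible."""
    ids = sorted(conversation_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


users = [f"user-{n}" for n in range(10_000)]
canary_share = sum(in_canary(u, percent=5) for u in users) / len(users)
print(f"canary share: {canary_share:.1%}")

weekly = sample_for_review([f"conv-{n}" for n in range(500)], k=20, seed=7)
print(len(weekly))
```

Changing the salt reshuffles which users land in the canary, which is useful when consecutive rollouts should not keep hitting the same people.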

7. What to track

Metric	What it tells you
Eval pass rate (overall + by category)	Is the agent good enough?
Pass-rate trend	Are we improving or drifting?
Failure mode breakdown	Where does it break — hallucination, wrong tool, format, tone?
Time per eval item	Cost of running the suite; flag slow items.
Human–judge agreement (when using LLM-as-judge)	Is the auto-score trustworthy?
Human-in-the-loop (HITL) approval rate (production)	How often human reviewers approve the agent's output; the honest measure of trust.
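
Several of the offline metrics in the table can be computed from one list of per-item results. The sketch below assumes each result is a dict with `item_id`, `category`, `passed`, `failure_tag`, and `seconds`; the field names, the 30-second slow-item cutoff, and the sample data are illustrative choices.

```python
from collections import Counter, defaultdict


def summarize(results, slow_seconds: float = 30.0) -> dict:
    """Pass rate overall and by category, failure-mode counts, slow items."""
    if not results:
        raise ValueError("no results to summarize")
    per_category = defaultdict(lambda: {"passed": 0, "total": 0})
    failure_modes = Counter()
    for r in results:
        bucket = per_category[r["category"]]
        bucket["total"] += 1
        if r["passed"]:
            bucket["passed"] += 1
        else:
            failure_modes[r.get("failure_tag") or "untagged"] += 1
    return {
        "overall_pass_rate":
            sum(b["passed"] for b in per_category.values()) / len(results),
        "pass_rate_by_category":
            {c: b["passed"] / b["total"] for c, b in per_category.items()},
        "failure_modes": dict(failure_modes),
        "slow_items":
            [r["item_id"] for r in results if r.get("seconds", 0.0) > slow_seconds],
    }


# Made-up results, for illustration only.
results = [
    {"item_id": "q-001", "category": "answer", "passed": True, "failure_tag": None, "seconds": 2.1},
    {"item_id": "q-002", "category": "answer", "passed": False, "failure_tag": "hallucinated", "seconds": 3.4},
    {"item_id": "r-001", "category": "must_refuse", "passed": True, "failure_tag": None, "seconds": 1.0},
    {"item_id": "t-001", "category": "tool_use", "passed": False, "failure_tag": "wrong tool", "seconds": 41.0},
]
report = summarize(results)
print(report)
```

Saving one such report per run, keyed by date or commit, gives the pass-rate trend without any extra tooling.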

8. Common evaluation mistakes

Vibes-only. Three chat examples are not an eval; they are an anecdote.
Static set. A golden set that never gets new items rots.
No "must refuse" cases. You will only learn about over-eager behavior in production.
LLM-as-judge without calibration. You are now grading noise.
Evaluating only happy paths. Edge cases live in production whether you test them or not.
No rollback. If the eval pass rate drops, you need a one-toggle path back to the previous version.

Evaluation launch checklist

Golden set built and version-controlled.
Rubric documented; two reviewers agree most of the time.
Ship-gate threshold written down and approved.
Eval runs in CI on every prompt, model, or tool change.
Failure modes are categorized and tracked.
Production bugs are converted to permanent eval items.
Online evaluation (sampling, feedback, or shadow) is wired in.
Rollback path is verified.

