Agent Evaluation Framework
You would not hire an employee without an interview. Do not deploy an agent without an exam.
Evaluation is the discipline that decides whether an agent is good enough to ship and whether the next change made it better or worse. Skip it and you are guessing. Do it well and you can change models, prompts, and tools without holding your breath.
This page extends the Golden Q&A Set concept introduced in the Hallucination Prevention Protocol into a full evaluation discipline for agents.
The two questions evaluation must answer
- Is this agent good enough to ship? (absolute quality)
- Did the latest change make it better or worse? (regression detection)
A good eval suite answers both. A weak eval suite — vibes-checking three examples in a chat window — answers neither.
1. Build the golden set
The golden set is a collection of representative inputs paired with known-good outputs.
| Property | What good looks like |
|---|---|
| Size | At least 50 items; 100–300 is typical; 1,000+ for high-stakes agents. |
| Coverage | Every documented use case + every known failure mode. |
| Representativeness | Drawn from real production traffic where possible. |
| Stability | Inputs and expected outputs are version-controlled. |
| Authority | Each item has a named subject-matter expert who signed off. |
How to seed it:
- Start with the top 20 questions or tasks the agent will get.
- Add the top 10 things the agent must refuse to do.
- Add the top 10 ambiguous cases where you want a clarifying question, not an answer.
- Add every bug your pilot users find — see "Lock in regressions" below.
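One way to keep the golden set stable and version-controlled is to store each item as a line of JSON in the repository. The sketch below is an assumption, not a prescribed schema: the `GoldenItem` fields, the category names, and the helper functions are all hypothetical and chosen to mirror the properties and seeding steps above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class GoldenItem:
    """One golden-set entry. All field names are illustrative assumptions."""
    item_id: str
    category: str      # e.g. "answer", "must_refuse", "clarify", "regression"
    input_text: str
    expected: str
    approved_by: str   # the named subject-matter expert who signed off
    tags: list = field(default_factory=list)


def dump_jsonl(items):
    """Serialize items one per line so diffs in version control stay readable."""
    return "\n".join(json.dumps(asdict(item), sort_keys=True) for item in items)


def load_jsonl(text):
    """Parse JSONL text back into GoldenItem objects, skipping blank lines."""
    return [GoldenItem(**json.loads(line)) for line in text.splitlines() if line.strip()]


def seed_counts(items):
    """Count items per category to check the seeding targets are met."""
    counts = {}
    for item in items:
        counts[item.category] = counts.get(item.category, 0) + 1
    return counts
```

Running `seed_counts` over the loaded set gives a quick check that the refusal and clarifying-question cases were not forgotten.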
2. Define a scoring rubric
Pick a rubric that matches the work. Common shapes:
| Rubric | Use when | Mechanic |
|---|---|---|
| Pass / fail | The right answer is unambiguous (extracted SQL, structured field, classification). | Exact or normalized match. Easy to automate. |
| Pass / fail with reason | Same as above, but you want failure-mode tracking. | Tag each failure (hallucinated, wrong tool, over-edited, etc.). |
| Rubric scoring | The output is prose. | 1–5 across dimensions (faithfulness, completeness, tone, format). |
| LLM-as-judge | Volume is high and the output is prose-heavy. | A separate prompted model scores against the rubric. Calibrate against human scores on a sample, or you are scoring noise. |
| Tool-trace check | Agent behavior matters more than text. | Did the agent call the right tools, in the right order, with the right args? |
Document the rubric. Two reviewers using the same rubric on the same output should agree most of the time. If they don't, the rubric is broken.
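The two most automatable rubrics in the table, normalized match and the tool-trace check, can be sketched as below. This is a minimal illustration under stated assumptions: the normalization rules (lowercase, collapse whitespace, drop a trailing semicolon or period) are one possible choice and would be wrong for outputs where case matters, and the trace format of `(tool_name, args)` pairs is hypothetical.

```python
import re


def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing ';' or '.' (an assumed rule set)."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(";.").strip()


def pass_fail(expected, actual, failure_tag="wrong_answer"):
    """Pass / fail with reason: the tag feeds failure-mode tracking."""
    if normalize(expected) == normalize(actual):
        return {"passed": True, "reason": None}
    return {"passed": False, "reason": failure_tag}


def tool_trace_matches(expected_calls, actual_calls):
    """Strict check: same tools, same order, same args.

    Each call is a (tool_name, args_dict) pair; this shape is an assumption.
    """
    return list(expected_calls) == list(actual_calls)
```

A strict ordered comparison is the simplest version of the trace check; agents that may legitimately call tools in more than one order need a looser comparison.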
3. Set the ship gate
The ship gate is the threshold the agent must hit on the golden set before it can be released.
| Risk class | Typical threshold |
|---|---|
| Internal drafting, summarization | 90% pass |
| Customer-facing answers | 95% pass |
| Data extraction for audits, regulated workflows | 99%+ pass, plus zero hallucinations |
| Anything that mutates a system of record | 99%+ pass, plus 100% on the "must refuse" subset |
The threshold goes in writing. Below it, the change does not ship — including the very first release.
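Putting the threshold in writing can also mean putting it in code. The sketch below encodes the table above as data and returns a ship or no-ship decision with the reasons. The risk-class keys, the `Gate` fields, and the result dictionary shape (`passed`, `category`, `reason`) are hypothetical names, and "99%+" is read here as a minimum of 0.99.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    min_pass_rate: float
    min_refusal_rate: float = 0.0              # required pass rate on the must-refuse subset
    max_hallucinations: Optional[int] = None   # None means "not checked"


# Thresholds taken from the table above; the key names are assumptions.
GATES = {
    "internal": Gate(0.90),
    "customer_facing": Gate(0.95),
    "regulated": Gate(0.99, max_hallucinations=0),
    "mutates_record": Gate(0.99, min_refusal_rate=1.0),
}


def ship_decision(risk_class, results):
    """Return (ok, blockers) for a list of per-item result dicts."""
    gate = GATES[risk_class]
    if not results:
        return False, ["no eval results"]
    blockers = []
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if pass_rate < gate.min_pass_rate:
        blockers.append(f"pass rate {pass_rate:.3f} below {gate.min_pass_rate}")
    if gate.min_refusal_rate > 0:
        refusals = [r for r in results if r["category"] == "must_refuse"]
        if not refusals:
            blockers.append("no must-refuse items in the golden set")
        else:
            rate = sum(1 for r in refusals if r["passed"]) / len(refusals)
            if rate < gate.min_refusal_rate:
                blockers.append(f"must-refuse rate {rate:.3f} below {gate.min_refusal_rate}")
    if gate.max_hallucinations is not None:
        count = sum(1 for r in results if r.get("reason") == "hallucinated")
        if count > gate.max_hallucinations:
            blockers.append(f"{count} hallucinations, limit {gate.max_hallucinations}")
    return (not blockers), blockers
```

Returning the list of blockers, rather than a bare boolean, makes the failed gate explain itself in the CI log.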
4. Run evals as a gate, not a one-time event
Wire the eval suite into the same CI pipeline that every other code change goes through.
Every change runs the eval:
- Prompt edits.
- Model upgrades (GPT-4 → GPT-5, Claude version bumps, switching providers).
- Tool changes (a new MCP server, a renamed parameter, a swapped retrieval store).
- Data refreshes (the underlying knowledge base changed).
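A CI job can decide whether a change needs an eval run by looking at which files it touched. The sketch below assumes a hypothetical repository layout in which prompts, model configuration, tool definitions, and the knowledge base each live in their own top-level directory; the directory names are illustrative and not taken from any real project.

```python
from pathlib import PurePosixPath

# Hypothetical top-level directories mapped to the change types listed above.
EVAL_TRIGGERS = {
    "prompts": "prompt edit",
    "model_config": "model upgrade",
    "tools": "tool change",
    "knowledge_base": "data refresh",
}


def eval_reasons(changed_paths):
    """Return the sorted change types that require an eval run (empty list if none)."""
    reasons = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in EVAL_TRIGGERS:
            reasons.add(EVAL_TRIGGERS[parts[0]])
    return sorted(reasons)
```

When in doubt, running the suite on every change is the safer default; path filtering is only a cost optimization.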
5. Lock in regressions
Every production bug becomes a permanent eval item.
When a user reports a problem:
- Reproduce it.
- Add the failing input and the correct expected output to the golden set.
- Verify the new item fails on the current agent.
- Patch the agent.
- Verify the new item passes — and that nothing that previously passed now fails.
This is how the suite gets sharper over time instead of going stale.
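The last three steps of that loop can be checked mechanically by comparing two eval runs, one before the patch and one after. The sketch below assumes each run is summarized as a mapping from item ID to pass or fail; the function names and the ID format are hypothetical.

```python
def find_regressions(before, after):
    """IDs that passed before the patch and do not pass after it."""
    return sorted(i for i, ok in before.items() if ok and not after.get(i, False))


def lock_in_bug(item_id, before, after):
    """Return a list of problems; an empty list means the bug is locked in cleanly."""
    problems = []
    if item_id not in before:
        problems.append("new item was not run against the unpatched agent")
    elif before[item_id]:
        problems.append("new item already passes before the patch; it does not reproduce the bug")
    if not after.get(item_id, False):
        problems.append("new item still fails after the patch")
    problems.extend(f"regression: {i}" for i in find_regressions(before, after))
    return problems
```

Checking that the new item fails first matters: an item that passes on the unpatched agent does not actually capture the bug.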
6. A/B prompts and online evaluation
Offline eval on the golden set tells you the agent is good enough to ship. Online evaluation tells you it is actually working in production.
| Technique | What it does |
|---|---|
| Side-by-side prompts | Run prompt A and prompt B against the same input; humans pick the winner on a sample. |
| Shadow mode | New version runs in parallel without serving the user; outputs are compared. |
| Canary | New version serves a small percentage of traffic; key metrics are watched. |
| Production sampling | Random sample of real conversations is rated weekly against the rubric. |
| User feedback | Thumbs up / down with a free-text reason; feed downvotes back into the golden set. |
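Two of the techniques in the table, canary and shadow mode, reduce to small routing decisions. The sketch below is one possible shape, not a reference implementation: it assigns users to the canary by hashing a user ID so the same user always lands in the same group, and it runs a candidate agent in shadow while serving only the current agent's output. The function names, the salt, and the agents-as-callables interface are assumptions.

```python
import hashlib


def in_canary(user_id, percent, salt="canary-v1"):
    """Deterministically place roughly `percent` percent of users in the canary group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def shadow_compare(inputs, current_agent, candidate_agent):
    """Serve the current agent's output; record where the candidate disagrees."""
    served, diffs = [], []
    for text in inputs:
        live = current_agent(text)
        shadow = candidate_agent(text)  # never shown to the user
        served.append(live)
        if live != shadow:
            diffs.append({"input": text, "live": live, "shadow": shadow})
    return served, diffs
```

Changing the salt reshuffles which users fall into the canary, which is useful when a new rollout should not reuse the previous cohort. Exact string comparison in `shadow_compare` only suits deterministic outputs; prose outputs would be scored against the rubric instead.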
7. What to track
| Metric | What it tells you |
|---|---|
| Eval pass rate (overall + by category) | Is the agent good enough? |
| Pass-rate trend | Are we improving or drifting? |
| Failure mode breakdown | Where does it break — hallucination, wrong tool, format, tone? |
| Time per eval item | Cost of running the suite; flag slow items. |
| Human–judge agreement (when using LLM-as-judge) | Is the auto-score trustworthy? |
| Human-in-the-loop (HITL) approval rate (production) | The honest measure of trust. |
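Several of these metrics fall out of the per-item results with a few lines of aggregation. The sketch below assumes each result is a dictionary with hypothetical `category`, `passed`, and `reason` keys, and that human and judge scores are parallel lists on the same 1–5 scale.

```python
from collections import Counter


def pass_rate_by_category(results):
    """Pass rate per category, e.g. to spot a weak must-refuse subset."""
    totals, passes = Counter(), Counter()
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {category: passes[category] / totals[category] for category in totals}


def failure_breakdown(results):
    """Count failures by tagged reason (hallucination, wrong tool, format, ...)."""
    return Counter(r["reason"] for r in results if not r["passed"])


def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Share of items where the judge is within `tolerance` points of the human score."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores))
    return hits / len(human_scores)
```

Raw agreement does not correct for agreement by chance; a chance-corrected statistic such as Cohen's kappa is a stricter alternative when scores cluster on one or two values.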
8. Common evaluation mistakes
- Vibes-only. Three chat examples are not an eval; they are an anecdote.
- Static set. A golden set that never gets new items rots.
- No "must refuse" cases. You will only learn about over-eager behavior in production.
- LLM-as-judge without calibration. You are now grading noise.
- Evaluating only happy paths. Edge cases live in production whether you test them or not.
- No rollback. If the eval pass rate drops, you need a one-toggle path back to the previous version.
Evaluation launch checklist
- Golden set built and version-controlled.
- Rubric documented; two reviewers agree most of the time.
- Ship-gate threshold written down and approved.
- Eval runs in CI on every prompt, model, or tool change.
- Failure modes are categorized and tracked.
- Production bugs are converted to permanent eval items.
- Online evaluation (sampling, feedback, or shadow) is wired in.
- Rollback path is verified.