2026-04-12 · 5 min read · #multi-agent #evaluation

Notes from Building Multi-Agent Eval Harnesses

Evaluating a single LLM call is hard. Evaluating a system where three or four agents call each other, retrieve documents, write to a scratchpad, and finally produce an answer is hard in different ways. Here's what I've learned from building eval harnesses for the agentic side of Trend Radar.

Score the trace, not just the output

If you only score the final answer, you can't tell whether a wrong answer came from a bad retrieval, a bad reasoning step, or a final-mile summarization slip. Every fix has to start from the trace.

Concretely: log every agent step as a structured event. Then run eval functions on each step in isolation. Per-step accuracy is more diagnostic than per-output accuracy.
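To make that concrete, here is a minimal sketch of step-level logging and scoring. Everything in it is an assumption for illustration, not the actual Trend Radar schema: the `StepEvent` fields, the `kind` values, and the two example checks are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StepEvent:
    """One agent step in a trace. Field names are illustrative assumptions."""
    trace_id: str
    step_index: int
    agent: str
    kind: str  # e.g. "retrieval", "tool_call", "summary"
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)


# A step check looks at a single event and returns True (pass) or False (fail).
StepCheck = Callable[[StepEvent], bool]


def retrieval_nonempty(event: StepEvent) -> bool:
    """Pass if the retrieval step returned at least one document."""
    return len(event.outputs.get("documents", [])) > 0


def summary_has_citation(event: StepEvent) -> bool:
    """Crude placeholder: pass if the summary text contains a '[' marker."""
    return "[" in event.outputs.get("text", "")


# Which checks apply to which kind of step (hypothetical mapping).
CHECKS_BY_KIND: dict[str, list[StepCheck]] = {
    "retrieval": [retrieval_nonempty],
    "summary": [summary_has_citation],
}


def score_trace(events: list[StepEvent]) -> list[tuple[int, str, bool]]:
    """Run every applicable check on every step, each step in isolation."""
    results = []
    for event in events:
        for check in CHECKS_BY_KIND.get(event.kind, []):
            results.append((event.step_index, check.__name__, check(event)))
    return results
```

Because each check sees only one event, a failing row points at a specific step and agent rather than at the trace as a whole.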

Build the failure taxonomy first

Before you write a single eval function, sit down with 30 real failure cases and write a one-line label for each. The labels become your taxonomy. Common ones I see:

Retrieved nothing useful
Retrieved the right thing, used the wrong part
Hallucinated a tool argument
Looped between agents without progress
Refused a valid question

Each label becomes a metric. "Loop rate" is a number you can stare at. "Hallucinated tool args per 1000 calls" is a number you can drive down.
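Turning labels into numbers can be as simple as counting. The sketch below assumes each failed run has been tagged with exactly one taxonomy label; the function names and label strings are hypothetical.

```python
from collections import Counter


def failure_rates(labels: list[str], total_runs: int) -> dict[str, float]:
    """Fraction of all runs that ended in each failure label."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    counts = Counter(labels)
    return {label: n / total_runs for label, n in counts.items()}


def per_thousand(failures: int, total_calls: int) -> float:
    """Normalize a failure count to 'per 1000 calls'."""
    if total_calls == 0:
        return 0.0
    return 1000 * failures / total_calls


# Example: 3 labeled failures out of 10 runs.
rates = failure_rates(["loop", "loop", "refused_valid_question"], total_runs=10)
# rates["loop"] is the loop rate for this batch of runs.
```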

Separate deterministic and judged checks

Some failures are checkable in code: tool argument schema, JSON parseability, citation presence. Other failures need an LLM judge or a human: "did the agent's summary actually capture the document?"

Run the deterministic checks on every example, every run. Run the judged checks on a sampled slice. The combination is fast enough to put in CI and rigorous enough to catch the things that matter.
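One way to wire the two tiers together is sketched below. It assumes each example is a dict with `id` and `raw_output` keys and that `judge` is any callable you supply (an LLM judge or a human-review queue); those names are hypothetical. The sample is chosen by hashing the example id, which is one possible design: it keeps the judged slice stable from run to run, so score changes reflect the system rather than the sample.

```python
import hashlib
import json
from typing import Any, Callable


def is_valid_json(text: str) -> bool:
    """Deterministic check: does the raw output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def has_required_args(args: dict[str, Any], required: set[str]) -> bool:
    """Deterministic check: are all required tool arguments present?"""
    return required.issubset(args)


def in_judged_sample(example_id: str, rate: float) -> bool:
    """Map the id to a stable bucket in [0, 1) and compare against the rate."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def run_checks(
    examples: list[dict[str, Any]],
    judge: Callable[[dict[str, Any]], Any],
    rate: float = 0.1,
) -> list[dict[str, Any]]:
    """Cheap checks on every example; the expensive judge only on the sample."""
    report = []
    for ex in examples:
        row = {"id": ex["id"], "json_ok": is_valid_json(ex["raw_output"]), "judged": None}
        if in_judged_sample(ex["id"], rate):
            row["judged"] = judge(ex)
        report.append(row)
    return report
```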

Cache aggressively

Multi-agent traces are expensive. Reproducing one costs you another full chain of calls. Cache every LLM call by a hash of (model, prompt, params) so that re-running the eval suite mostly hits the cache and only the steps you changed trigger new calls.
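A minimal sketch of such a cache, assuming `params` is JSON-serializable and that `call_model` is whatever function actually talks to your provider (a hypothetical stand-in here). The store is an in-memory dict to keep the example short; a real harness would probably want something persistent on disk.

```python
import hashlib
import json
from typing import Any, Callable


def cache_key(model: str, prompt: str, params: dict[str, Any]) -> str:
    """Stable hash of (model, prompt, params).

    sort_keys=True makes the key independent of dict ordering.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedClient:
    """Wraps a model-calling function with a lookup keyed by cache_key."""

    def __init__(self, call_model: Callable[[str, str, dict[str, Any]], str]):
        self._call_model = call_model  # hypothetical provider call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str, params: dict[str, Any]) -> str:
        key = cache_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt, params)
        self._store[key] = result
        return result
```

Tracking hits and misses is optional, but it gives a quick read on how much of a re-run was actually recomputed.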

The output you actually want

The eval harness should produce a markdown report you can paste into a PR description: per-metric score, delta vs the previous run, and a sample of 3 representative failures. If a reviewer can't see what changed in 30 seconds, the harness is just a number generator.
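As a rough sketch, the report can be a plain function from metrics to a markdown string. The function name, the table layout, and the three-decimal formatting are assumptions; picking which failures count as representative is left to the caller, and this function only truncates the list it is given.

```python
def render_report(
    current: dict[str, float],
    previous: dict[str, float],
    failures: list[str],
    max_failures: int = 3,
) -> str:
    """Render per-metric score, delta vs the previous run, and sample failures."""
    lines = ["| Metric | Score | Delta |", "| --- | --- | --- |"]
    for name in sorted(current):
        score = current[name]
        if name in previous:
            # Explicit sign so regressions and improvements are easy to spot.
            delta = f"{score - previous[name]:+.3f}"
        else:
            delta = "n/a"  # metric did not exist in the previous run
        lines.append(f"| {name} | {score:.3f} | {delta} |")
    lines.append("")
    lines.append("Sample failures:")
    for failure in failures[:max_failures]:
        lines.append(f"- {failure}")
    return "\n".join(lines)
```

The caller passes in the previous run's metrics, so the harness has to persist them somewhere between runs; how it does that is outside this sketch.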

Closing thought

Most teams treat multi-agent eval as "we'll figure it out when it works." The system never works without the eval. The harness is the system. Build it on day one, even if it only checks three things.
