The rough version of the lesson, since I’ll spend the rest of this post arguing for it: a tool-using agent that confidently calls the wrong tool is worse than one that says “I don’t know.” Most evaluation harnesses score for capability and silently penalize calibration. We had to build one that does the opposite.
The failure mode that almost shipped
Last spring we were tuning a release that scored higher than its predecessor on every benchmark we cared about: tool-call accuracy up 4 points, end-to-end task completion up 7. We almost shipped it. Two days before rollout, an internal user reported a quiet failure: when asked to “check the calendar”, the model called send_email with a hallucinated subject. It had never seen check_calendar in its tool list — but instead of saying so, it picked the closest-looking tool and pressed the button.
It turned out our benchmark only graded the cases where the right tool existed. Of course the model got better at picking among tools that included the right one. We weren’t scoring the much more common production case: the user asks for something the agent cannot do, and the right answer is to say so.
What we measure now
Three numbers, on every release:
- Capability. Among tasks where the correct tool exists in the schema, does the agent reach the right end state? Standard.
- Refusal calibration. Among tasks where no correct tool exists, does the agent decline cleanly instead of confabulating? We sample held-out user requests and remove relevant tools from the schema before scoring. The cost of a wrong call here is much higher than the cost of a false refusal.
- Recovery. When a tool call returns an error or unexpected payload, does the agent inspect it, replan, and try a different approach? Or does it loop, calling the same broken tool with minor argument tweaks? We seed deterministic “tool returned 502” injection at decode time.
Releases that ace 1 and regress on 2 don’t ship. We learned this the unfun way.
Prompting tricks that helped
Most of the calibration win came from training data, not prompting — but a few prompt-level changes were worth the diff:
- Show the tool schema as a list with explicit absences. Don’t imply that the listed tools are exhaustive; say it. “The tools available to you are exactly the following. If the user’s request cannot be satisfied with these, say so plainly.”
- Ask for an intent statement before the call. One sentence — “I’m going to use
search_docsbecause the user asked X.” It forces a decision the model can second-guess instead of bolting straight to a function call. - Surface the tool error verbatim, not a summary. Models recover faster from raw stack traces than from “tool failed.” The verbatim error gives the next decode step real information.
Things we tried that didn’t move the needle
Confidence thresholds (“only call a tool when log-probability of the call exceeds X”) were appealing but didn’t work. The model is overconfident on tool calls in roughly the same proportion as it is on text — thresholds either filtered out everything or nothing.
“Reflect before you act” chain-of-thought prompts helped a little on capability and slightly hurt calibration: the agent talked itself into calls it wouldn’t otherwise have made. We dropped them.
The broader point
Capability is exciting. Calibration is what makes a tool-using agent safe to point at production data. If you can only afford to measure one thing, measure refusal — and pay the model a real reward for saying “I don’t have a way to do that.”