Designing tool-use without losing the plot

October 03, 2025 · 9 min read · by Priya Iyer, Applied Research

The rough version of the lesson, since I’ll spend the rest of this post arguing for it: a tool-using agent that confidently calls the wrong tool is worse than one that says “I don’t know.” Most evaluation harnesses score for capability and silently penalize calibration. We had to build one that does the opposite.

The failure mode that almost shipped

Last spring we were tuning a release that scored higher than its predecessor on every benchmark we cared about: tool-call accuracy up 4 points, end-to-end task completion up 7. We almost shipped it. Two days before rollout, an internal user reported a quiet failure: when asked to “check the calendar”, the model called send_email with a hallucinated subject. It had never seen check_calendar in its tool list — but instead of saying so, it picked the closest-looking tool and pressed the button.

It turned out our benchmark only graded the cases where the right tool existed. Of course the model got better at picking among tools that included the right one. We weren’t scoring the much more common production case: the user asks for something the agent cannot do, and the right answer is to say so.

What we measure now

Three numbers, on every release:

  1. Capability. Among tasks where the correct tool exists in the schema, does the agent reach the right end state? Standard.
  2. Refusal calibration. Among tasks where no correct tool exists, does the agent decline cleanly instead of confabulating? We sample held-out user requests and remove relevant tools from the schema before scoring. The cost of a wrong call here is much higher than the cost of a false refusal.
  3. Recovery. When a tool call returns an error or an unexpected payload, does the agent inspect it, replan, and try a different approach? Or does it loop, calling the same broken tool with minor argument tweaks? We inject seeded, deterministic “tool returned 502” errors at decode time, so every release faces the same failures.
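The second and third measurements are mostly harness plumbing. Here is a minimal Python sketch of how that plumbing could look, assuming tools are plain dicts with a `name` key and an agent trace is a list of `(tool_name, arguments)` pairs. The names `ablate_schema`, `FlakyTool`, and `is_looping` are hypothetical, the fault is injected in a wrapper around the tool function (one possible wiring), and the loop check is deliberately crude:

```python
import hashlib
from typing import Any, Callable


def ablate_schema(schema: list[dict], relevant: set[str]) -> list[dict]:
    """Return a copy of the tool schema with the relevant tools removed."""
    return [tool for tool in schema if tool["name"] not in relevant]


class ToolError(Exception):
    """Raised in place of a real tool result to simulate a failing backend."""


class FlakyTool:
    """Wrap a tool so a seed-determined subset of calls fails with a 502."""

    def __init__(self, fn: Callable[..., Any], seed: str, fail_rate: float):
        self.fn = fn
        self.seed = seed
        self.fail_rate = fail_rate
        self.calls = 0

    def __call__(self, **kwargs: Any) -> Any:
        self.calls += 1
        # Hash (seed, call index) so a given seed always fails at the same calls.
        digest = hashlib.sha256(f"{self.seed}:{self.calls}".encode()).digest()
        if digest[0] / 256 < self.fail_rate:
            raise ToolError("tool returned 502")
        return self.fn(**kwargs)


def is_looping(calls: list[tuple[str, dict]], max_repeats: int = 3) -> bool:
    """Crude heuristic: the last `max_repeats` calls all hit the same tool."""
    if len(calls) < max_repeats:
        return False
    recent = calls[-max_repeats:]
    return len({name for name, _ in recent}) == 1
```

Hashing the seed together with the call index, rather than drawing from a shared random generator, keeps the failure pattern identical across releases even when the agent's other behavior changes. A real loop detector would likely also check that the repeated calls follow an error, since legitimate workflows such as pagination call one tool several times in a row.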

Releases that ace 1 and regress on 2 don’t ship. We learned this the unfun way.
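That rule is simple enough to encode as a gate. A sketch, assuming each metric is a fraction in [0, 1] and that “regress” means falling below the previous release by more than some tolerance. The `Scores` type, the `tolerance` parameter, and all the numbers are illustrative assumptions, not our actual thresholds or results:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    capability: float  # right end state when the correct tool exists
    refusal: float     # clean declines when no correct tool exists
    recovery: float    # successful replans after an injected tool error


def can_ship(candidate: Scores, baseline: Scores, tolerance: float = 0.0) -> bool:
    """Block any release whose refusal calibration regresses, whatever else improved."""
    return candidate.refusal >= baseline.refusal - tolerance


# Made-up numbers: capability rose, refusal calibration fell.
baseline = Scores(capability=0.80, refusal=0.90, recovery=0.70)
candidate = Scores(capability=0.84, refusal=0.85, recovery=0.70)
print(can_ship(candidate, baseline))  # False
```

The gate deliberately ignores capability: a gain there cannot buy back a calibration loss.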

Prompting tricks that helped

Most of the calibration win came from training data, not prompting — but a few prompt-level changes were worth the diff:

Things we tried that didn’t move the needle

Confidence thresholds (“only call a tool when the log-probability of the call exceeds X”) were appealing but didn’t work. The model is about as overconfident on tool calls as it is on text, so wrong calls scored nearly as high as right ones, and any given threshold filtered out either everything or nothing.
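For concreteness, a gate of this kind can be sketched as follows, assuming the decoder exposes per-token log-probabilities for the tokens of the proposed call. The function names and the numbers are invented for illustration; the only fixed fact is that a sequence's log-probability is the sum of its token log-probabilities:

```python
def call_logprob(token_logprobs: list[float]) -> float:
    """Log-probability of the whole call: the sum of its per-token values."""
    return sum(token_logprobs)


def should_call(token_logprobs: list[float], threshold: float) -> bool:
    """Emit the tool call only if its log-probability exceeds the threshold."""
    return call_logprob(token_logprobs) > threshold


def pass_rate(calls: list[list[float]], threshold: float) -> float:
    """Fraction of proposed calls that survive a given threshold."""
    return sum(should_call(c, threshold) for c in calls) / len(calls)


# Invented log-probabilities for three proposed calls, right and wrong alike.
# They cluster tightly, so a threshold passes all of them or none.
calls = [[-0.02, -0.01], [-0.03, -0.01], [-0.02, -0.02]]
print(pass_rate(calls, threshold=-0.05))  # 1.0
print(pass_rate(calls, threshold=-0.02))  # 0.0
```

When the scores of correct and incorrect calls overlap like this, no value of the threshold separates them, which is the all-or-nothing behavior described above. Summed log-probabilities also penalize longer calls, so a real gate would probably need some form of length normalization.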

“Reflect before you act” chain-of-thought prompts helped a little on capability and slightly hurt calibration: the agent talked itself into calls it wouldn’t otherwise have made. We dropped them.

The broader point

Capability is exciting. Calibration is what makes a tool-using agent safe to point at production data. If you can only afford to measure one thing, measure refusal — and pay the model a real reward for saying “I don’t have a way to do that.”
