Evaluating an AI Agent
- Updated: 2026/05/12
AI Evaluations measure agent behavior across tool use, task completion, and execution path to build reliability before an agent is deployed at scale.
Overview
AI Evaluations offers developers a systematic, evidence-based method to qualify AI Agents before they are used at scale. Using LLM-as-a-judge scoring, AI Evaluations runs the agent through a defined set of test cases and measures performance across four dimensions: tool selection accuracy, goal completion, trajectory accuracy, and human input rate. Results are available at both a summary level and a detailed per-case level to identify exactly where an agent succeeded, where it deviated, and why.
With these results, developers can:
- Confirm whether the goal was completed and, where the agent failed, use the reasoning provided to see why.
- Validate that tool calls were made in the correct sequence with correct inputs.
- Confirm that task objectives were met and outputs matched expected values.
- Capture feedback and annotations to guide iterative improvements.
This traceability reduces deployment risk and supports human-in-the-loop quality assurance.
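To make "correct sequence with correct inputs" concrete, the following is a minimal sketch of a deterministic comparison between an expected and an actual tool call sequence. It is a simplified stand-in for illustration only: AI Evaluations uses LLM-as-a-judge scoring, and the `ToolCall` class, the `first_deviation` function, and the tool names are hypothetical, not part of the product.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One tool invocation (hypothetical shape, not the product's schema)."""
    name: str
    inputs: dict[str, Any] = field(default_factory=dict)


def first_deviation(expected: list[ToolCall], actual: list[ToolCall]) -> Optional[str]:
    """Describe the first point where the actual calls leave the expected path.

    Returns None when both sequences match in order, tool name, and inputs.
    """
    for step, (exp, act) in enumerate(zip(expected, actual), start=1):
        if exp.name != act.name:
            return f"step {step}: expected tool '{exp.name}', got '{act.name}'"
        if exp.inputs != act.inputs:
            return f"step {step}: tool '{exp.name}' called with unexpected inputs"
    # All compared steps matched; a length difference means extra or missing calls.
    if len(expected) != len(actual):
        return f"expected {len(expected)} tool calls, got {len(actual)}"
    return None


# Hypothetical test case: the agent should look up an order, then refund it.
expected = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1"}),
]
actual = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("send_email", {"to": "customer"}),
]
print(first_deviation(expected, actual))
```

An exact-match check like this is stricter than a judge model, which can accept equivalent paths. It is useful mainly as a mental model for what a deviation looks like.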
Detailed Evaluation View
The Detailed Evaluation View surfaces the full context of each test case in a single interface. This depth of visibility accelerates debugging, supports iterative improvement, and provides the traceability needed to make confident deployment decisions. Evaluations can be run automatically against a dataset or reviewed manually.
This detailed view presents per-case results from a completed evaluation run. For each test case, the view displays:
- Metric scores and rationale for Tool Selection, Goal Completion, Trajectory Accuracy, and Human Input.
- Actual tool call sequences, including tool name, inputs, and tool selection score.
- Input variables and expected output variables (if provided).
- User feedback (like/dislike) from manual evaluations.
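The per-case information listed above can be pictured as a single record. The sketch below is one possible way to model it, assuming field names and score strings that are not taken from the product: `MetricResult`, `CaseResult`, and `needs_review` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MetricResult:
    """A metric's score plus the rationale behind it (hypothetical shape)."""
    score: str       # e.g. "Pass", "Fail", or a percentage such as "80%"
    rationale: str   # explanation given for the score


@dataclass
class CaseResult:
    """Everything shown for one test case (field names are assumptions)."""
    case_id: str
    metrics: dict[str, MetricResult]
    tool_calls: list[dict[str, Any]]
    input_variables: dict[str, Any]
    expected_output: Optional[dict[str, Any]] = None  # present only if provided
    feedback: Optional[str] = None                    # "like" or "dislike" from manual review


def needs_review(case: CaseResult) -> bool:
    """Flag a case when any metric failed or a reviewer disliked the result."""
    any_failed = any(m.score == "Fail" for m in case.metrics.values())
    return any_failed or case.feedback == "dislike"


# Hypothetical example case.
case = CaseResult(
    case_id="case-001",
    metrics={
        "Goal Completion": MetricResult("Fail", "The refund was never issued."),
        "Trajectory Accuracy": MetricResult("Pass", "Tool calls followed the expected path."),
    },
    tool_calls=[{"name": "lookup_order", "inputs": {"order_id": "A1"}}],
    input_variables={"order_id": "A1"},
)
print(needs_review(case))
```

Keeping the rationale next to each score mirrors how the view pairs every metric with its explanation, so a flagged case already carries the reason it was flagged.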
Run and review the agent evaluation
Evaluations can be run on agents in development as well as in production. The evaluation can be performed directly from the AI Agent page or from the Evaluations landing page. Access the Detailed Evaluation View from the Summary page or after an evaluation run completes.
Metrics
Evaluations score agent performance either automatically, using LLM-as-a-judge, or manually, through human review. Each test case in an evaluation run is assessed against the four agent-specific metrics below, each of which is scored at either the turn level or the session level.
| Metric | Score | Description | Level |
| --- | --- | --- | --- |
| Tool Selection | Percentage (0–100%) | How accurately the agent selected and used the appropriate tools for each turn. | turn-level |
| Goal Completion | Pass / Fail | Whether the agent successfully completed the assigned task objective. | session-level |
| Trajectory Accuracy | Pass / Fail | Whether the agent followed the expected path of tool calls and steps to reach the outcome. | session-level |
| Human Input | Percentage (0–100%) | The proportion of agent steps that required human intervention during execution. | turn-level |
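As an illustration of how turn-level and session-level scores differ, the sketch below derives the two percentage metrics from per-turn flags and reports the two Pass / Fail metrics for the session as a whole. The aggregation shown (a simple share of turns) is an assumption, since the exact scoring formula is not specified here, and the `Turn` record and `score_session` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One agent step (hypothetical record; field names are assumptions)."""
    tool_selection_correct: bool  # verdict on the tool choice for this turn
    needed_human_input: bool      # True if a human had to intervene


def percentage(flags: list[bool]) -> float:
    """Share of True values on a 0-100 scale; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def score_session(turns: list[Turn], goal_completed: bool, trajectory_matched: bool) -> dict:
    """Combine turn-level flags and session-level verdicts into one result."""
    return {
        # Turn-level metrics: aggregated over every turn in the session.
        "tool_selection_pct": percentage([t.tool_selection_correct for t in turns]),
        "human_input_pct": percentage([t.needed_human_input for t in turns]),
        # Session-level metrics: one verdict for the whole run.
        "goal_completion": "Pass" if goal_completed else "Fail",
        "trajectory_accuracy": "Pass" if trajectory_matched else "Fail",
    }


# Four turns: three correct tool selections, one step that needed a human.
turns = [
    Turn(tool_selection_correct=True, needed_human_input=False),
    Turn(tool_selection_correct=True, needed_human_input=False),
    Turn(tool_selection_correct=False, needed_human_input=True),
    Turn(tool_selection_correct=True, needed_human_input=False),
]
result = score_session(turns, goal_completed=True, trajectory_matched=False)
print(result)
```

With these inputs the sketch reports 75.0 for tool selection and 25.0 for human input, alongside a Pass for goal completion and a Fail for trajectory accuracy. This shows that an agent can reach the goal while still deviating from the expected path.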