AI Evaluations measures agent behavior across tool use, task completion, and execution path to establish reliability before an agent is deployed at scale.

Overview

AI Evaluations offers developers a systematic, evidence-based method to qualify AI Agents before they are used at scale. Using LLM-as-a-judge scoring, AI Evaluations runs the agent through a defined set of test cases and measures performance across four dimensions: tool selection accuracy, goal completion, trajectory accuracy, and human input rate. Results are available at both a summary level and a detailed per-case level to identify exactly where an agent succeeded, where it deviated, and why.

Note: Data set upload is not currently available.

As AI Agents grow in complexity, manual inspection can be cumbersome and insufficient to ensure reliability. AI Evaluations provides a systematic, repeatable method for validating agent behavior against defined expectations. Detailed per-case results help developers:
  • Confirm whether the goal was completed and, using the reasoning provided, identify where the agent failed.
  • Validate that tool calls were made in the correct sequence with correct inputs.
  • Confirm that task objectives were met and outputs matched expected values.
  • Capture feedback and annotations to guide iterative improvements.

This traceability reduces deployment risk and supports human-in-the-loop quality assurance.
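
As an illustration of what validating tool calls "in the correct sequence with correct inputs" can mean, the following is a minimal sketch that compares an expected tool call sequence with the one an agent actually produced and reports the first point of deviation. It is not how AI Evaluations scores a run (the product uses LLM-as-a-judge); the names ToolCall and first_deviation, and the example tool names, are hypothetical and chosen only for this example.

```python
from typing import Any, NamedTuple, Optional


class ToolCall(NamedTuple):
    """One tool invocation (hypothetical structure for illustration)."""
    name: str
    inputs: dict[str, Any]


def first_deviation(expected: list[ToolCall], actual: list[ToolCall]) -> Optional[str]:
    """Describe the first mismatch between two sequences, or return None if they agree."""
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp.name != act.name:
            return f"step {i}: expected tool {exp.name!r}, got {act.name!r}"
        if exp.inputs != act.inputs:
            return f"step {i}: tool {exp.name!r} called with unexpected inputs"
    if len(expected) != len(actual):
        # All shared steps matched, so one sequence ends before the other.
        shorter = min(len(expected), len(actual))
        return f"step {shorter}: expected {len(expected)} calls, got {len(actual)}"
    return None


# Example with made-up tools: the agent picks the wrong second tool.
expected = [ToolCall("lookup_user", {"id": 1}), ToolCall("reset_password", {"id": 1})]
actual = [ToolCall("lookup_user", {"id": 1}), ToolCall("send_email", {"id": 1})]
print(first_deviation(expected, actual))
```

A strict comparison like this is useful as a quick sanity check; a judge-based score can additionally tolerate equivalent but differently ordered or phrased calls, which an exact match cannot.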

Detailed Evaluation View

The Detailed Evaluation View surfaces the full context of each test case in a single interface. This depth of visibility accelerates debugging, supports iterative improvement, and provides the traceability needed to make confident deployment decisions. Evaluations can be run automatically against a data set or reviewed manually.

This detailed view presents per-case results from a completed evaluation run. For each test case, the view displays:

  • Metric scores and rationale for Tool Selection, Goal Completion, Trajectory Accuracy, and Human Input.
  • Actual tool call sequences, including tool name, inputs, and tool selection score.
  • Input variables and expected output variables (if provided).
  • User feedback (like/dislike), for manual evaluations.
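
For readers who want to reason about per-case results programmatically, the items above can be pictured as a single record per test case. The sketch below is one possible way to model that record; every class and field name (MetricResult, ToolCallRecord, CaseResult, and so on) is an assumption made for illustration and does not reflect the product's actual data model or any export format.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MetricResult:
    """A metric score with the rationale behind it (hypothetical shape)."""
    score: str       # e.g. "80%" for percentage metrics, "Pass" or "Fail" otherwise
    rationale: str


@dataclass
class ToolCallRecord:
    """One actual tool call made by the agent (hypothetical shape)."""
    tool_name: str
    inputs: dict[str, Any]
    tool_selection_score: float  # 0 to 100


@dataclass
class CaseResult:
    """Everything shown for one test case (hypothetical shape)."""
    metrics: dict[str, MetricResult]
    tool_calls: list[ToolCallRecord]
    input_variables: dict[str, Any]
    expected_outputs: Optional[dict[str, Any]] = None  # only if provided
    user_feedback: Optional[str] = None                # "like" or "dislike", manual evaluations only


def failed_metrics(case: CaseResult) -> list[str]:
    """Return the names of pass/fail metrics that were scored "Fail"."""
    return [name for name, result in case.metrics.items() if result.score == "Fail"]
```

Keeping the rationale next to each score mirrors the view itself: a score alone says that something went wrong, while the rationale says why.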

Run and review the agent evaluation

Evaluations can be run on agents in development as well as in production. The evaluation can be performed directly from the AI Agent page or from the Evaluations landing page. Access the Detailed Evaluation View from the Summary page or after an evaluation run completes.

Metrics

Evaluations use LLM-as-a-judge to score overall agent performance automatically, or human review to score it manually. Each test case in an evaluation run is assessed against the four agent-specific metrics below, each scored at either the turn level or the session level.

Table 1.
Metric | Score | Description | Level
Tool Selection | Percentage (0–100%) | How accurately the agent selected and used the appropriate tools for each turn. | Turn-level
Goal Completion | Pass / Fail | Whether the agent successfully completed the assigned task objective. | Session-level
Trajectory Accuracy | Pass / Fail | Whether the agent followed the expected path of tool calls and steps to reach the outcome. | Session-level
Human Input | Percentage (0–100%) | The proportion of agent steps that required human intervention during execution. | Turn-level
Note: During an evaluation, execution of the AI Agent terminates at the human input step. Scores are provided for the agent sequence up to that step.
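
To make the two score types concrete, the following sketch shows the arithmetic behind a percentage metric and a pass rate. It is only an illustration of what the numbers mean: in AI Evaluations the scores come from LLM-as-a-judge or human review, not from this calculation, and the function names and sample data are hypothetical.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when there is nothing to count."""
    return 100.0 * part / whole if whole else 0.0


def tool_selection_pct(expected_tools: list[str], selected_tools: list[str]) -> float:
    """Share of turns in which the selected tool matches the expected tool."""
    matches = sum(1 for exp, sel in zip(expected_tools, selected_tools) if exp == sel)
    return percentage(matches, len(expected_tools))


def human_input_pct(step_needed_human: list[bool]) -> float:
    """Share of agent steps that required human intervention."""
    return percentage(sum(step_needed_human), len(step_needed_human))


def pass_rate(outcomes: list[str]) -> float:
    """Share of test cases scored "Pass" on a pass/fail metric (summary-level view)."""
    return percentage(sum(1 for o in outcomes if o == "Pass"), len(outcomes))


# Made-up sample data: four turns, one wrong tool and one step needing a human.
print(tool_selection_pct(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 75.0
print(human_input_pct([False, False, True, False]))                     # 25.0
print(pass_rate(["Pass", "Fail", "Pass", "Pass"]))                      # 75.0
```

Note the opposite reading of the two percentage metrics: a higher Tool Selection score indicates better tool use, while a higher Human Input score indicates that more steps needed intervention.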