Review aggregate performance metrics and per-case results for a completed evaluation run.

Overview

The Evaluation Summary View offers a single page to review aggregate scores, inspect individual test case results, trace tool call sequences, and validate output accuracy. This feature is intended for developers and managers who are building, testing, and refining agents and skills.

The Evaluation Summary View is the starting point for reviewing a completed evaluation run. It displays high-level metadata and aggregate performance metrics, followed by a row-level breakdown of all test cases in the dataset.

Evaluation metadata

The header area displays the following run-level details:

  • AI Agent Evaluated: name and version of the agent under test
  • Evaluation Method: Automatic or Manual
  • Evaluated On: date the evaluation was run
  • Run By: user who initiated the evaluation
  • Model Evaluated: the LLM used (e.g., OpenAI GPT 3.4)
  • Total Time: wall-clock duration of the full evaluation run
  • Total Tokens: cumulative token usage across all test cases
  • Dataset: link to download the evaluation dataset

Aggregate metrics

Below the metadata, six summary metrics give an at-a-glance view of overall agent performance:

Metric | Description | Rating Scale
Tool Selection | Average score for how accurately the agent selected the appropriate tools across all cases. | Good / Fair / Poor
Goal Completion | Pass/Fail rate indicating whether the agent successfully completed its assigned task. | Good / Fair / Poor
Trajectory Accuracy | Score measuring how closely the agent followed the expected execution path. | Good / Fair / Poor
Human Input | Percentage of cases requiring human intervention during execution. | Percentage (%)
Average Time Taken | Mean execution time per test case. | Hours / Minutes / Seconds
Average Tokens | Mean token consumption per test case. | Count

Metric ratings (Good, Fair, Poor) are based on score thresholds: above 80% is Good, 50–79% is Fair, and below 50% is Poor. Metrics displayed in red indicate a failing result.
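
The threshold rule above can be sketched as a small function. This is an illustrative sketch, not the product's implementation: the function name `rate` is hypothetical, and because the stated ranges do not say how a score of exactly 80% (or between 79% and 80%) is rated, the sketch assumes that anything at or above 80% is Good.

```python
def rate(score_pct: float) -> str:
    """Map a percentage score (0-100) to a rating label.

    Assumption: 80 and above is Good, 50 up to (not including) 80 is Fair,
    below 50 is Poor. The boundary handling at 80 is a guess.
    """
    if not 0 <= score_pct <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score_pct}")
    if score_pct >= 80:
        return "Good"
    if score_pct >= 50:
        return "Fair"
    return "Poor"


print(rate(92))  # Good
print(rate(64))  # Fair
print(rate(31))  # Poor
```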

Output details table

The Output Details table lists individual test case results. By default, 10 rows are shown per page, with pagination controls at the bottom. The table includes the following columns:

  • ID: sequential case identifier
  • Tools: number of tools invoked
  • Agent Output: truncated summary of the agent's final output; hover to expand
  • Status: execution status, either Completed or Failed
  • Tool Selection: per-case tool selection score (highlighted in red if below threshold)
  • Human Input: Yes or No indicator
  • Goal Completion: Pass or Fail
  • Trajectory Accuracy: Pass or Fail (highlighted in red if failed)
  • Time Taken: execution duration in seconds
  • User Review: Like, Dislike, or blank if not reviewed
  • Tokens Consumed: token count for the case

Rows with data validation errors display an error badge inline. Hovering over the badge reveals the error message. Cases where data could not be evaluated show N/A across all metric columns.
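
To make the relationship between the per-case rows and the aggregate metrics concrete, the following sketch computes summary values from a list of case results. The `CaseResult` fields and the `summarize` function are hypothetical names, and excluding N/A rows from the averages is an assumption; the product may treat unevaluated cases differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    case_id: int
    tool_selection_pct: float
    goal_passed: bool
    trajectory_passed: bool
    human_input: bool
    time_taken_s: float
    tokens: int
    evaluated: bool = True  # False models a row shown as N/A


def summarize(cases: list) -> dict:
    """Aggregate per-case results, skipping rows that could not be evaluated."""
    rows = [c for c in cases if c.evaluated]
    if not rows:
        raise ValueError("no evaluated cases to summarize")
    n = len(rows)
    return {
        "tool_selection_pct": mean(c.tool_selection_pct for c in rows),
        "goal_completion_pct": 100 * sum(c.goal_passed for c in rows) / n,
        "trajectory_accuracy_pct": 100 * sum(c.trajectory_passed for c in rows) / n,
        "human_input_pct": 100 * sum(c.human_input for c in rows) / n,
        "avg_time_s": mean(c.time_taken_s for c in rows),
        "avg_tokens": mean(c.tokens for c in rows),
    }


cases = [
    CaseResult(1, 90.0, True, True, False, 10.0, 100),
    CaseResult(2, 70.0, False, True, True, 20.0, 300),
    CaseResult(3, 0.0, False, False, False, 0.0, 0, evaluated=False),
]
print(summarize(cases))
```

In this sketch the third case is ignored entirely, so the averages are taken over two rows rather than three.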

Select any row to open the Detailed Evaluation View for a full inspection of a single test case result. Use the breadcrumb at the top of the page to return to the summary.

Metrics breakdown

Four metric cards are displayed at the top of the Detailed Evaluation View. Each card shows the score or pass/fail result along with a rationale field, which contains the LLM reasoning or evaluator notes explaining how the result was determined.

Metric | Result Format | Rationale Field
Tool Selection | Score (%) + Pass/Fail | Explanation of tool selection accuracy across steps.
Goal Completion | Pass / Fail | Notes on whether task completion criteria were met.
Trajectory Accuracy | Score (0–1) | Explanation of deviation from expected path, if any.
Human Input | Yes / No | Indicates whether human-in-the-loop intervention occurred.

Agent output

The full text of the agent's final output is displayed, including any LLM-generated reasoning. This section also shows the final tool call result, its execution status (Success, Failed, or Canceled), and the output variables returned.

Tool call sequence

The Tool Calls section lists every tool the agent executed during the case, in order. Each row in the table includes:

  • Actual Tool Call: name of the tool invoked
  • Type of Actual Tool Call: categorized as AI agent, API task, Human in the loop, Process, or Task bot
  • Inputs: input variables passed to the tool

When an expected tool call sequence is configured for the evaluation, a comparison view shows the expected sequence alongside the actual sequence, enabling direct identification of deviations.
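
The kind of comparison described above can be sketched as a step-by-step check of two sequences. This is a minimal illustration under stated assumptions: the names `StepDiff`, `compare_tool_calls`, and `deviations`, and the tool names in the example, are hypothetical, and the comparison is strictly positional, which may not be how the product aligns the two sequences.

```python
from itertools import zip_longest
from typing import NamedTuple, Optional


class StepDiff(NamedTuple):
    step: int                # 1-based position in the sequence
    expected: Optional[str]  # None if the expected sequence is shorter
    actual: Optional[str]    # None if the actual sequence is shorter
    matches: bool


def compare_tool_calls(expected: list, actual: list) -> list:
    """Pair expected and actual tool names position by position."""
    return [
        StepDiff(i, e, a, e == a)
        for i, (e, a) in enumerate(zip_longest(expected, actual), start=1)
    ]


def deviations(expected: list, actual: list) -> list:
    """Return only the steps where the actual call differs from the expected one."""
    return [d for d in compare_tool_calls(expected, actual) if not d.matches]


expected = ["lookup_order", "check_inventory", "send_email"]
actual = ["lookup_order", "send_email"]
for d in deviations(expected, actual):
    print(d.step, d.expected, d.actual)
```

Because the check is positional, one skipped tool call marks every later step as a deviation. An alignment-based diff would report only the missing step, at the cost of a more involved comparison.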

Dataset: inputs and outputs

The Dataset section at the bottom of the page displays the input and output variable sets for the test case.

  • Agent Input Variables: variable name, type, and value for each input provided to the agent
  • Agent Expected Output Variables: variable name, type, and expected value; indicates whether the actual output matched
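
A per-variable match indicator like the one described above could be computed along the following lines. The function name `match_outputs` and the sample variables are hypothetical, and the sketch assumes exact equality; the product may apply a looser or type-aware comparison.

```python
from typing import Any


def match_outputs(expected: dict, actual: dict) -> dict:
    """Return, for each expected variable, whether the actual output matched it.

    A variable that is missing from the actual output counts as a mismatch.
    """
    results = {}
    for name, expected_value in expected.items():
        results[name] = name in actual and actual[name] == expected_value
    return results


expected = {"status": "approved", "amount": 120}
actual = {"status": "approved", "amount": 125}
print(match_outputs(expected, actual))  # {'status': True, 'amount': False}
```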

Feedback and annotations

Reviewers can provide structured feedback directly on a test case:

  • User Feedback: Like or Dislike rating with an optional comment.
  • Annotations: notes attached to specific tools or steps, categorized as Observation, Error, or Suggestion.

Use the Add Annotation button to attach a note to any step in the tool call sequence. Annotations support the human-in-the-loop review cycle and are visible to other evaluators.
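
For teams that export or post-process review notes, an annotation can be modeled as a small record whose category is restricted to the three values listed above. The `Annotation` fields, the `Category` enum, and `annotations_for_step` are hypothetical names chosen for this sketch, not part of the product.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    OBSERVATION = "Observation"
    ERROR = "Error"
    SUGGESTION = "Suggestion"


@dataclass(frozen=True)
class Annotation:
    step: int           # 1-based position in the tool call sequence
    category: Category
    note: str
    author: str


def annotations_for_step(annotations: list, step: int) -> list:
    """Return every annotation attached to the given step."""
    return [a for a in annotations if a.step == step]


notes = [
    Annotation(2, Category("Error"), "Called the wrong tool.", "reviewer_a"),
    Annotation(1, Category.OBSERVATION, "Inputs look correct.", "reviewer_b"),
]
for a in annotations_for_step(notes, 2):
    print(a.category.value, a.note)
```

Looking up a category by its label, as in `Category("Error")`, raises `ValueError` for any label outside the three defined values.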