Automation 360

AI Evaluations summary view

Updated: 2026/05/12

Review aggregate performance metrics and per-case results for a completed evaluation run.

Overview

The Evaluation Summary View offers a single page to review aggregate scores, inspect individual test case results, trace tool call sequences, and validate output accuracy. This feature is intended for developers and managers who are building, testing, and refining agents and skills.

The Evaluation Summary View is the starting point for reviewing a completed evaluation run. It displays high-level metadata and aggregate performance metrics, followed by a row-level breakdown of all test cases in the dataset.

Evaluation metadata

The header area displays the following run-level details:

AI Agent Evaluated: name and version of the agent under test
Evaluation Method: Automatic or Manual
Evaluated On: date the evaluation was run
Run By: user who initiated the evaluation
Model Evaluated: the LLM used (for example, an OpenAI GPT model)
Total Time: wall-clock duration of the full evaluation run
Total Tokens: cumulative token usage across all test cases
Dataset: link to download the evaluation dataset

Aggregate metrics

Below the metadata, six summary metrics give an at-a-glance view of overall agent performance:


Metric	Description	Rating Scale
Tool Selection	Average score for how accurately the agent selected the appropriate tools across all cases.	Good / Fair / Poor
Goal Completion	Pass/Fail rate indicating whether the agent successfully completed its assigned task.	Good / Fair / Poor
Trajectory Accuracy	Score measuring how closely the agent followed the expected execution path.	Good / Fair / Poor
Human Input	Percentage of cases requiring human intervention during execution.	Percentage (%)
Average Time Taken	Mean execution time per test case.	Hours / Minutes / Seconds
Average Tokens	Mean token consumption per test case.	Count

Metric ratings (Good, Fair, Poor) are based on score thresholds: 80% or above is Good, 50–79% is Fair, and below 50% is Poor. Metrics displayed in red indicate a failing result.
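The threshold rule above can be sketched as a small function. This is a minimal illustration, not product code: the function name `rating_for` is hypothetical, and the handling of the boundary (exactly 80% counts as Good, and fractional scores between 79% and 80% count as Fair) is an assumption, because the thresholds as stated do not pin that range down.

```python
def rating_for(score_pct: float) -> str:
    """Map a percentage score (0-100) to a Good / Fair / Poor rating."""
    if not 0 <= score_pct <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score_pct}")
    if score_pct >= 80:  # assumption: exactly 80% is rated Good
        return "Good"
    if score_pct >= 50:  # assumption: 50% up to (not including) 80% is Fair
        return "Fair"
    return "Poor"


# Example scores are made up for illustration.
for score in (92.0, 65.0, 31.5):
    print(f"{score:.1f}% -> {rating_for(score)}")
```

Keeping the comparison in one place makes it easy to adjust if the product treats the boundary values differently.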

Output details table

The Output Details table lists individual test case results. By default, 10 rows are shown per page, with pagination controls at the bottom. The table includes the following columns:

ID — sequential case identifier
Tools — number of tools invoked
Agent Output — truncated summary of the agent's final output; hover to expand
Status — Completed or Failed execution status
Tool Selection — per-case tool selection score (highlighted in red if below threshold)
Human Input — Yes or No indicator
Goal Completion — Pass or Fail
Trajectory Accuracy — Pass or Fail (highlighted in red if failed)
Time Taken — execution duration in seconds
User Review — Like, Dislike, or blank if not reviewed
Tokens Consumed — token count for the case

Rows with data validation errors display an inline error badge; hover over the badge to see the error message. Cases that could not be evaluated show N/A in every metric column.
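To make the relationship between the per-case rows and the aggregate metrics concrete, the sketch below models a row and derives the averages from it. Everything here is illustrative: the `CaseResult` fields and the `summarize` function are hypothetical names, not a documented export format or API, and excluding N/A rows from every aggregate is an assumption about how the averages are formed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    """One Output Details row (hypothetical field names, not a product schema)."""
    case_id: int
    tool_selection_pct: Optional[float]  # None stands in for N/A
    goal_completed: Optional[bool]
    human_input: Optional[bool]
    time_taken_s: Optional[float]
    tokens: Optional[int]

    @property
    def evaluated(self) -> bool:
        # Assumption: an N/A row has no metric values at all.
        return self.tool_selection_pct is not None


def summarize(rows: list[CaseResult]) -> dict[str, float]:
    """Compute aggregate metrics over the rows that could be evaluated."""
    scored = [r for r in rows if r.evaluated]
    if not scored:
        raise ValueError("no evaluable cases in this run")
    n = len(scored)
    return {
        "tool_selection_pct": sum(r.tool_selection_pct for r in scored) / n,
        "goal_completion_pct": 100 * sum(1 for r in scored if r.goal_completed) / n,
        "human_input_pct": 100 * sum(1 for r in scored if r.human_input) / n,
        "avg_time_s": sum(r.time_taken_s for r in scored) / n,
        "avg_tokens": sum(r.tokens for r in scored) / n,
    }


# Made-up sample data; case 4 represents a row shown as N/A.
rows = [
    CaseResult(1, 90.0, True, False, 12.0, 1500),
    CaseResult(2, 60.0, False, True, 18.0, 2100),
    CaseResult(3, 75.0, True, False, 15.0, 1800),
    CaseResult(4, None, None, None, None, None),
]
summary = summarize(rows)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```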

Select any row to open the Detailed Evaluation View for a full inspection of a single test case result. Use the breadcrumb at the top of the page to return to the summary. The sections that follow describe the Detailed Evaluation View.

Metrics breakdown

Four metric cards appear at the top of the Detailed Evaluation View. Each card shows the score or pass/fail result along with a rationale field, which contains the LLM reasoning or evaluator notes explaining how the result was determined.


Metric	Result Format	Rationale Field
Tool Selection	Score (%) + Pass/Fail	Explanation of tool selection accuracy across steps.
Goal Completion	Pass / Fail	Notes on whether task completion criteria were met.
Trajectory Accuracy	Score (0–1)	Explanation of deviation from expected path, if any.
Human Input	Yes / No	Indicates whether human-in-the-loop intervention occurred.

Agent output

The full text of the agent's final output is displayed, including any LLM-generated reasoning. This section also shows the final tool call result, its execution status (Success, Failed, or Canceled), and the output variables returned.

Tool call sequence

The Tool Calls section lists every tool the agent executed during the case, in order. Each row in the table includes:

Actual Tool Call: name of the tool invoked
Type of Actual Tool Call: categorized as AI agent, API task, Human in the loop, Process, or Task bot
Inputs: input variables passed to the tool

When an expected tool call sequence is configured for the evaluation, a comparison view shows the expected sequence alongside the actual sequence, enabling direct identification of deviations.
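The kind of comparison described above can be approximated offline with Python's standard `difflib` module. This is only a sketch of one way to locate deviations between two ordered lists of tool names; it is not how the product computes its comparison view, and the function name and tool names are hypothetical.

```python
from difflib import SequenceMatcher


def sequence_deviations(expected: list[str], actual: list[str]) -> list[str]:
    """Describe where the actual tool call order departs from the expected one."""
    deviations = []
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # this stretch of calls matches the expectation
        if tag == "delete":
            deviations.append(f"missing: {expected[i1:i2]}")
        elif tag == "insert":
            deviations.append(f"unexpected: {actual[j1:j2]}")
        else:  # "replace": a different tool was called at this position
            deviations.append(f"expected {expected[i1:i2]} but got {actual[j1:j2]}")
    return deviations


# Hypothetical tool names for illustration only.
expected_calls = ["LookupCustomer", "CheckInventory", "CreateOrder"]
actual_calls = ["LookupCustomer", "CreateOrder", "SendEmail"]

for line in sequence_deviations(expected_calls, actual_calls):
    print(line)
```

An empty result means the two sequences are identical, which corresponds to the agent following the expected path exactly.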

Dataset: inputs and outputs

The Dataset section at the bottom of the page displays the input and output variable sets for the test case.

Agent Input Variables: variable name, type, and value for each input provided to the agent
Agent Expected Output Variables: variable name, type, and expected value; indicates whether the actual output matched
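The match indicator for expected output variables can be pictured as a per-variable comparison. The sketch below assumes that a match means strict equality and that a variable missing from the actual output counts as a mismatch; the product may apply different matching rules. The class, function, and variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ExpectedVariable:
    """An expected output variable: name, type label, and expected value."""
    name: str
    type: str
    expected_value: Any


def match_outputs(
    expected: list[ExpectedVariable], actual: dict[str, Any]
) -> dict[str, bool]:
    """Return, for each expected variable, whether the actual output matched."""
    # Assumption: strict equality; a missing variable is treated as a mismatch.
    return {
        var.name: var.name in actual and actual[var.name] == var.expected_value
        for var in expected
    }


# Made-up variables and values for illustration.
expected_vars = [
    ExpectedVariable("order_status", "String", "Shipped"),
    ExpectedVariable("item_count", "Number", 3),
    ExpectedVariable("tracking_id", "String", "TRK-001"),
]
actual_output = {"order_status": "Shipped", "item_count": 2}

matches = match_outputs(expected_vars, actual_output)
for name, matched in matches.items():
    print(f"{name}: {'match' if matched else 'mismatch'}")
```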

Feedback and annotations

Reviewers can provide structured feedback directly on a test case:

User Feedback: Like or Dislike rating with an optional comment.
Annotations: notes attached to specific tools or steps, categorized as Observation, Error, or Suggestion.

Use the Add Annotation button to attach a note to any step in the tool call sequence. Annotations support the human-in-the-loop review cycle and are visible to other evaluators.
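For teams that track review notes outside the product, the feedback and annotation concepts above map naturally onto a small data model. This is a hypothetical sketch: the class names, fields, and the 0-based `step_index` are assumptions for illustration, not a documented schema. Only the three category names and the Like/Dislike rating come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AnnotationCategory(Enum):
    OBSERVATION = "Observation"
    ERROR = "Error"
    SUGGESTION = "Suggestion"


@dataclass
class Annotation:
    step_index: int  # position in the tool call sequence (0-based, assumed)
    category: AnnotationCategory
    note: str
    author: str


@dataclass
class CaseReview:
    liked: Optional[bool] = None  # True = Like, False = Dislike, None = not reviewed
    comment: str = ""
    annotations: list[Annotation] = field(default_factory=list)

    def add_annotation(
        self, step_index: int, category: AnnotationCategory, note: str, author: str
    ) -> Annotation:
        """Attach a note to one step of the tool call sequence."""
        if step_index < 0:
            raise ValueError("step_index must be non-negative")
        annotation = Annotation(step_index, category, note, author)
        self.annotations.append(annotation)
        return annotation

    def by_category(self, category: AnnotationCategory) -> list[Annotation]:
        """Return the annotations in one category, in the order they were added."""
        return [a for a in self.annotations if a.category is category]


# Made-up review content for illustration.
review = CaseReview(liked=False, comment="Wrong tool used for the lookup step.")
review.add_annotation(1, AnnotationCategory.ERROR, "Called the wrong API task.", "reviewer_a")
review.add_annotation(2, AnnotationCategory.SUGGESTION, "Add a retry here.", "reviewer_b")

for a in review.by_category(AnnotationCategory.ERROR):
    print(f"step {a.step_index} [{a.category.value}] {a.note} ({a.author})")
```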