AI Evaluations summary view
- Updated: 2026/05/12
Review aggregate performance metrics and per-case results for a completed evaluation run.
Overview
The Evaluation Summary View provides a single page for reviewing aggregate scores, inspecting individual test case results, tracing tool call sequences, and validating output accuracy. It is intended for developers and managers who build, test, and refine agents and skills.
This view is the starting point for reviewing a completed evaluation run. It displays high-level metadata and aggregate performance metrics, followed by a row-level breakdown of all test cases in the dataset.
Evaluation metadata
The header area displays the following run-level details:
- AI Agent Evaluated: name and version of the agent under test
- Evaluation Method: Automatic or Manual
- Evaluated On: date the evaluation was run
- Run By: user who initiated the evaluation
- Model Evaluated: the LLM used (e.g., OpenAI GPT 3.4)
- Total Time: wall-clock duration of the full evaluation run
- Total Tokens: cumulative token usage across all test cases
- Dataset: link to download the evaluation dataset
Aggregate metrics
Below the metadata, six summary metrics give an at-a-glance view of overall agent performance:
| Metric | Description | Rating Scale |
|---|---|---|
| Tool Selection | Average score for how accurately the agent selected the appropriate tools across all cases. | Good / Fair / Poor |
| Goal Completion | Pass rate: the share of cases in which the agent successfully completed its assigned task. | Good / Fair / Poor |
| Trajectory Accuracy | Score measuring how closely the agent followed the expected execution path. | Good / Fair / Poor |
| Human Input | Percentage of cases requiring human intervention during execution. | Percentage (%) |
| Average Time Taken | Mean execution time per test case. | Hours / Minutes / Seconds |
| Average Tokens | Mean token consumption per test case. | Count |
Metric ratings (Good, Fair, Poor) are based on score thresholds: above 80% is Good, 50–79% is Fair, and below 50% is Poor. Metrics displayed in red indicate a failing result.
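The threshold rule above can be sketched as a small function. This is an illustration only, not the product's implementation: the function name `rate` is hypothetical, and the sketch assumes that a score of exactly 80% counts as Good and that anything from 50% up to (but not including) 80% counts as Fair. The product may treat boundary values differently.

```python
def rate(score_pct: float) -> str:
    """Map a percentage score (0-100) to a rating label.

    Assumption: 80 is treated as Good and 50 as Fair; the handling of
    boundary values is not confirmed by the documentation.
    """
    if not 0 <= score_pct <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score_pct}")
    if score_pct >= 80:
        return "Good"
    if score_pct >= 50:
        return "Fair"
    return "Poor"
```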
Output details table
The Output Details table lists individual test case results. By default, 10 rows are shown per page, with pagination controls at the bottom. The table includes the following columns:
- ID — sequential case identifier
- Tools — number of tools invoked
- Agent Output — truncated summary of the agent's final output; hover to expand
- Status — Completed or Failed execution status
- Tool Selection — per-case tool selection score (highlighted in red if below threshold)
- Human Input — Yes or No indicator
- Goal Completion — Pass or Fail
- Trajectory Accuracy — Pass or Fail (highlighted in red if failed)
- Time Taken — execution duration in seconds
- User Review — Like, Dislike, or blank if not reviewed
- Tokens Consumed — token count for the case
Rows with data validation errors display an error badge inline. Hover over the badge to see the error message. Cases that could not be evaluated show N/A across all metric columns.
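To make the relationship between the per-case rows and the aggregate metrics concrete, here is a minimal sketch of how such aggregates might be derived from row data. The class and field names (`CaseMetrics`, `CaseRow`, `summarize`) are hypothetical, and the sketch assumes that rows shown as N/A are excluded from every average; the documentation does not state how the product treats them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class CaseMetrics:
    tool_selection_pct: float  # per-case tool selection score, 0-100
    goal_completed: bool       # Goal Completion: Pass (True) or Fail (False)
    human_input: bool          # Human Input: Yes (True) or No (False)
    time_taken_s: float        # execution duration in seconds
    tokens: int                # tokens consumed by the case


@dataclass
class CaseRow:
    case_id: int
    metrics: Optional[CaseMetrics]  # None models a row displayed as N/A


def summarize(rows: list[CaseRow]) -> Optional[dict]:
    """Compute aggregate metrics over evaluated rows.

    Assumption: N/A rows are skipped rather than counted as failures.
    """
    evaluated = [r.metrics for r in rows if r.metrics is not None]
    if not evaluated:
        return None
    n = len(evaluated)
    return {
        "tool_selection_avg": mean(m.tool_selection_pct for m in evaluated),
        "goal_completion_rate": 100 * sum(m.goal_completed for m in evaluated) / n,
        "human_input_pct": 100 * sum(m.human_input for m in evaluated) / n,
        "avg_time_s": mean(m.time_taken_s for m in evaluated),
        "avg_tokens": mean(m.tokens for m in evaluated),
    }
```

Returning `None` for a run with no evaluated rows avoids reporting a misleading 0% when nothing could be scored.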
Select any row to open the Detailed Evaluation View for a full inspection of a single test case result. Use the breadcrumb at the top of the page to return to the summary.
Metrics breakdown
Four metric cards are displayed at the top of the Detailed Evaluation View. Each card shows the score or pass/fail result along with a rationale field containing the LLM reasoning or evaluator notes that explain how the result was determined.
| Metric | Result Format | Rationale Field |
|---|---|---|
| Tool Selection | Score (%) + Pass/Fail | Explanation of tool selection accuracy across steps. |
| Goal Completion | Pass / Fail | Notes on whether task completion criteria were met. |
| Trajectory Accuracy | Score (0–1) | Explanation of deviation from expected path, if any. |
| Human Input | Yes / No | Indicates whether human-in-the-loop intervention occurred. |
Agent output
The full text of the agent's final output is displayed, including any LLM-generated reasoning. This section also shows the final tool call result, its execution status (Success, Failed, or Canceled), and the output variables returned.
Tool call sequence
The Tool Calls section lists every tool the agent executed during the case, in order. Each row in the table includes:
- Actual Tool Call: name of the tool invoked
- Type of Actual Tool Call: categorized as AI agent, API task, Human in the loop, Process, or Task bot
- Inputs: input variables passed to the tool
When an expected tool call sequence is configured for the evaluation, a comparison view shows the expected sequence alongside the actual sequence, enabling direct identification of deviations.
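One way to think about the expected-versus-actual comparison is as a sequence diff. The sketch below uses Python's standard `difflib.SequenceMatcher` to list the spans where the two sequences differ. It is an illustration of the idea, not the algorithm the product uses; the function name `sequence_deviations` and the tool names are hypothetical.

```python
from difflib import SequenceMatcher


def sequence_deviations(expected: list[str], actual: list[str]) -> list[tuple]:
    """Return (tag, expected_span, actual_span) for every non-matching span.

    tag is one of "replace", "delete", or "insert", as reported by difflib.
    """
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical tool names, for illustration only.
expected = ["lookup_customer", "check_inventory", "create_order"]
actual = ["lookup_customer", "create_order", "send_email"]

for tag, exp_span, act_span in sequence_deviations(expected, actual):
    print(tag, exp_span, act_span)
# delete ['check_inventory'] []
# insert [] ['send_email']
```

Here "delete" marks an expected tool the agent skipped, and "insert" marks a tool the agent called that was not in the expected sequence.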
Dataset: inputs and outputs
The Dataset section at the bottom of the page displays the input and output variable sets for the test case.
- Agent Input Variables: variable name, type, and value for each input provided to the agent
- Agent Expected Output Variables: variable name, type, and expected value, with an indicator of whether the actual output matched
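The match indicator for expected output variables can be pictured as a per-variable comparison. The following sketch assumes a strict match on both type and value; the product may apply a different rule (for example, a tolerance or an LLM-based comparison). `Variable` and `match_outputs` are hypothetical names used only for illustration.

```python
from typing import Any, NamedTuple


class Variable(NamedTuple):
    name: str
    type: str
    value: Any


def match_outputs(expected: list[Variable], actual: list[Variable]) -> dict[str, bool]:
    """Report, per expected variable, whether the actual output matched.

    Assumption: a match requires the same type and an equal value;
    a variable missing from the actual output counts as a mismatch.
    """
    actual_by_name = {v.name: v for v in actual}
    result = {}
    for exp in expected:
        got = actual_by_name.get(exp.name)
        result[exp.name] = (
            got is not None and got.type == exp.type and got.value == exp.value
        )
    return result
```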
Feedback and annotations
Reviewers can provide structured feedback directly on a test case:
- User Feedback: Like or Dislike rating with an optional comment.
- Annotations: notes attached to specific tools or steps, categorized as Observation, Error, or Suggestion.
Use the Add Annotation button to attach a note to any step in the tool call sequence. Annotations support the human-in-the-loop review cycle and are visible to other evaluators.
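For teams that export or post-process review data, an annotation can be modeled as a small record tied to a step in the tool call sequence. This is a sketch under stated assumptions: the field names (`step_index`, `note`, `author`) and the helper `annotations_for_step` are hypothetical, and only the three category values come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class AnnotationCategory(Enum):
    OBSERVATION = "Observation"
    ERROR = "Error"
    SUGGESTION = "Suggestion"


@dataclass(frozen=True)
class Annotation:
    step_index: int               # position in the tool call sequence (hypothetical field)
    category: AnnotationCategory  # one of the three documented categories
    note: str                     # reviewer's free-text note
    author: str                   # reviewer who added the note (hypothetical field)


def annotations_for_step(annotations: list[Annotation], step_index: int) -> list[Annotation]:
    """Return all annotations attached to the given step, in input order."""
    return [a for a in annotations if a.step_index == step_index]
```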