AI Evaluations
- Updated: 2025/12/18
AI Evaluations enables admins to track and rate the output of generative AI capabilities.
AI Evaluations overview

AI Evaluations is a tool designed for assessing the characteristics and capabilities of generative AI (genAI) systems. It includes metrics and methodologies to quantify and qualify aspects such as performance, robustness, fairness, safety, interpretability, and alignment with intended objectives and ethical principles. AI Evaluations is intended for pro-developers to evaluate and qualify AI Skills during the design phase, ensuring they meet the required standards.
With the growing adoption of generative AI, there is a pressing need for tools that assess model quality prior to organizational deployment and scaling. AI Evaluations fulfills this requirement by offering a simple and intuitive interface that accelerates the assessment process. By conducting thorough evaluations, users can mitigate risks associated with degraded model performance and compromised quality, ensuring reliable AI solutions.

These evaluations leverage Natural Language Processing (NLP) and Large Language Models (LLMs) to judge and deliver scores, providing insights on how to improve AI systems. The evaluation process employs a systematic approach, utilizing NLP metrics and research-driven insights to conduct detailed assessments. This involves comparing model outputs to desired outcomes, monitoring for performance drifts, and prompting revisions when necessary. Continuous refinement ensures that AI models remain effective and optimized for user needs.
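As a rough illustration of this kind of reference-based comparison, the sketch below scores a model output against a desired output with a token-overlap F1 metric. This is a generic NLP metric sketch under simple assumptions (whitespace tokenization, an illustrative threshold), not the exact scoring that AI Evaluations performs internally:

```python
from collections import Counter

def token_f1(expected: str, generated: str) -> float:
    """Token-overlap F1 between an expected output and a model output."""
    exp = expected.lower().split()
    gen = generated.lower().split()
    if not exp or not gen:
        return 0.0
    # Multiset intersection counts tokens shared by both texts.
    overlap = sum((Counter(exp) & Counter(gen)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)

# Compare each model output to its expected output and flag low scores,
# which would prompt a revision of the AI Skill and a rerun.
pairs = [
    ("The invoice total is $42.", "Invoice total: $42."),
]
for expected, generated in pairs:
    score = token_f1(expected, generated)
    if score < 0.5:  # threshold is illustrative, not a product default
        print(f"Low score ({score:.2f}): consider revising and rerunning")
```

Monitoring these scores across runs is one simple way to detect the performance drifts mentioned above.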
AI Evaluations landing page
- Find completed evaluations under the Evaluations tab.
- Click the evaluation Name to display insights from your completed evaluation.
Key concepts
Output Comparison: A key feature of AI Evaluations is the ability to compare the output generated by a language model to a predefined, desired output. This ensures alignment with specified criteria and standards, optimizing the relevance and accuracy of the generated content. A data set of expected outputs can be uploaded or entered manually for these comparisons.
Simultaneous Evaluation in AI Skill Development: As AI Skills evolve, evaluations can be conducted concurrently with model development. This iterative process allows for real-time adjustments and improvements, fostering the dynamic enhancement of AI capabilities.
Metrics and Research Insights: The evaluation process is underpinned by industry standards for NLP metrics and comprehensive machine learning research. These metrics provide a framework that delivers valuable insights into the benefits and performance of AI models. In instances where variables are incomplete, the system prompts a revision and rerun of the evaluation, ensuring thoroughness and precision.
The computation of these NLP metrics relies on the expected outputs that you provide during configuration of the evaluation run. In cases where expected outputs are not available in the evaluation, the LLM-as-a-judge uses predefined metrics to deliver scores.
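The two scoring paths above can be sketched as a simple dispatch: use reference-based metrics when expected outputs were configured, otherwise fall back to an LLM-as-a-judge. Note that `judge_with_llm` and `exact_match` are hypothetical stand-ins for illustration only; AI Evaluations handles this selection internally:

```python
from typing import Optional

def judge_with_llm(output: str) -> float:
    """Hypothetical stand-in for an LLM-as-a-judge call that scores an
    output against predefined metrics. A real system would call a judge
    model here; this stub returns a fixed score."""
    return 0.75

def exact_match(expected: str, output: str) -> float:
    """Minimal reference-based metric: 1.0 on exact match, else 0.0."""
    return 1.0 if expected.strip() == output.strip() else 0.0

def score(output: str, expected: Optional[str] = None) -> float:
    """Use the expected output when one was provided during
    configuration; otherwise fall back to the judge model."""
    if expected is not None:
        return exact_match(expected, output)
    return judge_with_llm(output)
```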
- Evaluations can be Run automatically, using the system to compare source and output performance. See Run an AI Evaluation automatically.
- Evaluations can be Run manually, with users making the comparison. See Run AI Evaluations manually.
Evaluation criteria
These criteria have been researched and selected based on their ability to support highly functional solutions. Scores in these areas help decision makers identify improvements that impact the quality and effectiveness of genAI solutions.
| Principle | Description | Use cases |
|---|---|---|
| Summarization | Measures complete and factual alignment between the output and the source. | Analysis, content moderation |
| Text generation | Measures the relevance and accuracy of AI-generated text compared to the source information. | Customer feedback, financial documents |
| Text extraction | Validates that extracted text aligns with ground truth data and the provided inputs. | Question answering, info extraction |
| Text classification | Verifies that subject categories match between output and source. | Research |
Audit logs
Admins can view session and event details for each completed evaluation in AI Governance. See AI Governance.
Permissions and access
Admins can enable AI Evaluations by assigning permissions on the role page for the respective users. These permissions are essential for managing access and functionality related to AI Evaluations.
Permissions:
- View AI Evaluations: Allows users to view AI Evaluations scores and reasoning. Access is limited to folders and AI Skills that the user is assigned (example: public).
- Manage AI Evaluations: Required for users to run evaluations and manage data sets.