AI Evaluations
- Updated: 2025/07/10
AI Evaluations enable administrators to track and rate the output of generative AI capabilities.
AI Evaluations overview
AI Evaluations are tailored for pro-developers and automation administrators. Pro-developers use these tools to evaluate AI Skills and AI Agents during the design phase, ensuring those capabilities meet the required standards. Automation administrators use AI Evaluations to enforce AI Guardrails at runtime, preventing unintended outcomes and maintaining control over automations.
These evaluations provide a robust set of tools for assessing the quality and performance of Large Language Models (LLMs), such as GPT-4o. They support summarization, classification, and real-time output comparison, ensuring alignment with user expectations and industry standards. Multilayer perceptron (MLP) metrics and machine learning insights provide the backbone of this comprehensive evaluation framework.
With the growing adoption of generative AI, there is a pressing need for tools that assess model quality prior to organizational deployment and scaling. AI Evaluations fulfill this requirement by offering a simple and intuitive interface that accelerates the assessment process. By conducting thorough evaluations, users can mitigate risks associated with degraded model performance and compromised quality, ensuring reliable AI solutions.
The evaluation process employs a systematic approach, utilizing MLP metrics and research-driven insights to deliver detailed assessments. It involves comparing model outputs to desired outcomes, monitoring for performance drifts, and prompting revisions when necessary. This continuous refinement ensures that AI models remain effective and optimized for user needs.
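The comparison step can be pictured as scoring a generated output against its desired outcome and flagging the run for revision when the score falls below an acceptable threshold. The Python sketch below is illustrative only: the similarity measure (difflib.SequenceMatcher) and the 0.8 threshold are assumptions for demonstration, not the metrics AI Evaluations actually applies.

```python
# Illustrative sketch only: this is not AI Evaluations' internal scoring.
# It demonstrates the general idea of comparing a generated output to a
# desired output and flagging the run for revision when similarity drops
# below a threshold. The metric and the 0.8 threshold are assumptions.

from difflib import SequenceMatcher


def similarity(generated: str, desired: str) -> float:
    """Return a 0..1 similarity ratio between generated and desired text."""
    return SequenceMatcher(None, generated.lower(), desired.lower()).ratio()


def evaluate(generated: str, desired: str, threshold: float = 0.8) -> dict:
    """Score one output against its desired outcome and flag drift."""
    score = similarity(generated, desired)
    return {
        "score": round(score, 3),
        "passed": score >= threshold,
        "action": "accept" if score >= threshold else "revise and rerun",
    }


if __name__ == "__main__":
    desired = "The invoice total is $1,250 and is due on March 31."
    generated = "The invoice totals $1,250 and is due on March 31."
    print(evaluate(generated, desired))
```

In practice, a single evaluation typically aggregates several such metrics rather than relying on one similarity score.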
When and where to apply evaluations
AI Evaluations are applicable at various stages of AI development and deployment. During the design phase, pro-developers can use evaluations to refine AI Skills and AI Agents. At runtime, automation administrators enforce guardrails and monitor performance to detect and address deviations from expected outcomes. This ongoing evaluation process ensures the sustained reliability and effectiveness of AI models.
These evaluations are seamlessly integrated into the development environments used by developers and the automation platforms managed by administrators. They offer a centralized, user-friendly web interface that provides access to key metrics and evaluation tools, eliminating the need for specialized machine learning expertise. This accessibility ensures that users can efficiently conduct evaluations and optimize their AI models.
Key concepts
Output Comparison: A key feature of AI Evaluations is the ability to compare the output generated by a language model to a predefined desired output. This ensures alignment with specified criteria and standards, optimizing the relevance and accuracy of the generated content.
Simultaneous Evaluation in AI Skill Development: As AI Skills evolve, evaluations can be conducted concurrently with model development. This continuous evaluation process allows for real-time adjustments and improvements, fostering the dynamic enhancement of AI capabilities.
Integration with AI Agents: AI Evaluations are poised to play a pivotal role in the development of AI Agents. As these evaluations demonstrate success, they will be integrated to bolster the functionality and effectiveness of AI Agents.
Metrics and Research Insights: The evaluation process is underpinned by MLP metrics and comprehensive machine learning research. These metrics provide valuable insights into the benefits and performance of AI models. Where variables are incomplete, the system prompts a revision and rerun of the evaluation, ensuring thoroughness and precision (a hypothetical sketch of this check follows this list).
Evaluator and Metrics Origin: The evaluator functions as the mechanism for assessing AI outputs, drawing metrics from extensive research and empirical data. This ensures that evaluations are grounded in reliable and validated information.
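To make the evaluator and rerun behavior above concrete, the following minimal Python sketch models one evaluation run. The EvaluationResult structure, run_evaluation function, and metric names are hypothetical illustrations, not the platform's actual objects or API; they only show how incomplete variables could trigger a revision and rerun before any metric scores are recorded.

```python
# Hypothetical sketch: these names and scores are illustrative, not the
# platform's real API. Shows how incomplete variables can force a rerun
# before metric scores are recorded.

from dataclasses import dataclass, field


@dataclass
class EvaluationResult:
    skill_name: str
    metric_scores: dict = field(default_factory=dict)      # e.g. {"relevance": 0.92}
    missing_variables: list = field(default_factory=list)  # variables left blank

    @property
    def needs_rerun(self) -> bool:
        # Incomplete variables mean the evaluation cannot be trusted yet.
        return bool(self.missing_variables)


def run_evaluation(skill_name: str, variables: dict, required: list) -> EvaluationResult:
    """Validate inputs, then (in a real evaluator) score the skill's output."""
    missing = [name for name in required if variables.get(name) in (None, "")]
    result = EvaluationResult(skill_name=skill_name, missing_variables=missing)
    if result.needs_rerun:
        return result  # caller revises the variables and reruns the evaluation
    # Placeholder scores; a real evaluator derives these from its own metrics.
    result.metric_scores = {"relevance": 0.91, "accuracy": 0.88}
    return result


if __name__ == "__main__":
    outcome = run_evaluation(
        skill_name="Summarize support ticket",
        variables={"ticket_text": "Customer reports login failure.", "tone": ""},
        required=["ticket_text", "tone"],
    )
    print(outcome.needs_rerun, outcome.missing_variables)  # True ['tone']
```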
Permissions and access
Permissions:
- View AI Evaluations: This permission allows users to view AI Evaluations and configurations. Users with access to specific AI Skills will be able to view the evaluations associated with those skills.
- Manage AI Evaluations: This permission grants users the ability to create evaluations and manage data sets. It is crucial for users who need to actively engage in the evaluation process.
These features are available exclusively to Enterprise customers, ensuring that organizations with advanced needs can fully leverage the capabilities of AI Evaluations.