AI Evaluations enables administrators to track and rate the output of generative AI capabilities.

AI Evaluations overview


Graphic describes benefits that AI Evaluations offers.

Note: For best results, ensure you are using AI Skill package version 9.0.0.

AI Evaluations is a tool designed for assessing the characteristics and capabilities of generative AI (genAI) systems. It includes metrics and methodologies to quantify and qualify aspects such as performance, robustness, fairness, safety, interpretability, and alignment with intended objectives and ethical principles. AI Evaluations is designed for pro-developers to evaluate and qualify AI Skills during the design phase, ensuring they meet the required standards.

With the growing adoption of generative AI, there is a pressing need for tools that assess model quality before organizational deployment and scaling. AI Evaluations fulfills this requirement by offering a simple and intuitive interface that accelerates the assessment process. By conducting thorough evaluations, users can mitigate the risks associated with degraded model performance and compromised quality, ensuring reliable AI solutions.

The following diagram illustrates how an evaluation is performed.
Diagram demonstrates several operations performed in an evaluation.

These evaluations leverage Natural Language Processing (NLP) and Large Language Models (LLMs) to judge outputs and deliver scores, providing insights into how to improve AI systems. The evaluation process employs a systematic approach, using NLP metrics and research-driven insights to conduct detailed assessments. This involves comparing model outputs to desired outcomes, monitoring for performance drift, and prompting revisions when necessary. Continuous refinement ensures that AI models remain effective and optimized for user needs.
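As an illustration of the reference-based part of this process, the following Python sketch scores a model output against an expected output with a simple token-overlap F1. The function name and metric choice are assumptions made for illustration only; the metrics that an evaluation actually applies are described in Metrics for AI Evaluations.

```python
# Illustrative sketch only: a simple reference-based comparison similar in
# spirit to the NLP metrics an evaluation computes. Not the product's
# implementation; names and the metric choice are assumptions.

def token_f1(model_output: str, expected_output: str) -> float:
    """Token-level F1 overlap between a model output and an expected output."""
    pred = model_output.lower().split()
    ref = expected_output.lower().split()
    if not pred or not ref:
        return 0.0
    # Count how many predicted tokens also appear in the reference.
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in pred:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The invoice total is 240 USD", "Invoice total: 240 USD"))
```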

AI Evaluations landing page

Evaluations are seamlessly integrated into development. A centralized, user-friendly interface provides access to scores, metrics, and evaluation tools, reducing the need for specialized machine learning expertise. This accessibility ensures that users can efficiently conduct evaluations and optimize AI models as needed.
  • Find completed evaluations under the Evaluations tab.
  • Click the evaluation Name to display insights from your completed evaluation.

Key concepts

Output Comparison: A key feature of AI Evaluations is the ability to compare the output generated by a language model to a predefined, desired output. This ensures alignment with specified criteria and standards, optimizing the relevance and accuracy of the generated content. You can upload a data set of expected outputs, or enter them manually, to use for these comparisons.
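The following sketch shows what such a comparison data set might look like. The prompt and expected_output column names are illustrative assumptions; the actual upload format is defined when you configure an evaluation run.

```python
import csv
import io

# Hypothetical comparison data set: each row pairs a prompt with the desired
# (expected) output. Column names here are illustrative assumptions.
dataset_csv = """prompt,expected_output
Summarize the refund policy.,Refunds are issued within 14 days of purchase.
Extract the invoice number from the email.,INV-00482
"""

for row in csv.DictReader(io.StringIO(dataset_csv)):
    print(row["prompt"], "->", row["expected_output"])
```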

Simultaneous Evaluation in AI Skill Development: As AI Skills evolve, evaluations can be conducted concurrently with model development. This iterative process allows for real-time adjustments and improvements, fostering the dynamic enhancement of AI capabilities.

Metrics and Research Insights: The evaluation process is underpinned by industry-standard NLP metrics and comprehensive machine learning research. These metrics provide the framework that delivers valuable insights into the benefits and performance of AI models. In instances where variables are incomplete, the system prompts a revision and rerun of the evaluation, ensuring thoroughness and precision.

The computation of these NLP metrics relies on the expected outputs that you provide during configuration of the evaluation run. In cases where expected outputs are not available in the evaluation, the LLM-as-a-judge uses predefined metrics to deliver scores.
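The following sketch illustrates that decision in Python. The function and parameter names, the rubric prompt, and the call_llm helper are assumptions made for illustration; they do not represent the product's implementation.

```python
# Sketch of the decision described above, with assumed names throughout:
# score with a reference-based check when an expected output exists, and
# fall back to an LLM-as-a-judge rubric when it does not. `call_llm` is a
# hypothetical callable standing in for whatever model endpoint is available.
from typing import Callable, Optional

JUDGE_PROMPT = (
    "Rate the response from 1 (poor) to 5 (excellent) for relevance and "
    "accuracy with respect to the task.\n"
    "Task: {task}\nResponse: {response}\n"
    "Return only the number."
)

def score_output(task: str, response: str, expected: Optional[str],
                 call_llm: Callable[[str], str]) -> float:
    if expected is not None:
        # Reference available: a crude normalized exact-match check stands in
        # for the richer reference-based NLP metrics an evaluation would use.
        return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0
    # No reference: ask an LLM judge to rate the response against a rubric,
    # then normalize the 1-5 rating to a 0-1 score.
    rating = float(call_llm(JUDGE_PROMPT.format(task=task, response=response)).strip())
    return rating / 5.0
```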

Evaluator and Metrics Origin: The evaluator functions as the mechanism for assessing AI outputs, drawing metrics from industry standards. When LLM-as-a-judge is used, this mechanism is research-based to ensure that evaluations take a human-like and comprehensive approach.

Evaluation criteria

These criteria have been researched and selected based on their ability to support highly functional solutions. Scores in these areas help decision makers identify improvements that affect the quality and effectiveness of genAI solutions.

Evaluations focus on four key task types of AI capabilities that are essential for common use cases. Each use case is categorized to match the tasks and predefined metrics necessary to deliver scores and insights into performance. For more details, see Metrics for AI Evaluations.
Table 1.
Task type | Description | Use cases
Summarization | Completeness and factual alignment between the output and the source. | Analysis, content moderation
Text generation | Relevance and accuracy of the generated text compared to the source information. | Customer feedback, financial documents
Text extraction | Alignment of extracted text with the ground truth data in the provided inputs. | Question answering, information extraction
Text classification | Correctness of the categories assigned in the output compared to the source. | Research
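For orientation, the following sketch lists NLP metrics commonly used in the industry for each task type. These metric names are general examples, not the product's predefined metrics; see Metrics for AI Evaluations for the metrics that are actually applied.

```python
# NLP metrics commonly used in the industry for each task type, shown only
# for orientation; these are general examples, not the product's predefined
# metrics (see Metrics for AI Evaluations for those).
COMMON_METRICS_BY_TASK = {
    "Summarization": ["ROUGE", "factual consistency"],
    "Text generation": ["BLEU", "ROUGE", "relevance"],
    "Text extraction": ["exact match", "token-level F1"],
    "Text classification": ["accuracy", "precision / recall / F1"],
}
```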

Audit logs

Administrators can view session and event details for each completed evaluation in AI Governance. See AI Governance.

Permissions and access

Administrators can enable AI Evaluations by selecting permissions on the role page for the respective users. These permissions are essential for managing access and functionality related to AI Evaluations.

Permissions:

  • View AI Evaluations: This permission allows users to view AI Evaluations scores and reasoning. Access is limited to the folders and AI Skills to which the user is assigned (for example, public).

  • Manage AI Evaluations: This permission is required for users to run evaluations and manage data sets.