Automation 360

Metrics for AI Evaluations

Updated: 2025/11/13

Gain insight into AI performance with a deeper understanding of the metrics and dimensions used for scoring.

Each evaluation returns scores for the quality of performance. When you use Evaluate automatically, the LLM judges the output against a variety of metrics and returns a score for each.

Metrics

Evaluations are scored according to four key task types of AI capabilities. Each task type has corresponding dimensions and metrics that produce the scores for the evaluation. An evaluation is complete when every dimension is resolved.

Table 1.
Task type	Metric	Type	Definition
Summarization	Factual consistency	LLM as a Judge	The factual alignment between the summary and the summarized source.
	Completeness	LLM as a Judge	Whether the summary captures the key points of the source.
	Bleu-2	NLP	This metric measures the precision of bigrams (sequences of two consecutive words) in the LLM output that match the reference text.
	Rouge-2	NLP	This metric measures recall of bigrams (sequences of two consecutive words) from the reference text that appear in the LLM output.
Text generation	Answer relevance	LLM as a Judge	How relevant is the LLM output compared to the provided input?
	Hallucination	LLM as a Judge	Whether the LLM generates factually correct information, determined by comparing the actual output to the provided context.
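	Bleu-2	NLP	The answer's alignment with the ground truth, measured as the precision of bigrams in the LLM output that match the ground-truth text.
	Rouge-2	NLP	The answer's alignment with the ground truth, measured as the recall of bigrams from the ground-truth text that appear in the LLM output.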
Text extraction	Answer relevance	LLM as a Judge	How relevant is the LLM output compared to the provided input?
	Hallucination	LLM as a Judge	Whether the LLM generates factually correct information, determined by comparing the actual output to the provided context.
	Ground truth equivalence	LLM as a Judge	How closely the answer aligns with the ground truth.
Text classification	Correctness	LLM as a Judge	Whether the predicted label is correct.
	Exact match	NLP	This metric checks for an exact match between the expected output and the actual output.
	Quasi exact match	NLP	This metric checks for an exact match between the expected output and the actual output after normalizing both by lower-casing, removing punctuation and articles, and stripping extra whitespace.
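
The NLP metrics in Table 1 are deterministic, so their definitions can be illustrated in a few lines of Python. The following is a minimal sketch that follows only the wording of the table; it is not the Automation 360 implementation. All function names are hypothetical, tokenization is assumed to be lower-cased whitespace splitting, and the article list (a, an, the) is an assumption. Standard BLEU also combines unigram precision and a brevity penalty, which this sketch leaves out.

```python
import re
import string
from collections import Counter


def bigrams(text):
    """Return consecutive word pairs (assumes lower-cased whitespace tokens)."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_overlap(output, reference):
    """Count bigrams shared by both texts, clipped to the smaller count."""
    shared = Counter(bigrams(output)) & Counter(bigrams(reference))
    return sum(shared.values())


def bigram_precision(output, reference):
    """Share of output bigrams found in the reference (the idea behind Bleu-2)."""
    total = len(bigrams(output))
    return bigram_overlap(output, reference) / total if total else 0.0


def bigram_recall(output, reference):
    """Share of reference bigrams found in the output (the idea behind Rouge-2)."""
    total = len(bigrams(reference))
    return bigram_overlap(output, reference) / total if total else 0.0


def normalize(text):
    """Lower-case, remove punctuation and articles, collapse extra whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # assumed article list
    return " ".join(text.split())


def exact_match(expected, actual):
    """True only when the two strings are identical."""
    return expected == actual


def quasi_exact_match(expected, actual):
    """True when the two strings are identical after normalization."""
    return normalize(expected) == normalize(actual)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    output = "the cat sat"
    print(bigram_precision(output, reference))  # 1.0
    print(bigram_recall(output, reference))     # 0.4
    print(exact_match("Invoice", " The invoice. "))        # False
    print(quasi_exact_match("Invoice", " The invoice. "))  # True
```

In this example the short output scores perfect precision because both of its bigrams appear in the reference, but low recall because it covers only two of the five reference bigrams. This is why the table pairs the two metrics. The last two calls show how a predicted label that differs only in case, punctuation, and a leading article fails Exact match but passes Quasi exact match.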