Gain insight into AI performance with a deeper understanding of the metrics and scoring dimensions.

Each evaluation returns scores that rate the quality of performance. When you use Evaluate automatically, the LLM judges the output against a variety of metrics and returns the scores.

Metrics

Evaluations are scored according to four key task types of AI capabilities: summarization, text generation, text extraction, and text classification. Each task type has corresponding dimensions and metrics that produce the scores for the evaluation. An evaluation is complete when every dimension is resolved.

Table 1.
| Task type | Metric | Metric type | Definition |
| --- | --- | --- | --- |
| Summarization | Factual consistency | LLM as a Judge | The factual alignment between the summary and the summarized source. |
| Summarization | Completeness | LLM as a Judge | Whether the summary captures the key points of the source. |
| Summarization | Bleu-2 | NLP | Measures the precision of bigrams (sequences of two consecutive words) in the LLM output that match the reference text. |
| Summarization | Rouge-2 | NLP | Measures the recall of bigrams (sequences of two consecutive words) from the reference text that appear in the LLM output. |
| Text generation | Answer relevance | LLM as a Judge | How relevant the LLM output is to the provided input. |
| Text generation | Hallucination | LLM as a Judge | Whether the LLM generates factually correct information, judged by comparing the actual output to the provided context. |
| Text generation | Bleu-2 | NLP | The answer's alignment with the ground truth, measured as the precision of bigrams in the LLM output that match the reference text. |
| Text generation | Rouge-2 | NLP | The answer's coverage of the ground truth, measured as the recall of bigrams from the reference text that appear in the LLM output. |
| Text extraction | Answer relevance | LLM as a Judge | How relevant the LLM output is to the provided input. |
| Text extraction | Hallucination | LLM as a Judge | Whether the LLM generates factually correct information, judged by comparing the actual output to the provided context. |
| Text extraction | Ground truth equivalence | LLM as a Judge | The answer's alignment with the ground truth. |
| Text classification | Correctness | LLM as a Judge | Whether the predicted label is correct. |
| Text classification | Exact match | NLP | Checks for an exact match between the expected output and the actual output. |
| Text classification | Quasi exact match | NLP | Checks for an exact match between the expected output and the actual output after normalizing both by lowercasing, removing punctuation and articles, and stripping extra white space. |
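
The NLP metrics in the table can be illustrated with a short Python sketch. This is not the product's implementation: the function names, the whitespace tokenization, and the article list (a, an, the) are assumptions made for illustration. The bigram functions compute only the bigram precision and recall described in the table, not a full BLEU or ROUGE score.

```python
import re
import string
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return consecutive word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_overlap(output, reference):
    """Count bigrams present in both texts, clipped to the smaller count."""
    shared = Counter(bigrams(output)) & Counter(bigrams(reference))
    return sum(shared.values())


def bigram_precision(output, reference):
    """Share of output bigrams that also appear in the reference (Bleu-2 style)."""
    total = len(bigrams(output))
    return bigram_overlap(output, reference) / total if total else 0.0


def bigram_recall(output, reference):
    """Share of reference bigrams that also appear in the output (Rouge-2 style)."""
    total = len(bigrams(reference))
    return bigram_overlap(output, reference) / total if total else 0.0


def exact_match(expected, actual):
    """True only when the two strings are identical."""
    return expected == actual


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse white space."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # assumed article list
    return " ".join(text.split())


def quasi_exact_match(expected, actual):
    """Exact match applied after normalizing both strings."""
    return normalize(expected) == normalize(actual)


reference = "the cat sat on the mat"
output = "the cat sat"
print(bigram_precision(output, reference))  # 1.0: both output bigrams are in the reference
print(bigram_recall(output, reference))     # 0.4: 2 of 5 reference bigrams are covered
print(exact_match("The Invoice.", "invoice"))        # False
print(quasi_exact_match("The Invoice.", "invoice"))  # True
```

In this example the short output scores perfect precision but low recall, which is why the two bigram metrics are reported together. The last two lines show how quasi exact match accepts a label that differs from the expected one only in case, punctuation, and articles.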