Evaluation Metrics for Language Models
Language models are evaluated by their performance on tasks such as next-token prediction, classification, generation, and question answering. Below, we begin with a detailed explanation of perplexity, a fundamental metric, and then briefly cover other common metrics.
1. Perplexity (PPL)
Definition:
\[\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(x_i \mid x_{<i}) \right)\]Variables:
- $N$: Total number of predicted tokens
- $x_i$: The ground truth token at position $i$
- $x_{<i}$: All tokens before $i$
- $p(x_i \mid x_{<i})$: Model’s predicted probability for the ground truth token given the context
Ground Truth:
- In perplexity, the ground truth is always the next token that the model should predict.
- Example: For the sentence “The cat sat”, if input is “The”, then ground truth is “cat”.
Perplexity on a Batch:
Suppose you have a batch of $B$ sequences, each of length $T$. The model predicts the next token at each step. Perplexity is computed as:
\[\text{Perplexity}_{\text{batch}} = \exp\left(-\frac{1}{B \cdot T} \sum_{b=1}^{B} \sum_{t=1}^{T} \log p(x_t^{(b)} \mid x_{<t}^{(b)}) \right)\]Where:
- $B$: batch size
- $T$: number of predicted tokens per sequence
- $x_t^{(b)}$: the token at time step $t$ in sequence $b$
- $p(x_t^{(b)} \mid x_{<t}^{(b)})$: model’s predicted probability for the correct next token
Why Exponential?
- The inner sum is the average negative log-likelihood, which is in log-space.
- Taking $\exp$ maps it back to the original probability scale.
- This gives perplexity a more interpretable meaning: how many equally likely choices the model is guessing between on average.
Intuition:
- Perplexity measures how surprised the model is by the correct next tokens.
- Lower perplexity = better performance.
- A model that predicts uniformly over a vocabulary of size $V$ has perplexity $V$.
- If a model assigns high probability to the correct token at each step, perplexity is low.
- Common in language modeling benchmarks like WikiText or Penn Treebank.
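A minimal sketch of the batch formula above, assuming PyTorch and using random stand-in logits and targets (the shapes $B$, $T$, $V$ are made up for illustration):
```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch size B, sequence length T, vocabulary size V
B, T, V = 2, 4, 10
logits = torch.randn(B, T, V)           # stand-in for model next-token logits
targets = torch.randint(0, V, (B, T))   # stand-in for ground-truth next tokens

# Average negative log-likelihood over all B*T predicted tokens ...
nll = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="mean")
# ... exponentiated back to the probability scale
perplexity = torch.exp(nll)
print(f"Perplexity: {perplexity.item():.2f}")
```
In practice, padding positions are usually excluded from the average (for example via cross_entropy's ignore_index argument).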
Other Metrics (Brief Overview)
2. Accuracy
\[\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}\]- Used for classification tasks.
3. Precision, Recall, F1
\[\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]- Used when labels are imbalanced or multi-label.
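A quick sketch of these classification metrics on a made-up binary prediction vector, using scikit-learn as one possible tool:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground-truth labels and predictions (binary classification)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 4/6 ≈ 0.667
print(precision_score(y_true, y_pred))  # 2/2 = 1.0
print(recall_score(y_true, y_pred))     # 2/4 = 0.5
print(f1_score(y_true, y_pred))         # 2 * 1.0 * 0.5 / (1.0 + 0.5) ≈ 0.667
```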
4. BLEU
🔷 4.1. What is BLEU?
BLEU is an automatic evaluation metric that compares a candidate translation (generated by a model) against one or more reference translations (usually human-written). It is based on n-gram precision, with a brevity penalty to discourage overly short translations.
🔷 4.2. Core Intuition
BLEU tries to answer:
“How many phrases in the candidate translation appear in the reference translation(s)?”
It measures:
- n-gram overlap: How many unigrams, bigrams, trigrams, etc., from the candidate appear in the reference?
- Precision, not recall.
- Adjusted for sentence length via a brevity penalty.
🔷 4.3. Step-by-Step BLEU Calculation
Let’s denote:
- $C$: Candidate translation
- $R$: Reference translation(s)
📌 Step 1: n-gram Precision
For each $n \in \{1, 2, \ldots, N\}$:
- Extract all n-grams from candidate and references.
- Count how many n-grams in the candidate also appear in the reference(s) (with clipping).
✅ Clipping prevents the candidate from inflating precision by repeating the same n-gram.
📌 Step 2: Geometric Mean of Precisions
We compute the geometric mean of n-gram precisions up to $N$-gram (commonly $N=4$):
\[\text{GeoMean} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log(\text{Precision}_n) \right)\]Where:
- $w_n$: weight for each $n$ (usually $w_n = \frac{1}{N}$)
📌 Step 3: Brevity Penalty (BP)
If candidate is shorter than reference, we penalize:
\[\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \leq r \end{cases}\]Where:
- $c$: length of candidate
- $r$: length of reference (or closest reference if multiple)
📌 Step 4: Final BLEU Score
\[\text{BLEU} = \text{BP} \cdot \text{GeoMean}\]🔷 4.4. Example
Candidate:
“the cat is on the mat”
Reference:
“there is a cat on the mat”
Unigram Precision:
- Candidate: [the, cat, is, on, the, mat]
- Reference: [there, is, a, cat, on, the, mat]
Counts:
- the (×2 in candidate) → appears once in reference → min(2, 1) = 1
- cat → in both → 1
- is → in both → 1
- on → in both → 1
- mat → in both → 1
Unigram precision:
\[P_1 = \frac{1+1+1+1+1}{6} = \frac{5}{6}\](Repeat for bigram, trigram, etc., to compute full BLEU)
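A minimal sketch of the clipped unigram-precision step for this example (not a full BLEU implementation; libraries such as sacrebleu or NLTK's sentence_bleu additionally handle higher-order n-grams, smoothing, and the brevity penalty):
```python
from collections import Counter

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Clip each candidate unigram count by its count in the reference
clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
p1 = clipped / len(candidate)
print(p1)  # 5/6 ≈ 0.833
```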
🔷 4.5. Use Cases
- Machine Translation
- Text Generation
- Summarization
🔷 4.6. Pros and Cons
✅ Pros | ❌ Cons |
---|---|
Fast and simple | Ignores synonyms and meaning |
Correlates with human judgment | Sensitive to exact wording |
Standard benchmark | Not ideal for open-ended generation |
🔷 4.7. Variants
- METEOR: Uses stemming, synonyms
- ROUGE: Better for summarization (recall-based)
- CHRF: Character n-gram F-score
- BLEURT, COMET: Use neural models
5. ROUGE
🔷 5.1. What is ROUGE?
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics designed to evaluate automatic summaries or text generation systems by comparing them to reference summaries.
Unlike BLEU, which focuses on precision, ROUGE is centered on recall — i.e., how much of the reference text is captured by the generated output.
🔷 5.2. ROUGE Variants and Intuition
There are multiple variants of ROUGE. The main ones are:
Metric | Measures | Type |
---|---|---|
ROUGE-N | Overlap of n-grams | Recall |
ROUGE-L | Longest common subsequence (LCS) | Sequence-level |
ROUGE-W | Weighted LCS | Sequence-level |
ROUGE-S | Skip-bigram | Pair-based |
We’ll focus on the most commonly used:
- ROUGE-1 (unigram recall)
- ROUGE-2 (bigram recall)
- ROUGE-L (longest common subsequence)
🔷 5.3. Formulas and Explanations
Let:
- $C$: Candidate (generated text)
- $R$: Reference (human-written text)
🔹 A. ROUGE-N (n-gram recall)
\[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in R} \min(\text{Count}_C(n\text{-gram}), \text{Count}_R(n\text{-gram}))}{\sum_{\text{n-gram} \in R} \text{Count}_R(n\text{-gram})}\]This is a recall-based metric:
- Numerator: reference n-grams that also appear in the candidate (with clipped counts)
- Denominator: total n-grams in reference
Intuition: What proportion of reference n-grams are covered by the candidate?
Example (ROUGE-1):
Reference: "the cat is on the mat"
Candidate: "the mat is clean"
Unigrams in reference: the, cat, is, on, the, mat
Unigrams in candidate: the, mat, is, clean
Overlap: the (1), is (1), mat (1)
→ count = 3
Total unigrams in reference = 6 → ROUGE-1 = $\frac{3}{6} = 0.5$
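A minimal sketch of this ROUGE-1 recall computation (clipped counts, as in the formula above):
```python
from collections import Counter

reference = "the cat is on the mat".split()
candidate = "the mat is clean".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)

# Sum over reference unigrams, clipping by how often each appears in the candidate
overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
print(overlap / len(reference))  # 3/6 = 0.5
```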
🔹 B. ROUGE-L (Longest Common Subsequence)
\[\text{LCS}(C, R) = \text{length of the longest common subsequence}\]Then:
- Recall: $R_{\text{lcs}} = \frac{\text{LCS}(C, R)}{|R|}$
- Precision: $P_{\text{lcs}} = \frac{\text{LCS}(C, R)}{|C|}$
- F1-score: $F_{\text{lcs}} = \frac{(1 + \beta^2) \cdot P_{\text{lcs}} \cdot R_{\text{lcs}}}{R_{\text{lcs}} + \beta^2 \cdot P_{\text{lcs}}}$
Typically, $\beta = 1$ → harmonic mean of precision and recall.
Intuition: Measures how well the in-order structure of the reference is preserved in the candidate.
Example:
Reference: "the cat is on the mat"
Candidate: "cat on mat"
LCS: "cat on mat"
→ length = 3
ROUGE-L recall = $\frac{3}{6}$, precision = $\frac{3}{3}$
F1 = $\frac{2 \cdot 1 \cdot 0.5}{1 + 0.5} = \frac{1}{1.5} \approx 0.667$
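A minimal sketch of ROUGE-L for this example, using a standard LCS dynamic program and $\beta = 1$:
```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat is on the mat".split()
candidate = "cat on mat".split()

lcs = lcs_length(candidate, reference)
recall = lcs / len(reference)     # 3/6 = 0.5
precision = lcs / len(candidate)  # 3/3 = 1.0
print(2 * precision * recall / (precision + recall))  # ≈ 0.667
```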
🔷 5.4. Use Cases of ROUGE
- Text summarization (primary use case)
- Story or article generation
- Question answering
- Paraphrase generation
🔷 5.5. Comparison: BLEU vs ROUGE
Property | BLEU | ROUGE |
---|---|---|
Focus | Precision (what is in candidate) | Recall (what is recovered from reference) |
Common Use | Machine Translation | Summarization |
Metric Type | n-gram precision | n-gram recall, LCS |
Penalizes Length? | Yes (brevity penalty) | No (but lower recall if short) |
Structure Capturing | No (except n-gram continuity) | Yes (via LCS in ROUGE-L) |
🔷 5.6. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Correlates well with human judgment | Surface-level overlap only |
Fast to compute | No semantic matching |
Flexible (supports multiple metrics) | Sensitive to paraphrasing |
🔷 5.7. Summary Table
Metric | Formula/Idea | Focus | Use Case |
---|---|---|---|
ROUGE-1 | Recall of unigrams | Recall | Summarization |
ROUGE-2 | Recall of bigrams | Recall | Summarization |
ROUGE-L | LCS-based F1 (sequence match) | Structure | Story generation |
6. METEOR
🔷 6.1. What is METEOR?
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a semantic-aware evaluation metric originally developed for machine translation, but also used for summarization and other text generation tasks.
It was designed to correlate better with human judgment than BLEU and ROUGE, especially at the sentence level.
🔷 6.2. High-Level Intuition
METEOR measures the similarity between a candidate sentence and one or more reference sentences using unigram matches, but:
- Considers exact matches, stems, synonyms, and paraphrases.
- Uses harmonic mean of precision and recall, favoring recall.
- Penalizes fragmentation in the word alignment to enforce fluency and order.
BLEU focuses only on precision, ROUGE on recall, but METEOR balances both and adds semantic and structural matching.
🔷 6.3. Matching Process
Matches between candidate (C) and reference (R) are based on:
- Exact match (word = word)
- Stem match (e.g., “run” vs “running”)
- Synonym match (e.g., “car” vs “automobile”)
- Paraphrase match (optional)
Matching is done once in order of preference (exact > stem > synonym > paraphrase).
🔷 6.4. METEOR Calculation Steps
🔹 A. Unigram Precision and Recall
Let:
- $m$: Number of matched unigrams
- $|C|$: Number of unigrams in the candidate
- $|R|$: Number of unigrams in the reference
Then unigram precision and recall are $P = \frac{m}{|C|}$ and $R = \frac{m}{|R|}$.
🔹 B. Harmonic Mean (F-score)
METEOR uses a weighted harmonic mean (favoring recall):
\[F_{\text{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P}\]By default, this weights recall nine times as heavily as precision (the weighting can be tuned).
🔹 C. Fragmentation Penalty
Let:
- $ch$: Number of contiguous chunks of matched words (in correct order)
- $m$: Total matches
Then penalty:
\[\text{Penalty} = 0.5 \cdot \left( \frac{ch}{m} \right)^{3}\]More chunks → more fragmented → higher penalty
🔹 D. Final METEOR Score
\[\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})\]🔷 6.5. Example
Reference: "the cat is on the mat"
Candidate: "the mat is on the cat"
Matched unigrams (counting each matched word type once, for simplicity): the, cat, is, on, mat
→ $m = 5$
Unigram precision $P = \frac{5}{6}$, recall $R = \frac{5}{6}$
Assume the matches are broken into 3 contiguous chunks: → $ch = 3, m = 5$
\[\text{Penalty} = 0.5 \cdot \left( \frac{3}{5} \right)^3 = 0.5 \cdot 0.216 = 0.108\] \[\text{METEOR} = 0.833 \cdot (1 - 0.108) \approx 0.743\]🔷 6.6. Advantages Over BLEU/ROUGE
Feature | METEOR | BLEU / ROUGE |
---|---|---|
Recall-sensitive | ✅ Yes | BLEU ❌ (precision), ROUGE ✅ |
Synonyms/stemming | ✅ Yes | ❌ No |
Chunk penalty | ✅ Yes | ❌ No |
Word order | ✅ Yes (via penalty) | BLEU partial, ROUGE-L only |
Sentence-level use | ✅ Good correlation | BLEU correlates poorly at sentence level |
🔷 6.7. When to Use METEOR
- Machine translation tasks
- Text summarization
- Sentence-level generation evaluation
- When you want semantic similarity and meaning-aware evaluation
🔷 6.8. Comparison Summary
Metric | Precision | Recall | Word Order | Semantics | Best Use |
---|---|---|---|---|---|
BLEU | ✅ | ❌ | Partial | ❌ | Translation (corpora) |
ROUGE | ❌ | ✅ | LCS-based | ❌ | Summarization |
METEOR | ✅ | ✅ | ✅ (chunks) | ✅ | Sentence-level tasks |
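As a quick practical check, the example from 6.5 can be scored with NLTK's meteor_score (a sketch; it assumes NLTK and its WordNet data are available, and NLTK's implementation differs in detail from the description above, so the number may not match the hand calculation exactly):
```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

reference = "the cat is on the mat".split()
candidate = "the mat is on the cat".split()

# Recent NLTK versions expect pre-tokenized references and hypothesis
print(round(meteor_score([reference], candidate), 3))
```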
7. BERTScore
BERTScore is a modern, semantic similarity metric for evaluating text generation models like machine translation, summarization, captioning, etc. It uses deep contextual embeddings (from BERT or similar transformers) to compare generated sentences to reference sentences, at the semantic level, not just surface-level n-gram overlap.
🔷 7.1. Why BERTScore?
Traditional metrics like BLEU, ROUGE, and METEOR:
- Rely on exact token overlap.
- Struggle with synonyms, paraphrasing, and semantic equivalence.
- Often fail to reflect human judgment for sentences that are fluent but worded differently.
BERTScore overcomes these by using pretrained language models to evaluate semantic similarity of words in context.
🔷 7.2. Key Intuition
Instead of matching n-grams or counting word overlap, BERTScore:
- Encodes candidate and reference sentences using BERT (or similar transformer).
- Compares embeddings of words in candidate and reference via cosine similarity.
- Uses precision, recall, and F1 based on these similarities.
🔷 7.3. Mathematical Formulation
Let:
- $\mathbf{C} = [\mathbf{c}_1, …, \mathbf{c}_m]$: embeddings of candidate words
- $\mathbf{R} = [\mathbf{r}_1, …, \mathbf{r}_n]$: embeddings of reference words
All embeddings come from a contextual encoder (like BERT), so they’re context-aware.
🔹 Step 1: Cosine Similarity Matrix
Compute a similarity matrix $S \in \mathbb{R}^{m \times n}$:
\[S_{ij} = \cos(\mathbf{c}_i, \mathbf{r}_j)\]This measures how similar each word in the candidate is to each word in the reference.
🔹 Step 2: Precision, Recall
- Precision: For each word in the candidate, take the maximum similarity with any reference word, then average: $\text{Precision} = \frac{1}{m} \sum_{i=1}^{m} \max_j S_{ij}$
- Recall: For each word in the reference, take the maximum similarity with any candidate word, then average: $\text{Recall} = \frac{1}{n} \sum_{j=1}^{n} \max_i S_{ij}$
🔹 Step 3: F1 Score
\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]This F1 is the BERTScore, and can be averaged over many sentence pairs.
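A toy sketch of the greedy-matching core (Steps 1–3), using random vectors in place of real BERT embeddings; the actual bert_score package adds options such as IDF weighting and baseline rescaling:
```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: m candidate tokens, n reference tokens, embedding dim d
m, n, d = 5, 7, 16
C = F.normalize(torch.randn(m, d), dim=-1)  # stand-in candidate token embeddings
R = F.normalize(torch.randn(n, d), dim=-1)  # stand-in reference token embeddings

S = C @ R.T                              # cosine similarity matrix (m x n)
precision = S.max(dim=1).values.mean()   # best reference match per candidate token
recall = S.max(dim=0).values.mean()      # best candidate match per reference token
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision.item():.3f} R={recall.item():.3f} F1={f1.item():.3f}")
```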
🔷 7.4. Example
Reference: “A quick brown fox jumps over the lazy dog”
Candidate: “A fast dark fox leaps over a sleepy dog”
Traditional metrics would score low due to low n-gram overlap. BERTScore would score high because:
- “quick” ≈ “fast”
- “jumps” ≈ “leaps”
- “lazy” ≈ “sleepy”
🔷 7.5. Features of BERTScore
Feature | Description |
---|---|
Uses contextual embeddings | Understands words in context |
Token-level matching | Matches words even if paraphrased |
Supports multilingual BERT | For cross-lingual tasks |
Tunable models | You can use RoBERTa, DeBERTa, etc. |
🔷 7.6. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Strong correlation with human judgment | Slower than BLEU/ROUGE |
Handles synonyms/paraphrases | Depends on quality of language model |
Context-aware comparison | Harder to interpret scores |
Good for long, fluent sentences | May overestimate fluency |
🔷 7.7. Comparison to BLEU/ROUGE/METEOR
Metric | Surface Match | Semantics | Context-Aware | Speed | Interpretable |
---|---|---|---|---|---|
BLEU | ✅ | ❌ | ❌ | ✅ Fast | ✅ (to a degree) |
ROUGE | ✅ | ❌ | ❌ | ✅ Fast | ✅ |
METEOR | ✅ + stems/syns | Limited | ❌ | ✅ | Moderate |
BERTScore | ❌ | ✅ | ✅ | ❌ Slower | ❌ |
🔷 7.8. Code Snippet (Python)
```python
from bert_score import score

candidate = ["A fast dark fox leaps over a sleepy dog"]
reference = ["A quick brown fox jumps over the lazy dog"]

P, R, F1 = score(candidate, reference, lang="en", model_type="bert-base-uncased")
print(f"Precision: {P.item():.4f}, Recall: {R.item():.4f}, F1: {F1.item():.4f}")
```
8. Exact Match (EM)
Exact Match (EM) is a binary evaluation metric that checks whether a model’s output exactly matches the reference answer — character for character or token for token — after normalization.
🔷 8.1. What Is Exact Match (EM)?
Let:
- $C$: Candidate/generated output
- $R$: Reference answer
Then over a dataset of $N$ samples:
\[\text{Exact Match Score} = \frac{1}{N} \sum_{i=1}^{N} \text{EM}(C_i, R_i)\]where $\text{EM}(C_i, R_i) = 1$ if the normalized candidate equals the normalized reference and $0$ otherwise. It is often expressed as a percentage.
🔷 8.2. Where It’s Used
Task | EM Usage |
---|---|
Question Answering | Does the predicted answer match exactly? |
Span Extraction (SQuAD) | Is the predicted span identical to the correct one? |
Code generation | Is the predicted code identical to the ground truth? |
🔷 8.3. Normalization Rules
To avoid penalizing trivial differences, normalization is usually applied:
Common Steps:
- Lowercase both strings
- Remove articles (a, an, the)
- Remove punctuation
- Remove extra whitespace
Example:
Candidate | Reference |
---|---|
“The Eiffel Tower” | “eiffel tower” |
→ After normalization: both become "eiffel tower"
→ EM = 1
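A minimal sketch of this normalization and the per-sample EM check (exact normalization steps vary by benchmark; this follows the common SQuAD-style rules listed above):
```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                        # remove articles
    text = "".join(ch for ch in text if ch not in string.punctuation)  # remove punctuation
    return " ".join(text.split())                                      # collapse whitespace

def exact_match(candidate: str, reference: str) -> int:
    return int(normalize(candidate) == normalize(reference))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
```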
🔷 8.4. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Simple and interpretable | Extremely strict |
Used in exact span tasks (e.g. SQuAD) | No partial credit |
No tuning or hyperparameters | Doesn’t account for synonyms/paraphrasing |
🔷 8.5. Comparison to Other Metrics
Metric | Partial Credit | Semantic Understanding | Use Case |
---|---|---|---|
Exact Match | ❌ No | ❌ No | Span-based QA, Code QA |
F1 (token overlap) | ✅ Yes | ❌ No | QA with multiple valid forms |
BLEU/ROUGE | ✅ Yes | ❌ Limited | Translation, summarization |
BERTScore | ✅ Yes | ✅ Yes | Semantic text generation tasks |
🔷 8.6. When To Use
- When only one correct answer is expected.
- When evaluating retrieval or span selection tasks.
- As a complementary metric with F1 or BLEU to gauge how often the model is perfectly right.
9. Token-Level Accuracy
Token-Level Accuracy is a fine-grained metric that measures how many individual tokens (words, subwords, or characters) in the model’s output match the reference tokens at the same positions.
It is useful when partial correctness matters — for example, in:
- text classification
- sequence labeling (like Named Entity Recognition)
- translation
- code generation
🔷 9.1. Definition
Let:
- $\mathbf{C} = [c_1, c_2, …, c_n]$: Candidate tokens
- $\mathbf{R} = [r_1, r_2, …, r_n]$: Reference tokens
Then:
\[\text{Token-Level Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(c_i = r_i)\]Where:
- $\mathbb{1}(c_i = r_i)$ is 1 if token $i$ matches, 0 otherwise
- $n$: Total number of tokens in the reference
✅ Example 1 (Machine Translation):
Reference: ["the", "cat", "is", "black"]
Candidate: ["the", "dog", "is", "black"]
→ Match: the, is, black → 3 out of 4
→ Token Accuracy = $\frac{3}{4} = 0.75$
✅ Example 2 (NER):
Reference Tags: ["O", "B-PER", "I-PER", "O"]
Predicted Tags: ["O", "B-PER", "O", "O"]
→ Match: 3 out of 4
→ Token Accuracy = $0.75$
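A minimal sketch covering both examples, assuming the candidate and reference are already tokenized and aligned (same length):
```python
def token_accuracy(candidate, reference):
    # Position-by-position comparison; sequences must be the same length
    assert len(candidate) == len(reference), "sequences must be aligned"
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)

print(token_accuracy(["the", "dog", "is", "black"],
                     ["the", "cat", "is", "black"]))  # 0.75
print(token_accuracy(["O", "B-PER", "O", "O"],
                     ["O", "B-PER", "I-PER", "O"]))   # 0.75
```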
🔷 9.2. Applications
Task | Token Match Used For |
---|---|
Named Entity Recognition | Match predicted tag per token |
Part-of-Speech Tagging | Match POS per word |
Machine Translation | Match target tokens |
Code Generation | Match tokens in predicted code |
Autocomplete / LM Tasks | Predict next token |
🔷 9.3. Strengths and Limitations
✅ Pros | ❌ Cons |
---|---|
Fine-grained feedback | Ignores sentence structure |
Useful for sequence tasks | Can reward grammatically incorrect but token-matching outputs |
Easy to compute and interpret | Doesn’t reflect semantic correctness |
🔷 9.4. Comparison Table
Metric | Matches… | Position Sensitive? | Partial Credit | Semantic? |
---|---|---|---|---|
Exact Match | Whole sentence | ✅ Yes | ❌ No | ❌ No |
Token Accuracy | Individual tokens | ✅ Yes | ✅ Yes | ❌ No |
BLEU/ROUGE | n-gram overlap | ✅ Partially | ✅ Yes | ❌/🔸 Partial |
BERTScore | Token embeddings | ❌ No | ✅ Yes | ✅ Yes |
10. Pass@k
Pass@k (pronounced “pass at k”) is a code generation evaluation metric that measures how often at least one out of $k$ generated code samples solves the problem (i.e., passes all test cases).
🔷 10.1. Context: Why Pass@k?
When language models (like Codex or GPT) are asked to generate code, their output may vary with sampling. Instead of judging just the first try, Pass@k answers:
If I sample $k$ solutions, what’s the probability that at least one passes all the test cases?
It’s widely used in code generation benchmarks like:
- HumanEval (OpenAI)
- MBPP (Google)
🔷 10.2. Definition
Let:
- $n$: Total number of samples generated for a problem
- $k$: Number of allowed guesses (usually $k = 1, 5, 10, …$)
- $c$: Number of correct samples (i.e., those that pass all test cases)
If $n \geq k$, the estimated Pass@k is:
\[\text{Pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}\]Intuition:
- $\binom{n - c}{k}$: Number of ways to choose $k$ incorrect samples
- $\binom{n}{k}$: Number of total ways to choose $k$ samples
- So: this is the probability that all $k$ are incorrect, and we subtract from 1 to get the chance that at least one is correct
🔷 10.3. Simplified Cases
- If $c = 0$ → no correct solutions → $\text{Pass@}k = 0$
- If $c = n$ → all correct → $\text{Pass@}k = 1$
- If $k = 1$ → reduces to accuracy over $n$ samples
🔷 10.4. When to Use It
Use Case | Why Pass@k is Helpful |
---|---|
Code generation with sampling | Reflects the benefit of retries |
Evaluating creativity in LMs | Models may succeed on retries |
Human-in-the-loop coding | Models are used interactively |
🔷 10.5. Comparison to Other Metrics
Metric | Measures… | Partial Credit | Uses Sampling |
---|---|---|---|
Accuracy | First-sample correctness | ❌ No | ❌ No |
Token Accuracy | Token-level match | ✅ Yes | ❌ No |
BLEU | Textual overlap (not correctness) | ✅ Yes | ❌ No |
Pass@k | At least one correct in k tries | ✅ Yes (at k) | ✅ Yes |
🔷 10.6. Python Implementation
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n generated samples, c of which are correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of generated samples n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate correctly becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 5), 3))  # ≈ 0.917 (see the example below)
```
🔷 10.7. Example
Say for a coding problem:
- You generate $n = 10$ samples
- $c = 3$ of them pass all tests
- You evaluate Pass@5
\[\text{Pass@}5 = 1 - \frac{\binom{7}{5}}{\binom{10}{5}} = 1 - \frac{21}{252} \approx 0.917\]→ about a 91.7% chance that at least one of the 5 samples solves the problem.
🔷 10.8. Notes on Use in Benchmarks
- HumanEval by OpenAI uses Pass@1, Pass@10, etc.
- Often averaged over all problems.
- Typically assumes deterministic unit testing for correctness.