Evaluation Metrics for Language Models

Language models are evaluated by their performance on tasks such as next-token prediction, classification, generation, and question answering. We begin with a detailed explanation of perplexity, a fundamental metric, and then cover the other common metrics in turn.


1. Perplexity (PPL)

Definition:

\[\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(x_i \mid x_{<i}) \right)\]

Variables:

  • $N$: Total number of predicted tokens
  • $x_i$: The ground truth token at position $i$
  • $x_{<i}$: All tokens before $i$
  • $p(x_i \mid x_{<i})$: Model’s predicted probability for the ground truth token given the context

Ground Truth:

  • In perplexity, the ground truth is always the next token that the model should predict.
  • Example: For the sentence “The cat sat”, if input is “The”, then ground truth is “cat”.

Perplexity on a Batch:

Suppose you have a batch of $B$ sequences, each of length $T$. The model predicts the next token at each step. Perplexity is computed as:

\[\text{Perplexity}_{\text{batch}} = \exp\left(-\frac{1}{B \cdot T} \sum_{b=1}^{B} \sum_{t=1}^{T} \log p(x_t^{(b)} \mid x_{<t}^{(b)}) \right)\]

Where:

  • $B$: batch size
  • $T$: number of predicted tokens per sequence
  • $x_t^{(b)}$: the token at time step $t$ in sequence $b$
  • $p(x_t^{(b)} \mid x_{<t}^{(b)})$: model’s predicted probability for the correct next token
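
To make the batch computation concrete, here is a minimal sketch assuming a PyTorch model that returns per-step logits (the function and variable names are illustrative, not from any particular library):

import torch
import torch.nn.functional as F

def batch_perplexity(logits, targets):
    # logits: (B, T, V) next-token scores; targets: (B, T) ground-truth next tokens
    B, T, V = logits.shape
    # mean negative log-likelihood over all B*T predicted tokens
    nll = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))
    return torch.exp(nll)

# toy demo with random logits over a 100-token vocabulary (no real model involved)
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print(batch_perplexity(logits, targets))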

Why Exponential?

  • The quantity inside the exponential is the average negative log-likelihood (the cross-entropy), which lives in log-space.
  • Taking $\exp$ maps it back to the original probability scale.
  • This gives perplexity a more interpretable meaning: how many equally likely choices the model is guessing between on average.

Intuition:

  • Perplexity measures how surprised the model is by the correct next tokens.
  • Lower perplexity = better performance.
  • A model that predicts uniformly over a vocabulary of size $V$ has perplexity $V$.
  • If a model assigns high probability to the correct token at each step, perplexity is low.
  • Common in language modeling benchmarks like WikiText or Penn Treebank.

Other Metrics (Brief Overview)

2. Accuracy

\[\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}\]
  • Used for classification tasks.

3. Precision, Recall, F1

\[\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision + Recall}}\]
  • Used when labels are imbalanced or multi-label.
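
As a concrete illustration, here is a minimal from-scratch computation for a binary classification task (the helper name and toy labels are illustrative):

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# toy binary-classification example
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # ≈ (0.67, 0.67, 0.67)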

4. BLEU

🔷 4.1. What is BLEU?

BLEU is an automatic evaluation metric that compares a candidate translation (generated by a model) against one or more reference translations (usually human-written). It is based on n-gram precision, with a brevity penalty to discourage overly short translations.


🔷 4.2. Core Intuition

BLEU tries to answer:

“How many phrases in the candidate translation appear in the reference translation(s)?”

It measures:

  • n-gram overlap: How many unigrams, bigrams, trigrams, etc., from the candidate appear in the reference?
  • Precision, not recall.
  • Adjusted for sentence length via a brevity penalty.

🔷 4.3. Step-by-Step BLEU Calculation

Let’s denote:

  • $C$: Candidate translation
  • $R$: Reference translation(s)

📌 Step 1: n-gram Precision

For each $n \in \{1, 2, \ldots, N\}$:

  • Extract all n-grams from candidate and references.
  • Count how many n-grams in the candidate also appear in the reference(s) (with clipping).
\[\text{Precision}_n = \frac{\sum_{\text{n-gram} \in C} \min(\text{Count}_C(n\text{-gram}), \text{MaxRefCount}(n\text{-gram}))}{\sum_{\text{n-gram} \in C} \text{Count}_C(n\text{-gram})}\]

✅ Clipping prevents repeating n-grams in the candidate to inflate precision.


📌 Step 2: Geometric Mean of Precisions

We compute the geometric mean of n-gram precisions up to $N$-gram (commonly $N=4$):

\[\text{GeoMean} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log(\text{Precision}_n) \right)\]

Where:

  • $w_n$: weight for each $n$ (usually $w_n = \frac{1}{N}$)

📌 Step 3: Brevity Penalty (BP)

If the candidate is shorter than the reference, we apply a penalty:

\[\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \leq r \end{cases}\]

Where:

  • $c$: length of candidate
  • $r$: length of reference (or closest reference if multiple)

📌 Step 4: Final BLEU Score

\[\text{BLEU} = \text{BP} \cdot \text{GeoMean}\]
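
In practice BLEU is rarely computed by hand. As a quick sanity check, here is a hedged sketch using NLTK's sentence_bleu (assumes nltk is installed; smoothing is enabled because a short pair may have no matching higher-order n-grams, which would otherwise drive the score to zero):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "there is a cat on the mat".split()
candidate = "the cat is on the mat".split()

# equal weights for 1- to 4-grams, with smoothing for zero n-gram counts
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")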

🔷 4.4. Example

Candidate:

“the cat is on the mat”

Reference:

“there is a cat on the mat”

Unigram Precision:

  • Candidate: [the, cat, is, on, the, mat]
  • Reference: [there, is, a, cat, on, the, mat]

Counts:

  • the (×2 in candidate) → appears once in reference → min(2, 1) = 1
  • cat → in both → 1
  • is → in both → 1
  • on → in both → 1
  • mat → in both → 1

Unigram precision:

\[P_1 = \frac{1+1+1+1+1}{6} = \frac{5}{6}\]

(Repeat for bigram, trigram, etc., to compute full BLEU)
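
The clipped unigram precision above can be reproduced in a few lines of Python (a minimal sketch; the helper name is illustrative):

from collections import Counter

def clipped_ngram_precision(candidate, reference, n):
    # candidate, reference: lists of tokens
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(clipped_ngram_precision(candidate, reference, 1))  # 5/6 ≈ 0.833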


🔷 4.5. Use Cases

  • Machine Translation
  • Text Generation
  • Summarization

🔷 4.6. Pros and Cons

| ✅ Pros | ❌ Cons |
|---|---|
| Fast and simple | Ignores synonyms and meaning |
| Correlates with human judgment | Sensitive to exact wording |
| Standard benchmark | Not ideal for open-ended generation |

🔷 4.7. Variants

  • METEOR: Uses stemming, synonyms
  • ROUGE: Better for summarization (recall-based)
  • CHRF: Character n-gram F-score
  • BLEURT, COMET: Use neural models


5. ROUGE



🔷 5.1. What is ROUGE?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics designed to evaluate automatic summaries or text generation systems by comparing them to reference summaries.

Unlike BLEU, which focuses on precision, ROUGE is centered on recall — i.e., how much of the reference text is captured by the generated output.


🔷 5.2. ROUGE Variants and Intuition

There are multiple variants of ROUGE. The main ones are:

| Metric | Measures | Type |
|---|---|---|
| ROUGE-N | Overlap of n-grams | Recall |
| ROUGE-L | Longest common subsequence (LCS) | Sequence-level |
| ROUGE-W | Weighted LCS | Sequence-level |
| ROUGE-S | Skip-bigram | Pair-based |

We’ll focus on the most commonly used:

  • ROUGE-1 (unigram recall)
  • ROUGE-2 (bigram recall)
  • ROUGE-L (longest common subsequence)

🔷 5.3. Formulas and Explanations

Let:

  • $C$: Candidate (generated text)
  • $R$: Reference (human-written text)

🔹 A. ROUGE-N (n-gram recall)

\[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in R} \min(\text{Count}_C(n\text{-gram}), \text{Count}_R(n\text{-gram}))}{\sum_{\text{n-gram} \in R} \text{Count}_R(n\text{-gram})}\]

This is a recall-based metric:

  • Numerator: number of reference n-grams that also appear in the candidate (counts clipped by the candidate's counts)
  • Denominator: total number of n-grams in the reference

Intuition: What proportion of reference n-grams are covered by the candidate?

Example (ROUGE-1):

Reference: "the cat is on the mat"
Candidate: "the mat is clean"

Unigrams in reference: the, cat, is, on, the, mat
Unigrams in candidate: the, mat, is, clean

Overlap: the (1), is (1), mat (1) → count = 3
Total unigrams in reference = 6

\[\text{ROUGE-1} = \frac{3}{6} = 0.5\]
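
A minimal sketch that reproduces this ROUGE-1 computation (the helper name is illustrative; real toolkits such as rouge-score add tokenization and stemming options):

from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # clipped overlap, normalized by the number of reference n-grams
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

print(rouge_n_recall("the mat is clean".split(),
                     "the cat is on the mat".split()))  # 3/6 = 0.5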

🔹 B. ROUGE-L (Longest Common Subsequence)

\[\text{LCS}(C, R) = \text{length of the longest common subsequence}\]

Then:

  • Recall:
\[R_{\text{LCS}} = \frac{\text{LCS}(C, R)}{|R|}\]
  • Precision:
\[P_{\text{LCS}} = \frac{\text{LCS}(C, R)}{|C|}\]
  • F1-score:
\[\text{ROUGE-L} = \frac{(1 + \beta^2) \cdot P_{\text{LCS}} \cdot R_{\text{LCS}}}{R_{\text{LCS}} + \beta^2 \cdot P_{\text{LCS}}}\]

Typically, $\beta = 1$ → harmonic mean of precision and recall.

Intuition: Measures how well the in-order structure of the reference is preserved in the candidate.

Example:

Reference: "the cat is on the mat"
Candidate: "cat on mat"

LCS: "cat on mat" → length = 3
ROUGE-L recall = $\frac{3}{6} = 0.5$, precision = $\frac{3}{3} = 1$
F1 = $\frac{2 \cdot 1 \cdot 0.5}{1 + 0.5} = \frac{1}{1.5} \approx 0.667$
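
ROUGE-L can likewise be sketched with a classic dynamic-programming LCS (illustrative helper names, $\beta = 1$):

def lcs_length(a, b):
    # dynamic-programming longest common subsequence over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.0):
    lcs = lcs_length(candidate, reference)
    p = lcs / len(candidate)
    r = lcs / len(reference)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (r + beta**2 * p)

print(rouge_l("cat on mat".split(), "the cat is on the mat".split()))  # ≈ 0.667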


🔷 5.4. Use Cases of ROUGE

  • Text summarization (primary use case)
  • Story or article generation
  • Question answering
  • Paraphrase generation

🔷 5.5. Comparison: BLEU vs ROUGE

| Property | BLEU | ROUGE |
|---|---|---|
| Focus | Precision (what is in the candidate) | Recall (what is recovered from the reference) |
| Common Use | Machine Translation | Summarization |
| Metric Type | n-gram precision | n-gram recall, LCS |
| Penalizes Length? | Yes (brevity penalty) | No (but recall drops if output is short) |
| Structure Capturing | No (beyond n-gram contiguity) | Yes (via LCS in ROUGE-L) |

🔷 5.6. Strengths and Weaknesses

| ✅ Pros | ❌ Cons |
|---|---|
| Correlates well with human judgment | Surface-level overlap only |
| Fast to compute | No semantic matching |
| Flexible (supports multiple variants) | Sensitive to paraphrasing |

🔷 5.7. Summary Table

| Metric | Formula / Idea | Focus | Use Case |
|---|---|---|---|
| ROUGE-1 | Recall of unigrams | Recall | Summarization |
| ROUGE-2 | Recall of bigrams | Recall | Summarization |
| ROUGE-L | LCS-based F1 (sequence match) | Structure | Story generation |

6. METEOR

🔷 6.1. What is METEOR?

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a semantic-aware evaluation metric originally developed for machine translation, but also used for summarization and other text generation tasks.

It was designed to correlate better with human judgment than BLEU and ROUGE, especially at the sentence level.


🔷 6.2. High-Level Intuition

METEOR measures the similarity between a candidate sentence and one or more reference sentences using unigram matches, but:

  • Considers exact matches, stems, synonyms, and paraphrases.
  • Uses harmonic mean of precision and recall, favoring recall.
  • Penalizes fragmentation in the word alignment to enforce fluency and order.

BLEU focuses only on precision, ROUGE on recall, but METEOR balances both and adds semantic and structural matching.


🔷 6.3. Matching Process

Matches between candidate (C) and reference (R) are based on:

  1. Exact match (word = word)
  2. Stem match (e.g., “run” vs “running”)
  3. Synonym match (e.g., “car” vs “automobile”)
  4. Paraphrase match (optional)

Matching is done once in order of preference (exact > stem > synonym > paraphrase).


🔷 6.4. METEOR Calculation Steps


🔹 A. Unigram Precision and Recall

Let:

  • $m$: Number of matched unigrams
  • $|C|$: Number of unigrams in the candidate
  • $|R|$: Number of unigrams in the reference
\[P = \frac{m}{|C|}, \quad R = \frac{m}{|R|}\]

🔹 B. Harmonic Mean (F-score)

METEOR uses a weighted harmonic mean (favoring recall):

\[F_{\text{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P}\]

This weights recall 9× more than precision by default (the weighting can be tuned).


🔹 C. Fragmentation Penalty

Let:

  • $ch$: Number of contiguous chunks of matched words (in correct order)
  • $m$: Total matches

Then penalty:

\[\text{Penalty} = 0.5 \cdot \left( \frac{ch}{m} \right)^{3}\]

More chunks → more fragmented → higher penalty


🔹 D. Final METEOR Score

\[\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})\]

🔷 6.5. Example

Reference: "the cat is on the mat"
Candidate: "the mat is on the cat"

Every candidate unigram has an exact match in the reference (including both occurrences of "the"), so $m = 6$.
Unigram precision $P = \frac{6}{6} = 1$, recall $R = \frac{6}{6} = 1$

\[F_{\text{mean}} = \frac{10 \cdot 1 \cdot 1}{1 + 9 \cdot 1} = 1\]

The best alignment groups the matches into 3 contiguous chunks ("the mat", "is on", "the cat"): → $ch = 3, m = 6$

\[\text{Penalty} = 0.5 \cdot \left( \frac{3}{6} \right)^3 = 0.0625\]

\[\text{METEOR} = 1 \cdot (1 - 0.0625) \approx 0.94\]
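
Given the match statistics, the scoring formulas are straightforward to reproduce. Below is a minimal sketch (exact matches only, no stemming or synonyms; the helper name is illustrative) that reproduces the example above:

def meteor_simplified(m, cand_len, ref_len, chunks):
    # m: matched unigrams, chunks: contiguous runs of matches in the alignment
    if m == 0:
        return 0.0
    p = m / cand_len                    # unigram precision
    r = m / ref_len                     # unigram recall
    f_mean = 10 * p * r / (r + 9 * p)   # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / m) ** 3   # fragmentation penalty
    return f_mean * (1 - penalty)

# worked example above: m = 6, |C| = |R| = 6, 3 chunks
print(meteor_simplified(6, 6, 6, 3))  # ≈ 0.94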

🔷 6.6. Advantages Over BLEU/ROUGE

| Feature | METEOR | BLEU / ROUGE |
|---|---|---|
| Recall-sensitive | ✅ Yes | BLEU ❌ (precision only), ROUGE ✅ |
| Synonyms / stemming | ✅ Yes | ❌ No |
| Chunk penalty | ✅ Yes | ❌ No |
| Word order | ✅ Yes (via penalty) | BLEU partial, ROUGE-L only |
| Sentence-level use | ✅ Good correlation | BLEU correlates poorly at sentence level |

🔷 6.7. When to Use METEOR

  • Machine translation tasks
  • Text summarization
  • Sentence-level generation evaluation
  • When you want semantic similarity and meaning-aware evaluation

🔷 6.8. Comparison Summary

| Metric | Precision | Recall | Word Order | Semantics | Best Use |
|---|---|---|---|---|---|
| BLEU | ✅ | ❌ | Partial (n-grams) | ❌ | Translation (corpus level) |
| ROUGE | ❌ | ✅ | LCS-based (ROUGE-L) | ❌ | Summarization |
| METEOR | ✅ | ✅ | ✅ (chunks) | ✅ (stems, synonyms) | Sentence-level tasks |


7. BERTScore

BERTScore is a modern, semantic similarity metric for evaluating text generation models like machine translation, summarization, captioning, etc. It uses deep contextual embeddings (from BERT or similar transformers) to compare generated sentences to reference sentences, at the semantic level, not just surface-level n-gram overlap.


🔷 7.1. Why BERTScore?

Traditional metrics like BLEU, ROUGE, and METEOR:

  • Rely on exact token overlap.
  • Struggle with synonyms, paraphrasing, and semantic equivalence.
  • Often fail to reflect human judgment of fluent outputs that are worded differently from the reference.

BERTScore overcomes these by using pretrained language models to evaluate semantic similarity of words in context.


🔷 7.2. Key Intuition

Instead of matching n-grams or counting word overlap, BERTScore:

  1. Encodes candidate and reference sentences using BERT (or similar transformer).
  2. Compares embeddings of words in candidate and reference via cosine similarity.
  3. Uses precision, recall, and F1 based on these similarities.

🔷 7.3. Mathematical Formulation

Let:

  • $\mathbf{C} = [\mathbf{c}_1, …, \mathbf{c}_m]$: embeddings of candidate words
  • $\mathbf{R} = [\mathbf{r}_1, …, \mathbf{r}_n]$: embeddings of reference words

All embeddings come from a contextual encoder (like BERT), so they’re context-aware.


🔹 Step 1: Cosine Similarity Matrix

Compute a similarity matrix $S \in \mathbb{R}^{m \times n}$:

\[S_{ij} = \cos(\mathbf{c}_i, \mathbf{r}_j)\]

This measures how similar each word in the candidate is to each word in the reference.


🔹 Step 2: Precision, Recall

  • Precision: For each word in candidate, find the max similarity with any reference word.
\[\text{Precision} = \frac{1}{m} \sum_{i=1}^{m} \max_{j} S_{ij}\]
  • Recall: For each word in reference, find max similarity with any candidate word.
\[\text{Recall} = \frac{1}{n} \sum_{j=1}^{n} \max_{i} S_{ij}\]

🔹 Step 3: F1 Score

\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

This F1 is the BERTScore, and can be averaged over many sentence pairs.
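
A minimal NumPy sketch of this greedy-matching step, assuming the token embeddings have already been produced by a contextual encoder (the real bert-score package adds options such as IDF weighting and baseline rescaling):

import numpy as np

def greedy_match_f1(cand_emb, ref_emb):
    # cand_emb: (m, d) candidate token embeddings, ref_emb: (n, d) reference token embeddings
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # S[i, j] = cosine similarity
    precision = sim.max(axis=1).mean()  # best reference match for each candidate token
    recall = sim.max(axis=0).mean()     # best candidate match for each reference token
    return 2 * precision * recall / (precision + recall)

# toy demo with random vectors standing in for contextual embeddings
print(greedy_match_f1(np.random.randn(5, 8), np.random.randn(7, 8)))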


🔷 7.4. Example

Reference: “A quick brown fox jumps over the lazy dog”
Candidate: “A fast dark fox leaps over a sleepy dog”

Traditional metrics would score low due to low n-gram overlap. BERTScore would score high because:

  • “quick” ≈ “fast”
  • “jumps” ≈ “leaps”
  • “lazy” ≈ “sleepy”

🔷 7.5. Features of BERTScore

| Feature | Description |
|---|---|
| Uses contextual embeddings | Understands words in context |
| Token-level matching | Matches words even if paraphrased |
| Supports multilingual BERT | For cross-lingual tasks |
| Tunable models | You can use RoBERTa, DeBERTa, etc. |

🔷 7.6. Strengths and Weaknesses

| ✅ Pros | ❌ Cons |
|---|---|
| Strong correlation with human judgment | Slower than BLEU/ROUGE |
| Handles synonyms/paraphrases | Depends on the quality of the underlying language model |
| Context-aware comparison | Scores are harder to interpret |
| Good for long, fluent sentences | May overestimate fluency |

🔷 7.7. Comparison to BLEU/ROUGE/METEOR

| Metric | Surface Match | Semantics | Context-Aware | Speed | Interpretable |
|---|---|---|---|---|---|
| BLEU | ✅ | ❌ | ❌ | Fast | ✅ (to a degree) |
| ROUGE | ✅ | ❌ | ❌ | Fast | ✅ |
| METEOR | ✅ (+ stems/synonyms) | Limited | ❌ | Moderate | ✅ |
| BERTScore | ❌ | ✅ | ✅ | Slower | ❌ |

🔷 7.8. Code Snippet (Python)

# requires the bert-score package (pip install bert-score)
from bert_score import score

candidate = ["A fast dark fox leaps over a sleepy dog"]
reference = ["A quick brown fox jumps over the lazy dog"]

# returns per-sentence precision, recall, and F1 as tensors
P, R, F1 = score(candidate, reference, lang="en", model_type="bert-base-uncased")
print(f"Precision: {P.item():.4f}, Recall: {R.item():.4f}, F1: {F1.item():.4f}")

8. Exact Match (EM)

Exact Match (EM) is a binary evaluation metric that checks whether a model’s output exactly matches the reference answer — character for character or token for token — after normalization.


🔷 8.1. What Is Exact Match (EM)?

Let:

  • $C$: Candidate/generated output
  • $R$: Reference answer
\[\text{EM}(C, R) = \begin{cases} 1 & \text{if } \text{normalize}(C) = \text{normalize}(R) \\ 0 & \text{otherwise} \end{cases}\]

Then over a dataset of $N$ samples:

\[\text{Exact Match Score} = \frac{1}{N} \sum_{i=1}^{N} \text{EM}(C_i, R_i)\]

It is often expressed as a percentage.


🔷 8.2. Where It’s Used

| Task | EM Usage |
|---|---|
| Question Answering | Does the predicted answer match exactly? |
| Span Extraction (SQuAD) | Is the predicted span identical to the correct one? |
| Code Generation | Is the predicted code identical to the ground truth? |

🔷 8.3. Normalization Rules

To avoid penalizing trivial differences, normalization is usually applied:

Common Steps:

  • Lowercase both strings
  • Remove articles (a, an, the)
  • Remove punctuation
  • Remove extra whitespace

Example:

| Candidate | Reference |
|---|---|
| "The Eiffel Tower" | "eiffel tower" |

→ After normalization: both become "eiffel tower" → EM = 1
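
A minimal sketch of SQuAD-style normalization and EM (the exact normalization rules vary by benchmark; this simply follows the steps listed above):

import re
import string

def normalize(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"\b(a|an|the)\b", " ", text)           # remove articles
    text = "".join(ch for ch in text if ch not in string.punctuation)  # remove punctuation
    return " ".join(text.split())                         # collapse whitespace

def exact_match(candidate, reference):
    return int(normalize(candidate) == normalize(reference))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1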


🔷 8.4. Strengths and Weaknesses

| ✅ Pros | ❌ Cons |
|---|---|
| Simple and interpretable | Extremely strict |
| Used in exact-span tasks (e.g., SQuAD) | No partial credit |
| No tuning or hyperparameters | Doesn't account for synonyms/paraphrasing |

🔷 8.5. Comparison to Other Metrics

| Metric | Partial Credit | Semantic Understanding | Use Case |
|---|---|---|---|
| Exact Match | ❌ No | ❌ No | Span-based QA, code QA |
| F1 (token overlap) | ✅ Yes | ❌ No | QA with multiple valid forms |
| BLEU/ROUGE | ✅ Yes | ❌ Limited | Translation, summarization |
| BERTScore | ✅ Yes | ✅ Yes | Semantic text generation tasks |

🔷 8.6. When To Use

  • When only one correct answer is expected.
  • When evaluating retrieval or span selection tasks.
  • As a complementary metric with F1 or BLEU to gauge how often the model is perfectly right.

9. Token-Level Accuracy

Token-Level Accuracy is a fine-grained metric that measures how many individual tokens (words, subwords, or characters) in the model’s output match the reference tokens at the same positions.

It is useful when partial correctness matters — for example, in:

  • text classification
  • sequence labeling (like Named Entity Recognition)
  • translation
  • code generation

🔷 9.1. Definition

Let:

  • $\mathbf{C} = [c_1, c_2, …, c_n]$: Candidate tokens
  • $\mathbf{R} = [r_1, r_2, …, r_n]$: Reference tokens (both sequences aligned to the same length $n$)

Then:

\[\text{Token-Level Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(c_i = r_i)\]

Where:

  • $\mathbb{1}(c_i = r_i)$ is 1 if token $i$ matches, 0 otherwise
  • $n$: Total number of tokens in the reference

✅ Example 1 (Machine Translation):

Reference: ["the", "cat", "is", "black"] Candidate: ["the", "dog", "is", "black"] → Match: the, is, black → 3 out of 4 → Token Accuracy = $\frac{3}{4} = 0.75$


✅ Example 2 (NER):

Reference Tags: ["O", "B-PER", "I-PER", "O"]
Predicted Tags: ["O", "B-PER", "O", "O"]
→ Match: 3 out of 4 → Token Accuracy = $0.75$
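
A minimal sketch (assumes the candidate and reference are already tokenized and aligned to the same length; the helper name is illustrative):

def token_accuracy(candidate_tokens, reference_tokens):
    # both sequences must be aligned position by position
    assert len(candidate_tokens) == len(reference_tokens)
    matches = sum(c == r for c, r in zip(candidate_tokens, reference_tokens))
    return matches / len(reference_tokens)

print(token_accuracy(["the", "dog", "is", "black"],
                     ["the", "cat", "is", "black"]))  # 0.75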


🔷 9.2. Applications

| Task | Token Match Used For |
|---|---|
| Named Entity Recognition | Match predicted tag per token |
| Part-of-Speech Tagging | Match POS tag per word |
| Machine Translation | Match target tokens |
| Code Generation | Match tokens in predicted code |
| Autocomplete / LM Tasks | Predict the next token |

🔷 9.3. Strengths and Limitations

| ✅ Pros | ❌ Cons |
|---|---|
| Fine-grained feedback | Ignores sentence structure |
| Useful for sequence tasks | Can reward grammatically incorrect but token-matching outputs |
| Easy to compute and interpret | Doesn't reflect semantic correctness |

🔷 9.4. Comparison Table

| Metric | Matches… | Position Sensitive? | Partial Credit | Semantic? |
|---|---|---|---|---|
| Exact Match | Whole sentence | ✅ Yes | ❌ No | ❌ No |
| Token Accuracy | Individual tokens | ✅ Yes | ✅ Yes | ❌ No |
| BLEU/ROUGE | n-gram overlap | ✅ Partially | ✅ Yes | 🔸 Partial |
| BERTScore | Token embeddings | ❌ No | ✅ Yes | ✅ Yes |

10. Pass@k

Pass@k (pronounced "pass at k") is a code-generation evaluation metric that measures how often at least one of $k$ generated code samples solves the problem (i.e., passes all test cases).


🔷 10.1. Context: Why Pass@k?

When language models (like Codex or GPT) are asked to generate code, their output may vary with sampling. Instead of judging just the first try, Pass@k answers:

If I sample $k$ solutions, what’s the probability that at least one passes all the test cases?

It’s widely used in code generation benchmarks like:

  • HumanEval (OpenAI)
  • MBPP (Google)

🔷 10.2. Definition

Let:

  • $n$: Total number of samples generated for a problem
  • $k$: Number of allowed guesses (usually $k = 1, 5, 10, …$)
  • $c$: Number of correct samples (i.e., those that pass all test cases)

If $n \geq k$, the estimated Pass@k is:

\[\text{Pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}\]

Intuition:

  • $\binom{n - c}{k}$: Number of ways to choose $k$ incorrect samples
  • $\binom{n}{k}$: Number of total ways to choose $k$ samples
  • The ratio is therefore the probability that all $k$ sampled solutions are incorrect; subtracting it from 1 gives the chance that at least one is correct

🔷 10.3. Simplified Cases

  • If $c = 0$ → no correct solutions → $\text{Pass@}k = 0$
  • If $c = n$ → all correct → $\text{Pass@}k = 1$
  • If $k = 1$ → reduces to accuracy over $n$ samples

🔷 10.4. When to Use It

| Use Case | Why Pass@k Helps |
|---|---|
| Code generation with sampling | Reflects the benefit of retries |
| Evaluating creativity in LMs | Models may succeed on retries |
| Human-in-the-loop coding | Models are used interactively |

🔷 10.5. Comparison to Other Metrics

| Metric | Measures… | Partial Credit | Uses Sampling |
|---|---|---|---|
| Accuracy | First-sample correctness | ❌ No | ❌ No |
| Token Accuracy | Token-level match | ✅ Yes | ❌ No |
| BLEU | Textual overlap (not correctness) | ✅ Yes | ❌ No |
| Pass@k | At least one correct in $k$ tries | ✅ Yes (at $k$) | ✅ Yes |

🔷 10.6. Python Implementation

from math import comb

def pass_at_k(n, c, k):
    # n: total samples generated, c: samples that pass all tests, k: allowed attempts
    if k > n:
        raise ValueError("k cannot exceed the number of generated samples n")
    if n - c < k:
        return 1.0  # any draw of k samples must contain at least one correct one
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

🔷 10.7. Example

Say for a coding problem:

  • You generate $n = 10$ samples
  • $c = 3$ of them pass all tests
  • You evaluate Pass@5
\[\text{Pass@5} = 1 - \frac{\binom{7}{5}}{\binom{10}{5}} = 1 - \frac{21}{252} \approx 0.917\]

→ About a 91.7% chance that at least one of the 5 sampled solutions solves the problem.
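
The implementation in 10.6 reproduces this: pass_at_k(10, 3, 5) evaluates to approximately 0.917.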


🔷 10.8. Notes on Use in Benchmarks

  • HumanEval by OpenAI reports Pass@1, Pass@10, etc.
  • Often averaged over all problems.
  • Typically assumes deterministic unit testing for correctness.
