Evaluation Metrics for Language Models
Language models are evaluated by their performance on tasks such as next-token prediction, classification, generation, and question answering. Below, we begin with a detailed explanation of perplexity, a fundamental metric, and then briefly cover other common metrics.
1. Perplexity (PPL)
Definition:
\[\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(x_i \mid x_{<i}) \right)\]Variables:
- $N$: Total number of predicted tokens
- $x_i$: The ground truth token at position $i$
- $x_{<i}$: All tokens before $i$
- $p(x_i \mid x_{<i})$: Model’s predicted probability for the ground truth token given the context
Ground Truth:
- In perplexity, the ground truth is always the next token that the model should predict.
- Example: For the sentence “The cat sat”, if input is “The”, then ground truth is “cat”.
Perplexity on a Batch:
Suppose you have a batch of $B$ sequences, each of length $T$. The model predicts the next token at each step. Perplexity is computed as:
\[\text{Perplexity}_{\text{batch}} = \exp\left(-\frac{1}{B \cdot T} \sum_{b=1}^{B} \sum_{t=1}^{T} \log p(x_t^{(b)} \mid x_{<t}^{(b)}) \right)\]Where:
- $B$: batch size
- $T$: number of predicted tokens per sequence
- $x_t^{(b)}$: the token at time step $t$ in sequence $b$
- $p(x_t^{(b)} \mid x_{<t}^{(b)})$: model’s predicted probability for the correct next token
Why Exponential?
- The inner sum is the average negative log-likelihood, which is in log-space.
- Taking $\exp$ maps it back to the original probability scale.
- This gives perplexity a more interpretable meaning: how many equally likely choices the model is guessing between on average.
Intuition:
- Perplexity measures how surprised the model is by the correct next tokens.
- Lower perplexity = better performance.
- A model that predicts uniformly over a vocabulary of size $V$ has perplexity $V$.
- If a model assigns high probability to the correct token at each step, perplexity is low.
- Common in language modeling benchmarks like WikiText or Penn Treebank.
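A minimal sketch of the batch formula above, assuming PyTorch and using random stand-in logits and targets (the shapes $B$, $T$, $V$ are made up for illustration):
```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch size B, sequence length T, vocabulary size V
B, T, V = 2, 4, 10
logits = torch.randn(B, T, V)           # stand-in for model next-token logits
targets = torch.randint(0, V, (B, T))   # stand-in for ground-truth next tokens

# Average negative log-likelihood over all B*T predicted tokens ...
nll = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="mean")
# ... exponentiated back to the probability scale
perplexity = torch.exp(nll)
print(f"Perplexity: {perplexity.item():.2f}")
```
In practice, padding positions are usually excluded from the average (for example via cross_entropy's ignore_index argument).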
Other Metrics (Brief Overview)
2. Accuracy
\[\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}\]- Used for classification tasks.
3. Precision, Recall, F1
\[\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]- Used when labels are imbalanced or multi-label.
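A quick sketch of these classification metrics on a made-up binary prediction vector, using scikit-learn as one possible tool:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground-truth labels and predictions (binary classification)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 4/6 ≈ 0.667
print(precision_score(y_true, y_pred))  # 2/2 = 1.0
print(recall_score(y_true, y_pred))     # 2/4 = 0.5
print(f1_score(y_true, y_pred))         # 2 * 1.0 * 0.5 / (1.0 + 0.5) ≈ 0.667
```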
4. BLEU
🔷 4.1. What is BLEU?
BLEU is an automatic evaluation metric that compares a candidate translation (generated by a model) against one or more reference translations (usually human-written). It is based on n-gram precision, with a brevity penalty to discourage overly short translations.
🔷 4.2. Core Intuition
BLEU tries to answer:
“How many phrases in the candidate translation appear in the reference translation(s)?”
It measures:
- n-gram overlap: How many unigrams, bigrams, trigrams, etc., from the candidate appear in the reference?
- Precision, not recall.
- Adjusted for sentence length via a brevity penalty.
🔷 4.3. Step-by-Step BLEU Calculation
Let’s denote:
- $C$: Candidate translation
- $R$: Reference translation(s)
📌 Step 1: n-gram Precision
For each $n \in \{1, 2, \ldots, N\}$:
- Extract all n-grams from candidate and references.
- Count how many n-grams in the candidate also appear in the reference(s) (with clipping).
✅ Clipping prevents the candidate from inflating precision by repeating the same n-gram.
📌 Step 2: Geometric Mean of Precisions
We compute the geometric mean of n-gram precisions up to $N$-gram (commonly $N=4$):
\[\text{GeoMean} = \exp\left( \sum_{n=1}^{N} w_n \cdot \log(\text{Precision}_n) \right)\]Where:
- $w_n$: weight for each $n$ (usually $w_n = \frac{1}{N}$)
📌 Step 3: Brevity Penalty (BP)
If candidate is shorter than reference, we penalize:
\[\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \leq r \end{cases}\]Where:
- $c$: length of candidate
- $r$: length of reference (or closest reference if multiple)
📌 Step 4: Final BLEU Score
\[\text{BLEU} = \text{BP} \cdot \text{GeoMean}\]🔷 4.4. Example
Candidate:
“the cat is on the mat”
Reference:
“there is a cat on the mat”
Unigram Precision:
- Candidate: [the, cat, is, on, the, mat]
- Reference: [there, is, a, cat, on, the, mat]
Counts:
- the (×2 in candidate) → appears once in reference → min(2, 1) = 1
- cat → in both → 1
- is → in both → 1
- on → in both → 1
- mat → in both → 1
Unigram precision:
\[P_1 = \frac{1+1+1+1+1}{6} = \frac{5}{6}\](Repeat for bigram, trigram, etc., to compute full BLEU)
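A minimal sketch of the clipped unigram-precision step for this example (not a full BLEU implementation; libraries such as sacrebleu or NLTK's sentence_bleu additionally handle higher-order n-grams, smoothing, and the brevity penalty):
```python
from collections import Counter

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Clip each candidate unigram count by its count in the reference
clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
p1 = clipped / len(candidate)
print(p1)  # 5/6 ≈ 0.833
```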
🔷 4.5. Use Cases
- Machine Translation
- Text Generation
- Summarization
🔷 4.6. Pros and Cons
✅ Pros | ❌ Cons |
---|---|
Fast and simple | Ignores synonyms and meaning |
Correlates with human judgment | Sensitive to exact wording |
Standard benchmark | Not ideal for open-ended generation |
🔷 4.7. Variants
- METEOR: Uses stemming, synonyms
- ROUGE: Better for summarization (recall-based)
- CHRF: Character n-gram F-score
- BLEURT, COMET: Use neural models
5. ROUGE
🔷 5.1. What is ROUGE?
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics designed to evaluate automatic summaries or text generation systems by comparing them to reference summaries.
Unlike BLEU, which focuses on precision, ROUGE is centered on recall — i.e., how much of the reference text is captured by the generated output.
🔷 5.2. ROUGE Variants and Intuition
There are multiple variants of ROUGE. The main ones are:
Metric | Measures | Type |
---|---|---|
ROUGE-N | Overlap of n-grams | Recall |
ROUGE-L | Longest common subsequence (LCS) | Sequence-level |
ROUGE-W | Weighted LCS | Sequence-level |
ROUGE-S | Skip-bigram | Pair-based |
We’ll focus on the most commonly used:
- ROUGE-1 (unigram recall)
- ROUGE-2 (bigram recall)
- ROUGE-L (longest common subsequence)
🔷 5.3. Formulas and Explanations
Let:
- $C$: Candidate (generated text)
- $R$: Reference (human-written text)
🔹 A. ROUGE-N (n-gram recall)
\[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in R} \min(\text{Count}_C(n\text{-gram}), \text{Count}_R(n\text{-gram}))}{\sum_{\text{n-gram} \in R} \text{Count}_R(n\text{-gram})}\]This is a recall-based metric:
- Numerator: reference n-grams that also appear in the candidate (with clipped counts)
- Denominator: total n-grams in reference
Intuition: What proportion of reference n-grams are covered by the candidate?
Example (ROUGE-1):
Reference: "the cat is on the mat"
Candidate: "the mat is clean"
Unigrams in reference: the, cat, is, on, the, mat
Unigrams in candidate: the, mat, is, clean
Overlap: the (1), is (1), mat (1)
→ count = 3
Total unigrams in reference = 6 → ROUGE-1 = $\frac{3}{6} = 0.5$
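A minimal sketch of this ROUGE-1 recall computation (clipped counts, as in the formula above):
```python
from collections import Counter

reference = "the cat is on the mat".split()
candidate = "the mat is clean".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)

# Sum over reference unigrams, clipping by how often each appears in the candidate
overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
print(overlap / len(reference))  # 3/6 = 0.5
```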
🔹 B. ROUGE-L (Longest Common Subsequence)
\[\text{LCS}(C, R) = \text{length of the longest common subsequence}\]Then:
- Recall: $R_{\text{lcs}} = \frac{\text{LCS}(C, R)}{|R|}$
- Precision: $P_{\text{lcs}} = \frac{\text{LCS}(C, R)}{|C|}$
- F1-score: $F_{\text{lcs}} = \frac{(1 + \beta^2) \cdot P_{\text{lcs}} \cdot R_{\text{lcs}}}{R_{\text{lcs}} + \beta^2 \cdot P_{\text{lcs}}}$
Typically, $\beta = 1$ → harmonic mean of precision and recall.
Intuition: Measures how well the in-order structure of the reference is preserved in the candidate.
Example:
Reference: "the cat is on the mat"
Candidate: "cat on mat"
LCS: "cat on mat"
→ length = 3
ROUGE-L recall = $\frac{3}{6}$, precision = $\frac{3}{3}$
F1 = $\frac{2 \cdot 1 \cdot 0.5}{1 + 0.5} = \frac{1}{1.5} \approx 0.667$
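A minimal sketch of ROUGE-L for this example, using a standard LCS dynamic program and $\beta = 1$:
```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat is on the mat".split()
candidate = "cat on mat".split()

lcs = lcs_length(candidate, reference)
recall = lcs / len(reference)     # 3/6 = 0.5
precision = lcs / len(candidate)  # 3/3 = 1.0
print(2 * precision * recall / (precision + recall))  # ≈ 0.667
```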
🔷 5.4. Use Cases of ROUGE
- Text summarization (primary use case)
- Story or article generation
- Question answering
- Paraphrase generation
🔷 5.5. Comparison: BLEU vs ROUGE
Property | BLEU | ROUGE |
---|---|---|
Focus | Precision (what is in candidate) | Recall (what is recovered from reference) |
Common Use | Machine Translation | Summarization |
Metric Type | n-gram precision | n-gram recall, LCS |
Penalizes Length? | Yes (brevity penalty) | No (but lower recall if short) |
Structure Capturing | No (except n-gram continuity) | Yes (via LCS in ROUGE-L) |
🔷 5.6. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Correlates well with human judgment | Surface-level overlap only |
Fast to compute | No semantic matching |
Flexible (supports multiple metrics) | Sensitive to paraphrasing |
🔷 5.7. Summary Table
Metric | Formula/Idea | Focus | Use Case |
---|---|---|---|
ROUGE-1 | Recall of unigrams | Recall | Summarization |
ROUGE-2 | Recall of bigrams | Recall | Summarization |
ROUGE-L | LCS-based F1 (sequence match) | Structure | Story generation |
6. METEOR
🔷 6.1. What is METEOR?
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a semantic-aware evaluation metric originally developed for machine translation, but also used for summarization and other text generation tasks.
It was designed to correlate better with human judgment than BLEU and ROUGE, especially at the sentence level.
🔷 6.2. High-Level Intuition
METEOR measures the similarity between a candidate sentence and one or more reference sentences using unigram matches, but:
- Considers exact matches, stems, synonyms, and paraphrases.
- Uses harmonic mean of precision and recall, favoring recall.
- Penalizes fragmentation in the word alignment to enforce fluency and order.
BLEU focuses only on precision, ROUGE on recall, but METEOR balances both and adds semantic and structural matching.
🔷 6.3. Matching Process
Matches between candidate (C) and reference (R) are based on:
- Exact match (word = word)
- Stem match (e.g., “run” vs “running”)
- Synonym match (e.g., “car” vs “automobile”)
- Paraphrase match (optional)
Matching is done once in order of preference (exact > stem > synonym > paraphrase).
🔷 6.4. METEOR Calculation Steps
🔹 A. Unigram Precision and Recall
Let:
- $m$: Number of matched unigrams
- $|C|$: Number of unigrams in the candidate
- $|R|$: Number of unigrams in the reference
Then unigram precision and recall are $P = \frac{m}{|C|}$ and $R = \frac{m}{|R|}$.
🔹 B. Harmonic Mean (F-score)
METEOR uses a weighted harmonic mean (favoring recall):
\[F_{\text{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P}\]By default, this weights recall nine times as heavily as precision (the weighting can be tuned).
🔹 C. Fragmentation Penalty
Let:
- $ch$: Number of contiguous chunks of matched words (in correct order)
- $m$: Total matches
Then penalty:
\[\text{Penalty} = 0.5 \cdot \left( \frac{ch}{m} \right)^{3}\]More chunks → more fragmented → higher penalty
🔹 D. Final METEOR Score
\[\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})\]🔷 6.5. Example
Reference: "the cat is on the mat"
Candidate: "the mat is on the cat"
Matched unigrams (counting each matched word type once, for simplicity): the, cat, is, on, mat
→ $m = 5$
Unigram precision $P = \frac{5}{6}$, recall $R = \frac{5}{6}$
Assume the matches are broken into 3 contiguous chunks: → $ch = 3, m = 5$
\[\text{Penalty} = 0.5 \cdot \left( \frac{3}{5} \right)^3 = 0.5 \cdot 0.216 = 0.108\] \[\text{METEOR} = 0.833 \cdot (1 - 0.108) \approx 0.743\]🔷 6.6. Advantages Over BLEU/ROUGE
Feature | METEOR | BLEU / ROUGE |
---|---|---|
Recall-sensitive | ✅ Yes | BLEU ❌ (precision), ROUGE ✅ |
Synonyms/stemming | ✅ Yes | ❌ No |
Chunk penalty | ✅ Yes | ❌ No |
Word order | ✅ Yes (via penalty) | BLEU partial, ROUGE-L only |
Sentence-level use | ✅ Good correlation | BLEU correlates poorly at sentence level |
🔷 6.7. When to Use METEOR
- Machine translation tasks
- Text summarization
- Sentence-level generation evaluation
- When you want semantic similarity and meaning-aware evaluation
🔷 6.8. Comparison Summary
Metric | Precision | Recall | Word Order | Semantics | Best Use |
---|---|---|---|---|---|
BLEU | ✅ | ❌ | Partial | ❌ | Translation (corpora) |
ROUGE | ❌ | ✅ | LCS-based | ❌ | Summarization |
METEOR | ✅ | ✅ | ✅ (chunks) | ✅ | Sentence-level tasks |
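As a quick practical check, the example from 6.5 can be scored with NLTK's meteor_score (a sketch; it assumes NLTK and its WordNet data are available, and NLTK's implementation differs in detail from the description above, so the number may not match the hand calculation exactly):
```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

reference = "the cat is on the mat".split()
candidate = "the mat is on the cat".split()

# Recent NLTK versions expect pre-tokenized references and hypothesis
print(round(meteor_score([reference], candidate), 3))
```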
7. BERTScore
BERTScore is a modern, semantic similarity metric for evaluating text generation models like machine translation, summarization, captioning, etc. It uses deep contextual embeddings (from BERT or similar transformers) to compare generated sentences to reference sentences, at the semantic level, not just surface-level n-gram overlap.
🔷 7.1. Why BERTScore?
Traditional metrics like BLEU, ROUGE, and METEOR:
- Rely on exact token overlap.
- Struggle with synonyms, paraphrasing, and semantic equivalence.
- Often fail to reflect human judgment for sentences that are fluent but worded differently.
BERTScore overcomes these by using pretrained language models to evaluate semantic similarity of words in context.
🔷 7.2. Key Intuition
Instead of matching n-grams or counting word overlap, BERTScore:
- Encodes candidate and reference sentences using BERT (or similar transformer).
- Compares embeddings of words in candidate and reference via cosine similarity.
- Uses precision, recall, and F1 based on these similarities.
🔷 7.3. Mathematical Formulation
Let:
- $\mathbf{C} = [\mathbf{c}_1, …, \mathbf{c}_m]$: embeddings of candidate words
- $\mathbf{R} = [\mathbf{r}_1, …, \mathbf{r}_n]$: embeddings of reference words
All embeddings come from a contextual encoder (like BERT), so they’re context-aware.
🔹 Step 1: Cosine Similarity Matrix
Compute a similarity matrix $S \in \mathbb{R}^{m \times n}$:
\[S_{ij} = \cos(\mathbf{c}_i, \mathbf{r}_j)\]This measures how similar each word in the candidate is to each word in the reference.
🔹 Step 2: Precision, Recall
- Precision: For each word in the candidate, take the maximum similarity with any reference word, then average: $\text{Precision} = \frac{1}{m} \sum_{i=1}^{m} \max_j S_{ij}$
- Recall: For each word in the reference, take the maximum similarity with any candidate word, then average: $\text{Recall} = \frac{1}{n} \sum_{j=1}^{n} \max_i S_{ij}$
🔹 Step 3: F1 Score
\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]This F1 is the BERTScore, and can be averaged over many sentence pairs.
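A toy sketch of the greedy-matching core (Steps 1–3), using random vectors in place of real BERT embeddings; the actual bert_score package adds options such as IDF weighting and baseline rescaling:
```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: m candidate tokens, n reference tokens, embedding dim d
m, n, d = 5, 7, 16
C = F.normalize(torch.randn(m, d), dim=-1)  # stand-in candidate token embeddings
R = F.normalize(torch.randn(n, d), dim=-1)  # stand-in reference token embeddings

S = C @ R.T                              # cosine similarity matrix (m x n)
precision = S.max(dim=1).values.mean()   # best reference match per candidate token
recall = S.max(dim=0).values.mean()      # best candidate match per reference token
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision.item():.3f} R={recall.item():.3f} F1={f1.item():.3f}")
```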
🔷 7.4. Example
Reference: “A quick brown fox jumps over the lazy dog”
Candidate: “A fast dark fox leaps over a sleepy dog”
Traditional metrics would score low due to low n-gram overlap. BERTScore would score high because:
- “quick” ≈ “fast”
- “jumps” ≈ “leaps”
- “lazy” ≈ “sleepy”
🔷 7.5. Features of BERTScore
Feature | Description |
---|---|
Uses contextual embeddings | Understands words in context |
Token-level matching | Matches words even if paraphrased |
Supports multilingual BERT | For cross-lingual tasks |
Tunable models | You can use RoBERTa, DeBERTa, etc. |
🔷 7.6. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Strong correlation with human judgment | Slower than BLEU/ROUGE |
Handles synonyms/paraphrases | Depends on quality of language model |
Context-aware comparison | Harder to interpret scores |
Good for long, fluent sentences | May overestimate fluency |
🔷 7.7. Comparison to BLEU/ROUGE/METEOR
Metric | Surface Match | Semantics | Context-Aware | Speed | Interpretable |
---|---|---|---|---|---|
BLEU | ✅ | ❌ | ❌ | ✅ Fast | ✅ (to a degree) |
ROUGE | ✅ | ❌ | ❌ | ✅ Fast | ✅ |
METEOR | ✅ + stems/syns | Limited | ❌ | ✅ | Moderate |
BERTScore | ❌ | ✅ | ✅ | ❌ Slower | ❌ |
🔷 7.8. Code Snippet (Python)
```python
from bert_score import score

candidate = ["A fast dark fox leaps over a sleepy dog"]
reference = ["A quick brown fox jumps over the lazy dog"]

P, R, F1 = score(candidate, reference, lang="en", model_type="bert-base-uncased")
print(f"Precision: {P.item():.4f}, Recall: {R.item():.4f}, F1: {F1.item():.4f}")
```
8. Exact Match (EM)
Exact Match (EM) is a binary evaluation metric that checks whether a model’s output exactly matches the reference answer — character for character or token for token — after normalization.
🔷 8.1. What Is Exact Match (EM)?
Let:
- $C$: Candidate/generated output
- $R$: Reference answer
Then over a dataset of $N$ samples:
\[\text{Exact Match Score} = \frac{1}{N} \sum_{i=1}^{N} \text{EM}(C_i, R_i)\]where $\text{EM}(C_i, R_i) = 1$ if the normalized candidate equals the normalized reference and $0$ otherwise. It is often expressed as a percentage.
🔷 8.2. Where It’s Used
Task | EM Usage |
---|---|
Question Answering | Does the predicted answer match exactly? |
Span Extraction (SQuAD) | Is the predicted span identical to the correct one? |
Code generation | Is the predicted code identical to the ground truth? |
🔷 8.3. Normalization Rules
To avoid penalizing trivial differences, normalization is usually applied:
Common Steps:
- Lowercase both strings
- Remove articles (a, an, the)
- Remove punctuation
- Remove extra whitespace
Example:
Candidate | Reference |
---|---|
“The Eiffel Tower” | “eiffel tower” |
→ After normalization: both become "eiffel tower"
→ EM = 1
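A minimal sketch of this normalization and the per-sample EM check (exact normalization steps vary by benchmark; this follows the common SQuAD-style rules listed above):
```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                        # remove articles
    text = "".join(ch for ch in text if ch not in string.punctuation)  # remove punctuation
    return " ".join(text.split())                                      # collapse whitespace

def exact_match(candidate: str, reference: str) -> int:
    return int(normalize(candidate) == normalize(reference))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
```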
🔷 8.4. Strengths and Weaknesses
✅ Pros | ❌ Cons |
---|---|
Simple and interpretable | Extremely strict |
Used in exact span tasks (e.g. SQuAD) | No partial credit |
No tuning or hyperparameters | Doesn’t account for synonyms/paraphrasing |
🔷 8.5. Comparison to Other Metrics
Metric | Partial Credit | Semantic Understanding | Use Case |
---|---|---|---|
Exact Match | ❌ No | ❌ No | Span-based QA, Code QA |
F1 (token overlap) | ✅ Yes | ❌ No | QA with multiple valid forms |
BLEU/ROUGE | ✅ Yes | ❌ Limited | Translation, summarization |
BERTScore | ✅ Yes | ✅ Yes | Semantic text generation tasks |
🔷 8.6. When To Use
- When only one correct answer is expected.
- When evaluating retrieval or span selection tasks.
- As a complementary metric with F1 or BLEU to gauge how often the model is perfectly right.
9. Token-Level Accuracy
Token-Level Accuracy is a fine-grained metric that measures how many individual tokens (words, subwords, or characters) in the model’s output match the reference tokens at the same positions.
It is useful when partial correctness matters — for example, in:
- text classification
- sequence labeling (like Named Entity Recognition)
- translation
- code generation
🔷 9.1. Definition
Let:
- $\mathbf{C} = [c_1, c_2, …, c_n]$: Candidate tokens
- $\mathbf{R} = [r_1, r_2, …, r_n]$: Reference tokens
Then:
\[\text{Token-Level Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(c_i = r_i)\]Where:
- $\mathbb{1}(c_i = r_i)$ is 1 if token $i$ matches, 0 otherwise
- $n$: Total number of tokens in the reference
✅ Example 1 (Machine Translation):
Reference: ["the", "cat", "is", "black"]
Candidate: ["the", "dog", "is", "black"]
→ Match: the, is, black → 3 out of 4
→ Token Accuracy = $\frac{3}{4} = 0.75$
✅ Example 2 (NER):
Reference Tags: ["O", "B-PER", "I-PER", "O"]
Predicted Tags: ["O", "B-PER", "O", "O"]
→ Match: 3 out of 4
→ Token Accuracy = $0.75$
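A minimal sketch covering both examples, assuming the candidate and reference are already tokenized and aligned (same length):
```python
def token_accuracy(candidate, reference):
    # Position-by-position comparison; sequences must be the same length
    assert len(candidate) == len(reference), "sequences must be aligned"
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)

print(token_accuracy(["the", "dog", "is", "black"],
                     ["the", "cat", "is", "black"]))  # 0.75
print(token_accuracy(["O", "B-PER", "O", "O"],
                     ["O", "B-PER", "I-PER", "O"]))   # 0.75
```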
🔷 9.2. Applications
Task | Token Match Used For |
---|---|
Named Entity Recognition | Match predicted tag per token |
Part-of-Speech Tagging | Match POS per word |
Machine Translation | Match target tokens |
Code Generation | Match tokens in predicted code |
Autocomplete / LM Tasks | Predict next token |
🔷 9.3. Strengths and Limitations
✅ Pros | ❌ Cons |
---|---|
Fine-grained feedback | Ignores sentence structure |
Useful for sequence tasks | Can reward grammatically incorrect but token-matching outputs |
Easy to compute and interpret | Doesn’t reflect semantic correctness |
🔷 9.4. Comparison Table
Metric | Matches… | Position Sensitive? | Partial Credit | Semantic? |
---|---|---|---|---|
Exact Match | Whole sentence | ✅ Yes | ❌ No | ❌ No |
Token Accuracy | Individual tokens | ✅ Yes | ✅ Yes | ❌ No |
BLEU/ROUGE | n-gram overlap | ✅ Partially | ✅ Yes | ❌/🔸 Partial |
BERTScore | Token embeddings | ❌ No | ✅ Yes | ✅ Yes |
10. Pass@k
Pass@k (pronounced “pass at k”) is a code generation evaluation metric that measures how often at least one out of $k$ generated code samples solves the problem (i.e., passes all test cases).
🔷 10.1. Context: Why Pass@k?
When language models (like Codex or GPT) are asked to generate code, their output may vary with sampling. Instead of judging just the first try, Pass@k answers:
If I sample $k$ solutions, what’s the probability that at least one passes all the test cases?
It’s widely used in code generation benchmarks like:
- HumanEval (OpenAI)
- MBPP (Google)
🔷 10.2. Definition
Let:
- $n$: Total number of samples generated for a problem
- $k$: Number of allowed guesses (usually $k = 1, 5, 10, …$)
- $c$: Number of correct samples (i.e., those that pass all test cases)
If $n \geq k$, the estimated Pass@k is:
\[\text{Pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}\]Intuition:
- $\binom{n - c}{k}$: Number of ways to choose $k$ incorrect samples
- $\binom{n}{k}$: Number of total ways to choose $k$ samples
- So: this is the probability that all $k$ are incorrect, and we subtract from 1 to get the chance that at least one is correct
🔷 10.3. Simplified Cases
- If $c = 0$ → no correct solutions → $\text{Pass@}k = 0$
- If $c = n$ → all correct → $\text{Pass@}k = 1$
- If $k = 1$ → reduces to accuracy over $n$ samples
🔷 10.4. When to Use It
Use Case | Why Pass@k is Helpful |
---|---|
Code generation with sampling | Reflects the benefit of retries |
Evaluating creativity in LMs | Models may succeed on retries |
Human-in-the-loop coding | Models are used interactively |
🔷 10.5. Comparison to Other Metrics
Metric | Measures… | Partial Credit | Uses Sampling |
---|---|---|---|
Accuracy | First-sample correctness | ❌ No | ❌ No |
Token Accuracy | Token-level match | ✅ Yes | ❌ No |
BLEU | Textual overlap (not correctness) | ✅ Yes | ❌ No |
Pass@k | At least one correct in k tries | ✅ Yes (at k) | ✅ Yes |
🔷 10.6. Python Implementation
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n generated samples, c of which are correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of generated samples n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate correctly becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 5), 3))  # ≈ 0.917 (see the example below)
```
🔷 10.7. Example
Say for a coding problem:
- You generate $n = 10$ samples
- $c = 3$ of them pass all tests
- You evaluate Pass@5
\[\text{Pass@}5 = 1 - \frac{\binom{7}{5}}{\binom{10}{5}} = 1 - \frac{21}{252} \approx 0.917\]→ about a 91.7% chance that at least one of the 5 samples solves the problem.
🔷 10.8. Notes on Use in Benchmarks
- HumanEval by OpenAI uses Pass@1, Pass@10, etc.
- Often averaged over all problems.
- Typically assumes deterministic unit testing for correctness.