Machine Translation
Example Data
The data consists of parallel sentences: the same sentence in a source language and its human-written translation in a target language.
- Input (Source Language - English):
“The nearest Caltrain station is Diridon Station.”
- Target (Target Language - French):
“La gare Caltrain la plus proche est la gare Diridon.”
Use Case Scenario
The goal is to automatically translate text from one language to another, breaking down language barriers for communication, business, and information access.
- Real-time Communication: A tourist in San Jose uses Google Translate on their phone to ask for directions in English, and the app speaks the translated question in Spanish to a local resident.
- Global Business: A company translates its technical documentation and marketing materials from English into Japanese and German to reach international markets.
- Information Access: A researcher translates a scientific paper from Chinese into English to stay current with global developments in their field.
How It Works: A Mini-Tutorial
Machine Translation is the original, quintessential sequence-to-sequence (seq2seq) task that inspired the creation of the Transformer architecture.
Approach 1: Encoder-Decoder Models (The Classic & Gold Standard)
Models like the original Transformer and MarianMT are purpose-built for translation, and general-purpose encoder-decoder models such as T5 handle it as one of their core tasks; this family is considered the gold standard for translation.
Why the Encoder-Decoder Architecture is a Natural Fit
This architecture’s “read then write” division of labor is a perfect match for translation:
- The Encoder’s Job: To read the entire source sentence (e.g., in English) using bi-directional attention. Its goal is to create a rich, language-agnostic representation of the sentence’s meaning, compressing it into a set of contextual vectors.
- The Decoder’s Job: To generate a new sentence in the target language (e.g., French). It does this autoregressively, using cross-attention to constantly refer back to the encoder’s representation of the source sentence. This ensures the translation is faithful to the original meaning.
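To make this division of labor concrete, here is a minimal sketch using PyTorch’s built-in nn.Transformer with toy dimensions and random embeddings. It is an untrained stand-in, not a real translation model, but it shows where the encoder’s output feeds the decoder’s cross-attention.

```python
# A minimal sketch of the "read then write" split, using PyTorch's built-in
# nn.Transformer with toy dimensions and random embeddings (an untrained
# stand-in, not a real translation model).
import torch
import torch.nn as nn

d_model = 32
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 9, d_model)   # embedded English tokens (9 source positions)
tgt = torch.randn(1, 11, d_model)  # embedded French tokens (11 target positions)

# Encoder: bi-directional attention over the entire source sentence.
memory = model.encoder(src)

# Decoder: causal self-attention over the target plus cross-attention to `memory`.
causal_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model.decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 11, 32]): one contextual vector per target position
```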
The Training Phase ⚙️
- Encoding: The source sentence (English) is tokenized and fed into the Encoder.
- Decoding: The target sentence (French) is fed to the Decoder for teacher forcing. At each step, the decoder uses cross-attention to consult the encoder’s output, allowing it to align words and concepts (e.g., aligning “station” with “gare”).
- Loss Function: A standard Cross-Entropy Loss is calculated between the decoder’s predicted French tokens and the actual French tokens from the target sentence.
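Here is a minimal sketch of one such training step, assuming a recent version of the Hugging Face transformers library and the public Helsinki-NLP/opus-mt-en-fr MarianMT checkpoint; passing labels makes the model handle teacher forcing internally and return the cross-entropy loss.

```python
# A minimal training-step sketch with a pretrained MarianMT checkpoint.
# Assumes a recent Hugging Face transformers install (for `text_target`).
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

src = "The nearest Caltrain station is Diridon Station."
tgt = "La gare Caltrain la plus proche est la gare Diridon."

# `text_target` tokenizes the French side and stores it under "labels".
batch = tokenizer(src, text_target=tgt, return_tensors="pt")

# With `labels` present, the model shifts them internally (teacher forcing)
# and returns the token-level cross-entropy loss.
outputs = model(**batch)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
print(float(outputs.loss))
```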
The Inference Phase 💡
The English sentence is encoded once. The decoder then starts with a [START] token and generates the French translation token by token until it produces an [END] token.
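A minimal inference sketch under the same assumptions (Hugging Face transformers plus the Helsinki-NLP/opus-mt-en-fr checkpoint); generate takes care of the start token, the autoregressive loop, and stopping at the end-of-sequence token.

```python
# A minimal inference sketch: encode once, then decode token by token.
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

source = "The nearest Caltrain station is Diridon Station."
inputs = tokenizer(source, return_tensors="pt")

# generate() runs the encoder once, then loops the decoder until EOS.
generated_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```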
Approach 2: Decoder-Only Models (The Modern Generalist)
Large language models like GPT and Llama can perform translation by treating it as an instruction-following task.
Why This Works
Through extensive fine-tuning, these models learn the pattern of translation from prompts. They understand that when given text in one language and asked to translate it to another, their task is to generate the corresponding text in the target language.
The Training Phase ✍️
- Input Formatting: The source and target sentences are concatenated into a single sequence using a prompt template:
  “Translate the following English text to French: The nearest Caltrain station is Diridon Station. French translation: La gare Caltrain la plus proche est la gare Diridon.”
- Handling Unequal Lengths: This is the crucial part. The English input and French output have different token lengths. How does the model handle this during training? The answer lies in the Masked Loss Function.
  - Step 1: A Single Sequence: The model doesn’t see two separate sentences; it sees one long, continuous sequence of tokens: [Prompt Tokens] + [Target Tokens].
  - Step 2: Causal Masking: A standard causal mask is applied. When predicting any token, the model can only see the tokens that came before it in this long sequence.
  - Step 3: Intelligent Loss Calculation: We use a “loss mask” to tell the model which predictions to learn from (see the sketch after this list).
    - For all the tokens that are part of the prompt and the source sentence (e.g., "Translate...Diridon Station."), the loss is ignored. We don’t need to teach the model how to write the prompt.
    - The Cross-Entropy Loss is only calculated for the tokens in the target sentence (e.g., "La gare...gare Diridon.").

In short, the unequal length doesn’t matter. The model is simply performing next-token prediction on a long sequence, and the loss mask cleverly focuses the learning only on the part of the sequence we want it to generate (the translation).
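Here is a minimal sketch of that loss mask, assuming the Hugging Face transformers library; gpt2 is used purely as a small stand-in checkpoint so the snippet runs, not as a model you would actually fine-tune for translation.

```python
# A minimal loss-mask sketch for decoder-only training.
# gpt2 is a small stand-in checkpoint, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Translate the following English text to French: "
          "The nearest Caltrain station is Diridon Station. French translation:")
target = " La gare Caltrain la plus proche est la gare Diridon."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

# Step 1: one long sequence of [Prompt Tokens] + [Target Tokens].
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Step 3: the loss mask. -100 is the ignore index for cross-entropy, so only
# the French tokens contribute to the loss. (The causal mask from Step 2 is
# applied automatically inside the model.)
labels = input_ids.clone()
labels[:, : prompt_ids.size(1)] = -100

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
print(float(loss))
```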
The Inference Phase 🗣️
- The new English sentence is placed into the prompt template, stopping right after “French translation:”.
- This entire text is fed to the model as a prefix.
- The model generates the translation autoregressively, token by token, until it completes the sentence and outputs an end-of-sequence token.
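A minimal sketch of this prompt-and-continue flow, again using gpt2 as a stand-in so the snippet runs end to end; a real setup would load an instruction-tuned LLM instead. The key detail is that the prompt stops right after “French translation:” and only the newly generated tokens are decoded.

```python
# A minimal prompt-based inference sketch for a decoder-only model.
# gpt2 stands in for an instruction-tuned LLM so the code runs as-is.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The prompt ends right after "French translation:"; the continuation the
# model generates is taken as the translation.
prompt = ("Translate the following English text to French: "
          "The nearest Caltrain station is Diridon Station. French translation:")

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60,
                            pad_token_id=tokenizer.eos_token_id)

# Keep only the newly generated tokens (everything after the prompt prefix).
new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```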