Semantic Embedding, Clustering and Search

A standard transformer model like BERT is excellent at understanding words in context (token-level embeddings), but it is not inherently designed to create a single, meaningful vector for an entire sentence that can be easily compared with others. Sentence-Transformers solve this problem by fine-tuning these models to produce high-quality, sentence-level embeddings.

Example Data

The training data format depends on the training objective.

  • For Supervised Training (on similarity):
    • Sentence A: “The weather in San Jose is sunny today.”
    • Sentence B: “It is bright and clear in San Jose right now.”
    • Target Label (y): 1.0 (indicating high similarity)
  • For Self-Supervised Training (using triplets):
    • Anchor: “What is the capital of California?”
    • Positive (Similar): “Sacramento is the capital of California.”
    • Negative (Dissimilar): “The Golden Gate Bridge is in San Francisco.”
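In the sentence-transformers library, both formats are commonly packaged as InputExample objects: a sentence pair with a float label for the supervised case, and a label-free triplet for the self-supervised case. The snippet below is a minimal sketch of that packaging, reusing the example sentences above.

```python
from sentence_transformers import InputExample

# Supervised pair: two sentences plus a human-provided similarity score
pair_example = InputExample(
    texts=["The weather in San Jose is sunny today.",
           "It is bright and clear in San Jose right now."],
    label=1.0,
)

# Self-supervised triplet: (anchor, positive, negative), no label needed
triplet_example = InputExample(
    texts=["What is the capital of California?",
           "Sacramento is the capital of California.",
           "The Golden Gate Bridge is in San Francisco."],
)
```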

Use Case Scenario

The goal is to convert a sentence into a fixed-size numerical vector (an embedding) where sentences with similar meanings have vectors that are close together in a high-dimensional space. This enables powerful applications:

  • Semantic Search: A user on a support website asks, "How do I reset my password if I lost my 2FA device?" The system converts this query into an embedding and instantly finds the most semantically similar questions and answers in its knowledge base, providing the exact solution (see the code sketch after this list).
  • Duplicate Detection: A platform like Stack Overflow can identify if a newly asked question is a semantic duplicate of an existing one, even if they use different words.
  • Clustering: Grouping thousands of news articles, user reviews, or documents by the topic or event they describe. This is a core component of modern RAG (Retrieval-Augmented Generation) systems.
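To make the semantic-search scenario concrete, the sketch below embeds a tiny knowledge base and a user query with a pre-trained checkpoint and ranks the entries by cosine similarity. The model name all-MiniLM-L6-v2, the corpus, and the printed scores are illustrative assumptions, not part of any specific product.

```python
from sentence_transformers import SentenceTransformer, util

# Any pre-trained Sentence-Transformer checkpoint can stand in here
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "To reset your password without a 2FA device, contact support to verify your identity.",
    "You can change your email address from the account settings page.",
    "Refunds are processed within 5-7 business days.",
]
query = "How do I reset my password if I lost my 2FA device?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank the knowledge-base entries by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```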

How It Works: A Mini-Tutorial
Architecture

The architecture is a two-stage process:

  1. Transformer Base: A sentence is fed into a pre-trained transformer model like BERT or RoBERTa. This model uses bi-directional attention to process the entire sentence and outputs a contextualized vector for every token.
  2. Pooling Layer: To get a single vector for the entire sentence, the output token vectors are “pooled.” The most common method is mean pooling, where you simply take the element-wise average of all the token vectors to produce one fixed-size sentence embedding.
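The pooling step is simple enough to show directly. The sketch below runs sentences through a plain BERT checkpoint with Hugging Face transformers and mean-pools the token vectors; bert-base-uncased is only an illustrative base model, and a Sentence-Transformer adds fine-tuning on top of exactly this pattern.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    ["The weather in San Jose is sunny today.",
     "It is bright and clear in San Jose right now."],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state   # (batch, seq_len, 768)

sentence_embeddings = mean_pool(token_embeddings, encoded["attention_mask"])
print(sentence_embeddings.shape)                             # (batch, 768)
```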

The Training Phase

This is the most critical part. The model is fine-tuned to ensure the final pooled embeddings are semantically meaningful. This can be done with or without human-provided labels.

1. Supervised Fine-Tuning (Requires Labels)

This is the most common approach for state-of-the-art performance, where the model learns to mimic human judgments.

  • Process: A Siamese Network structure is used. Two sentences, s_1 and s_2, are passed through the exact same Sentence-Transformer model (with shared weights) to produce two embeddings, u and v.
  • Loss Function: Cosine Similarity Loss. The goal is to make the model’s predicted similarity score match a human-provided score y.
    • Normalization: For cosine similarity to work correctly, the embedding vectors u and v must be normalized to have a length of 1. This can be done as a final layer in the model or as a step within the loss function.
    • Mathematical Formulation: The cosine similarity between two vectors is their dot product divided by the product of their magnitudes: \(\text{cos\_sim}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}\). The model is trained to minimize a regression loss, typically Mean Squared Error (MSE), between its predicted similarity and the ground-truth label y.
\[L_{\text{cosine}} = \frac{1}{N} \sum_{i=1}^{N} \left(\text{cos\_sim}(u_i, v_i) - y_i\right)^2\]
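A minimal PyTorch sketch of this loss, assuming u and v are batches of sentence embeddings produced by the shared encoder and y holds the gold similarity scores (the 384-dimensional size is only illustrative):

```python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(u, v, y):
    """MSE between the predicted cosine similarity of each pair and the gold score y."""
    u = F.normalize(u, p=2, dim=1)       # unit-length vectors, so the dot product
    v = F.normalize(v, p=2, dim=1)       # below equals cosine similarity
    pred = (u * v).sum(dim=1)            # cos_sim(u_i, v_i) for every pair in the batch
    return F.mse_loss(pred, y)           # (1/N) * sum_i (pred_i - y_i)^2

# Toy batch: 4 sentence pairs with 384-dimensional embeddings
u, v = torch.randn(4, 384), torch.randn(4, 384)
y = torch.tensor([1.0, 0.2, 0.8, 0.0])
print(cosine_similarity_loss(u, v, y))
```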

2. Self-Supervised / Unsupervised Fine-Tuning (No y score needed)

This approach uses the structure of the data itself to create learning signals. The Triplet Network is a classic example.

  • Process: A Triplet Network structure is used. Three sentences—an anchor (a), a positive (p), and a negative (n)—are passed through the same model to get three embeddings: a, p, and n.
  • Loss Function: Triplet Loss. The goal is to “push” the negative embedding away from the anchor and “pull” the positive embedding closer. The loss function ensures that the anchor-to-positive distance is smaller than the anchor-to-negative distance by at least a certain margin (α).
    • Mathematical Formulation: Using the Euclidean distance \(d(x, y) = \|x - y\|_2\), the objective is \(d(a, p) + \alpha < d(a, n)\). The loss function that enforces this is:
\[L_{\text{triplet}} = \max(0, d(a, p)^2 - d(a, n)^2 + \alpha)\]
    • The loss is zero if the condition is already met. Otherwise, the model is penalized, forcing it to adjust the embedding space correctly.
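A minimal PyTorch sketch of the same objective, assuming a, p, and n are batches of anchor, positive, and negative embeddings and using squared Euclidean distances as in the formula above:

```python
import torch
import torch.nn.functional as F

def triplet_loss(a, p, n, margin=1.0):
    """max(0, d(a, p)^2 - d(a, n)^2 + margin), averaged over the batch."""
    d_ap = (a - p).pow(2).sum(dim=1)     # squared distance anchor -> positive
    d_an = (a - n).pow(2).sum(dim=1)     # squared distance anchor -> negative
    return F.relu(d_ap - d_an + margin).mean()

# Toy batch of anchor / positive / negative embeddings (sizes are illustrative)
a, p, n = (torch.randn(4, 384) for _ in range(3))
print(triplet_loss(a, p, n))
```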

The Inference Phase

Once fine-tuned, using the model is simple and fast:

  1. Input: A new sentence.
  2. Process: Feed the sentence through the fine-tuned Sentence-Transformer.
  3. Output: A single, fixed-size vector embedding that captures the sentence’s meaning, ready to be used for search, clustering, or other downstream tasks.
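With the sentence-transformers library, this whole inference phase is a single encode call. In the sketch below, all-MiniLM-L6-v2 is a public checkpoint standing in for whatever model you fine-tuned, and the example sentences are only illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Public checkpoint used as a stand-in for your own fine-tuned model
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("How do I reset my password if I lost my 2FA device?")
print(embedding.shape)   # a fixed-size vector, (384,) for this checkpoint

# The same vectors plug directly into similarity search, clustering, or a vector DB
others = model.encode(["Recovering an account after losing the 2FA phone.",
                       "The Golden Gate Bridge is in San Francisco."])
print(util.cos_sim(embedding, others))  # the paraphrase should score far higher
```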