A Deep Dive into SwiGLU: The Modern FFN for High-Performance Transformers
While Mixture of Experts (MoE) offers a path to scale models to enormous parameter counts via sparsity, another critical innovation has optimized the performance of the core “dense” Transformer block itself. This is the Gated Linear Unit (GLU), and specifically its most successful variant, SwiGLU.
As of mid-2025, SwiGLU has replaced the traditional FFN in most state-of-the-art dense language models (like Meta’s Llama series, Google’s PaLM, and Mistral’s dense models). This tutorial explores the intuition, mathematics, and implementation behind this powerful and efficient architecture.
1. The Intuition: A Dynamic Information Filter
To understand SwiGLU’s advantage, we must first see the limitation of the standard FFN.
- A Standard FFN is a "Blunt Instrument": It projects a token's representation into a larger space and uniformly applies a non-linear function like ReLU or GELU. This function can scale or suppress features, but it does so based on a fixed rule, regardless of the token's specific context. It's like applying a single, static photo filter to every image.
- SwiGLU is a "Smart Mixing Board": SwiGLU introduces a gating mechanism, which acts like a dynamic, fine-grained filter for information. Think of it like a professional audio mixing board for each token.
- The “Value” Path: One part of the network computes the primary content or signal. This is like the raw audio track—the melody, the harmony, the core information.
- The “Gate” Path: A second, parallel part of the network acts as the faders and knobs on the mixing board. It doesn’t process the main content; instead, it looks at the input token and decides which parts of the content are important and should be amplified, and which parts are noise and should be suppressed.
- The Gating Operation: The gate's output (a vector of per-feature values, between 0 and 1 for a sigmoid gate) is multiplied element-wise with the value vector. If a gate value is near 0, the corresponding feature in the value path is silenced. If it is near 1, the feature passes through at full volume.
This mechanism gives the network the ability to dynamically control the flow of information for each token, leading to a much more expressive and efficient use of its parameters. It learns not just what to compute, but how much of each computation to actually use.
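To make the mixing-board analogy concrete, here is a tiny numeric sketch of the gating operation itself (the vectors are made up purely for illustration):

```python
import torch

# Illustrative values only: one token's "value" features and its per-feature gate.
value = torch.tensor([2.0, 1.5, 0.7, 3.0])   # raw features proposed by the value path
gate = torch.tensor([1.0, 0.0, 0.5, 0.9])    # 1 = pass at full volume, 0 = silence

gated = gate * value                          # element-wise gating
print(gated)                                  # tensor([2.0000, 0.0000, 0.3500, 2.7000])
```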
2. Architectural Anatomy and Mathematical Formulation
SwiGLU’s architecture is a departure from the simple two-layer MLP of the vanilla FFN.
Vanilla FFN Structure (Recap)
For a single token input $x \in \mathbb{R}^{d_{model}}$, the structure is a monolithic block: \(y_{FFN}(x) = \text{GELU}(xW_1)W_2\)
- It uses two weight matrices, $W_1$ and $W_2$.
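For later comparison with the SwiGLU module in Section 5, here is a minimal sketch of this baseline block (the class name `VanillaFFN` is ours, not taken from any particular codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaFFN(nn.Module):
    """Baseline two-matrix FFN: y = GELU(x W1) W2, typically with d_ff = 4 * d_model."""
    def __init__(self, d_model: int, d_ff: int, bias: bool = False):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=bias)  # up-projection (W1)
        self.w2 = nn.Linear(d_ff, d_model, bias=bias)  # down-projection (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same fixed GELU non-linearity is applied to every token.
        return self.w2(F.gelu(self.w1(x)))             # (B, L, d_model)
```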
SwiGLU FFN Structure (Gated)
The SwiGLU architecture uses three weight matrices and splits the up-projection into two parallel paths.
Let the input for a batch of sequences be $x \in \mathbb{R}^{B \times L \times d_{model}}$. The intermediate dimension is $d_{ff}$.
- Parallel Up-Projections: The input $x$ is projected twice, in parallel, by two different weight matrices, $W_1$ and $W_2$, both of shape $\mathbb{R}^{d_{model} \times d_{ff}}$.
  - Gate Path: $G = xW_1$
  - Value Path: $V = xW_2$
  Both $G$ and $V$ are tensors of shape $\mathbb{R}^{B \times L \times d_{ff}}$.
- The Gating Mechanism: The core of the architecture. The most common formulation (used in Llama) applies the Swish (also called SiLU) activation to the gate path and multiplies the result element-wise with the value path.
\(H = \text{SiLU}(G) \odot V\)
- $\text{SiLU}(x) = x \cdot \text{sigmoid}(x)$ is a smooth, non-monotonic activation function that performs well as a gate.
- $\odot$ denotes element-wise multiplication (Hadamard product).
- The resulting gated representation $H$ still has the shape $\mathbb{R}^{B \times L \times d_{ff}}$.
- Down-Projection: A final linear layer with weight matrix $W_3 \in \mathbb{R}^{d_{ff} \times d_{model}}$ projects the gated representation back to the model’s dimension. \(y_{SwiGLU} = HW_3\)
The full formula is: \(y_{SwiGLU}(x) = (\text{SiLU}(xW_1) \odot (xW_2)) W_3\)
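The formula maps directly onto three matrix multiplications and one element-wise product. A minimal functional sketch with random weights and toy dimensions, mirroring the equation term by term:

```python
import torch
import torch.nn.functional as F

B, L, d_model, d_ff = 2, 5, 16, 42      # toy sizes, for illustration only
x = torch.randn(B, L, d_model)
W1 = torch.randn(d_model, d_ff)         # gate projection
W2 = torch.randn(d_model, d_ff)         # value projection
W3 = torch.randn(d_ff, d_model)         # down-projection

y = (F.silu(x @ W1) * (x @ W2)) @ W3    # (SiLU(x W1) ⊙ (x W2)) W3
print(y.shape)                          # torch.Size([2, 5, 16])
```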
3. Full Complexity Analysis
SwiGLU’s efficiency is not about having fewer FLOPs per se, but about achieving better performance for a given number of parameters.
Parameter Count
- The architecture uses three weight matrices: $W_1$, $W_2$, and $W_3$. For simplicity, modern implementations often omit biases. \(P_{SwiGLU} = (d_{model} \cdot d_{ff}) + (d_{model} \cdot d_{ff}) + (d_{ff} \cdot d_{model})\)
- Practical Consideration: A standard FFN with $d_{ff} = 4d_{model}$ has $P_{FFN} \approx 2 \cdot d_{model} \cdot (4d_{model}) = 8d_{model}^2$ parameters. The SwiGLU FFN has $P_{SwiGLU} \approx 3 \cdot d_{model} \cdot d_{ff}$ parameters. To keep the parameter count the same as the baseline, models like Llama use a smaller intermediate dimension, typically $d_{ff} = \frac{2}{3} \cdot (4d_{model}) \approx 2.67 \cdot d_{model}$. This makes the total parameters nearly identical: $3 \cdot d_{model} \cdot (\frac{8}{3}d_{model}) = 8d_{model}^2$.
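A quick arithmetic check of this parameter matching, using dimensions in the ballpark of Llama-7B ($d_{model} = 4096$, $d_{ff} = 11008$) as an illustrative example:

```python
d_model = 4096
d_ff_vanilla = 4 * d_model               # 16384
d_ff_swiglu = 11008                      # ≈ (2/3) * 4 * d_model, rounded to a hardware-friendly size

p_vanilla = 2 * d_model * d_ff_vanilla   # two matrices: W1, W2
p_swiglu = 3 * d_model * d_ff_swiglu     # three matrices: W1, W2, W3

print(p_vanilla)   # 134217728
print(p_swiglu)    # 135266304 -- within about 1% of the baseline
```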
Computational Complexity (FLOPs)
- Up-Projections ($W_1, W_2$): There are two parallel projections, each costing $\approx B \cdot L \cdot 2 \cdot d_{model} \cdot d_{ff}$. Total up-projection cost is $\approx B \cdot L \cdot 4 \cdot d_{model} \cdot d_{ff}$.
- Down-Projection ($W_3$): This costs $\approx B \cdot L \cdot 2 \cdot d_{ff} \cdot d_{model}$.
- Total FLOPs: \(FLOPs_{SwiGLU} \approx B \cdot L \cdot (4 \cdot d_{model} \cdot d_{ff} + 2 \cdot d_{ff} \cdot d_{model}) = B \cdot L \cdot 6 \cdot d_{model} \cdot d_{ff}\)
The Trade-off: For the same intermediate dimension $d_{ff}$, SwiGLU requires ~1.5x the FLOPs of a standard FFN. However, its superior expressiveness and empirical performance mean that a model can achieve better results with a smaller $d_{ff}$ or fewer total layers, often making it more computationally efficient overall for a target performance level.
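A small helper makes the comparison concrete: at the same $d_{ff}$ the ratio is 1.5x, while at the parameter-matched $d_{ff}$ the FLOPs are again roughly equal (the function names below are ours):

```python
def vanilla_ffn_flops(B, L, d_model, d_ff):
    # Two matmuls, counting a multiply-add as 2 FLOPs: 4 * B * L * d_model * d_ff
    return B * L * (2 * d_model * d_ff + 2 * d_ff * d_model)

def swiglu_flops(B, L, d_model, d_ff):
    # Two up-projections plus one down-projection: 6 * B * L * d_model * d_ff
    return B * L * (2 * 2 * d_model * d_ff + 2 * d_ff * d_model)

B, L, d_model = 1, 2048, 4096
same_d_ff = swiglu_flops(B, L, d_model, 4 * d_model) / vanilla_ffn_flops(B, L, d_model, 4 * d_model)
matched = swiglu_flops(B, L, d_model, 11008) / vanilla_ffn_flops(B, L, d_model, 4 * d_model)
print(same_d_ff)   # 1.5
print(matched)     # ~1.01
```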
4. Why Is SwiGLU Better than a Standard FFN? Doesn't an FFN Also Learn to Give Less Weight to Less Important Tokens? What Is the Difference?
The core difference is static transformation vs. dynamic, input-dependent control.
Let’s break this down.
Standard FFN: A Static Filter
Think of the standard FFN block (e.g., $\text{ReLU}(xW_1)W_2$) as a static filter.
- It learns two weight matrices, $W_1$ and $W_2$, during training.
- Once trained, these matrices are fixed.
- Every single token that passes through this FFN block is transformed by the exact same $W_1$ and $W_2$.
You are right that the output is different for each token, but that’s only because the input token ($x$) is different. The transformation itself is rigid.
Imagine you have a single, sophisticated camera lens filter (e.g., one that sharpens blues and softens reds). You apply this same filter to every photo you take. A photo of the ocean and a photo of a rose will look different, but the way the filter changes the photo is identical in both cases. The FFN is this fixed filter. It learns a single, generalized way to process information.
SwiGLU: A Dynamic, Intelligent Gate
SwiGLU (Swish-Gated Linear Unit) operates very differently. It creates a filter that is dynamically generated based on the content of the token itself.
Let’s look at the mechanics. A simplified GLU takes an input $x$ and passes it through two separate linear layers, $W$ and $V$:
\[\text{GLU}(x, W, V) = (xW) \odot \sigma(xV)\]
- $xW$: This is the "content" or "value" path, similar to the first linear layer in a standard FFN. It proposes a possible set of features.
- $xV$: This is the “gate” path. It’s a separate linear layer whose sole purpose is to create a gate.
- $\sigma(xV)$: An activation function (sigmoid in the original GLU, Swish in SwiGLU) is applied to the gate path. Sigmoid squashes the output strictly into (0, 1); Swish is unbounded above but still drives negative pre-activations toward zero. Either way, the result is a set of values that act as soft switches.
- $\odot$: This is element-wise multiplication (the same Hadamard product as in Section 2). The "content" ($xW$) is multiplied by the "gate" ($\sigma(xV)$).
This is where the magic happens. The gate, $\sigma(xV)$, decides which elements of the content, $xW$, should be passed through and which should be suppressed.
- If an element in the gate is close to 0, it “closes the gate” for the corresponding feature in the content, effectively zeroing it out.
- If an element in the gate is close to 1, it “opens the gate,” allowing the feature to pass through unchanged.
This means the transformation is no longer static. It is conditioned on the input token itself.
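A tiny sketch of this input-dependence, using the simplified sigmoid-gated GLU above: the weight matrices are fixed after training, yet the effective filter differs for every input (toy dimensions and random vectors, purely for illustration):

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 8, 16
W = torch.randn(d_model, d_ff)        # "content" projection, fixed after training
V = torch.randn(d_model, d_ff)        # "gate" projection, also fixed

token_a = torch.randn(d_model)        # stand-ins for two different tokens / contexts
token_b = torch.randn(d_model)

gate_a = torch.sigmoid(token_a @ V)   # per-feature gate computed from token_a
gate_b = torch.sigmoid(token_b @ V)   # a different gate, from the same fixed weights

out_a = (token_a @ W) * gate_a        # content filtered by token_a's own gate
out_b = (token_b @ W) * gate_b
print(gate_a[:4])                     # the filters themselves differ per input
print(gate_b[:4])
```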
So, What is the Real Benefit?
Let's go back to your core question: "If a token is less important, the FFN's weights will already give its features less importance. Isn't that exactly what SwiGLU is doing?"
Not quite. Here’s the crucial distinction:
- A standard FFN has to learn a single set of weights ($W_1, W_2$) that works, on average, for all possible tokens. It cannot dedicate parts of its learned function to specific contexts. It might learn that, in general, certain features are less important, but it can’t decide that a feature is important for the token “queen” in “queen of England” but irrelevant for the token “queen” in “queen bee”.
- SwiGLU can make that distinction. The gate ($xV$) for the token “queen” in the first sentence will be different from the gate for “queen” in the second. It learns to recognize the context of the token and dynamically creates a gate that says, “Okay, for this specific token in this context, these potential features in $xW$ are highly relevant, so let’s amplify them. And these other features are just noise, so let’s suppress them entirely.”
The benefits are profound:
- Token-Specific Feature Selection: SwiGLU doesn’t just learn a general transformation; it learns a rule for how to create a specific transformation for each token. This is far more expressive.
- Noise Reduction: FFNs can sometimes introduce noise. The expansion layer might create features that aren’t useful for a particular token. SwiGLU’s gate can learn to prune these irrelevant features on the fly, leading to a cleaner signal being passed to the next layer.
- Improved Training: Gated mechanisms can lead to better gradient flow during backpropagation, making the network easier to train and allowing it to converge on better solutions. Empirical results consistently show that GLU variants like SwiGLU outperform standard FFNs for the same parameter count or computational budget.
In short, a standard FFN learns one fixed tool and applies it to every token. SwiGLU learns how to build a specialized tool for each token it encounters. This dynamic, content-aware filtering is a significant step up in modeling power and a key reason why architectures like Llama use it.
5. Code Implementation
This PyTorch module implements the SwiGLU FFN layer as described.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """
    Implements the SwiGLU-based Feed-Forward Network, which is a modern
    standard in high-performance LLMs like Llama and PaLM.
    """
    def __init__(self, d_model: int, d_ff: int, bias: bool = False):
        """
        Args:
            d_model: The dimension of the model.
            d_ff: The intermediate dimension of the FFN. It is often set to
                (2/3) * 4 * d_model to have a similar parameter count to a
                standard FFN.
            bias: Whether to include bias terms in the linear layers.
                Modern practice often omits them.
        """
        super().__init__()
        # The first linear layer, for the gate path (W_1)
        self.w_gate = nn.Linear(d_model, d_ff, bias=bias)
        # The second linear layer, for the value path (W_2)
        self.w_value = nn.Linear(d_model, d_ff, bias=bias)
        # The final down-projection linear layer (W_3)
        self.w_out = nn.Linear(d_ff, d_model, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (B, L, d_model)
        # 1. Project the input for the gate and value paths in parallel
        gate_proj = self.w_gate(x)    # (B, L, d_ff)
        value_proj = self.w_value(x)  # (B, L, d_ff)

        # 2. Apply the gating mechanism: SiLU (Swish) on the gate projection,
        #    multiplied element-wise with the value projection
        gated_output = F.silu(gate_proj) * value_proj

        # 3. Apply the final down-projection back to (B, L, d_model)
        output = self.w_out(gated_output)
        return output
```
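A short usage sketch, continuing from the module above, with dimensions chosen only for illustration:

```python
d_model = 512
d_ff = int(2 / 3 * 4 * d_model)       # 1365; real models usually round to a hardware-friendly multiple

ffn = SwiGLUFFN(d_model=d_model, d_ff=d_ff)
x = torch.randn(2, 128, d_model)      # (B, L, d_model)
y = ffn(x)
print(y.shape)                        # torch.Size([2, 128, 512])
```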
6. Key References
- Canonical Paper on GLU Variants: This paper systematically studied different gating mechanisms in Transformers and found SwiGLU to be highly effective.
- Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.
- Prominent Application in SOTA Models: The Llama paper is a key example of a model that successfully incorporates SwiGLU as a core component, citing its benefits for performance.
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.