Quantizing Machine Learning Models: A Comprehensive Tutorial

Table of Contents

  1. Introduction
  2. Mathematical Foundations of Quantization
  3. Quantization Techniques
  4. Post-Training Quantization (PTQ)
  5. Quantization-Aware Training (QAT)
  6. Comparing PTQ and QAT
  7. Mixed Precision Quantization
  8. Layer Fusion in Quantization
  9. Batch Normalization in Quantization
  10. Common Issues and Debugging Techniques
  11. Quantizing Weights vs. Activations
  12. Conclusion
  13. References

1. Introduction

With the proliferation of deep learning models in various applications, deploying these models on resource-constrained devices like mobile phones, embedded systems, and IoT devices has become essential. Quantization is a key technique that reduces the model size and computational requirements by converting floating-point numbers to lower-precision representations, such as integers.

This tutorial provides an in-depth exploration of quantizing machine learning models. We will delve into the mathematical underpinnings, practical implementations using PyTorch, and advanced topics like mixed precision quantization and layer fusion. By the end of this tutorial, you will have a comprehensive understanding of quantization techniques and how to apply them effectively to optimize your machine learning models.


2. Mathematical Foundations of Quantization

2.1 Understanding Quantization

Quantization in the context of machine learning refers to the process of mapping continuous-valued data (usually floating-point numbers) to discrete, finite sets of values (usually integers). This mapping reduces the precision of the numbers, which can significantly decrease memory usage and computational overhead.

Uniform Quantization is the most commonly used method, where the range of floating-point values is divided into equal intervals, and each interval is mapped to a discrete quantized value.

Affine Quantization introduces a zero-point offset to handle asymmetric ranges:

  • Quantization Function:
\[q = \text{Quantize}(x, s, z) = \text{clip}\left( \left\lfloor \frac{x}{s} \right\rceil + z, q_{\text{min}}, q_{\text{max}} \right)\]
  • \(x\): Original floating-point value
  • \(q\): Quantized integer value
  • \(s\): Scale factor
  • \(z\): Zero-point (integer)
  • \(\left\lfloor \cdot \right\rceil\): Rounding to nearest integer
  • \(q_{\text{min}}, q_{\text{max}}\): Minimum and maximum quantized values

  • Dequantization Function:
\[x \approx \text{Dequantize}(q, s, z) = s \times (q - z)\]

Key Points:

  • Scale Factor (\(s\)) determines the mapping between floating-point and quantized values.
  • Zero-Point (\(z\)) aligns the zero between the two representations, allowing for asymmetric ranges.
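To make the affine mapping concrete, here is a minimal sketch of the quantize and dequantize functions in PyTorch, following the formulas above; the example tensor, scale, and zero-point are assumptions chosen for illustration.

import torch

def quantize(x, s, z, q_min=-128, q_max=127):
    # Round to the nearest integer, shift by the zero-point, and clip to the target range
    q = torch.clamp(torch.round(x / s) + z, q_min, q_max)
    return q.to(torch.int8)

def dequantize(q, s, z):
    # Map the integers back to (approximate) floating-point values
    return s * (q.float() - z)

x = torch.tensor([0.05, -0.12, 0.30])   # example floating-point values (assumed)
s, z = 0.01, 0                          # example scale and zero-point (assumed)
q = quantize(x, s, z)
x_hat = dequantize(q, s, z)
print(q, x_hat)                         # x_hat - x is the quantization error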

Relationship Between Zero-Point, Scale, Mean, and Standard Deviation:

  • Scale Factor (\(s\)) and Standard Deviation (\(\sigma\))

    The scale factor \(s\) is often related to the spread of the data, which is captured by the standard deviation \(\sigma\). A common approach is to set the scale factor based on the standard deviation to ensure that the quantization bins cover the significant data distribution.

    For instance:

\[s = \frac{2k \sigma}{q_{\text{max}} - q_{\text{min}}}\]
  • \(k\) is a constant (e.g., \(k = 3\)) to cover a certain confidence interval of the data (e.g., 99.7% for \(\pm 3\sigma\)).

  • Zero-Point (\(z\)) and Mean (\(\mu\))

    The zero-point \(z\) is set to align the mean of the data distribution with the zero level of the quantized integer range, minimizing the quantization error around the mean value.

\[\begin{aligned} q_{\text{max}} &\approx \frac{\mu + k\sigma}{s} + z\\ q_{\text{min}} &\approx \frac{\mu - k\sigma}{s} + z \end{aligned}\]

Thus,

\[z = -\frac{\mu}{s} + \text{zero}\_\text{level}\]
  • \(\text{zero}\_\text{level}\) is typically \(\frac{q_{\text{min}} + q_{\text{max}}}{2}\) for symmetric quantization ranges.

Explanation:

  • By relating the scale factor to the standard deviation, we ensure that the quantization levels are appropriately spaced to capture the variability in the data.
  • Aligning the zero-point with the mean centers the quantization range around the most frequently occurring values, reducing the average quantization error.

Practical Implementation:

  • Scale Calculation:

    Assuming the data range is \([A_{\text{min}}, A_{\text{max}}]\), where:

\[A_{\text{min}} = \mu - k\sigma, \quad A_{\text{max}} = \mu + k\sigma\]

The scale factor \(s\) is:

\[s = \frac{A_{\text{max}} - A_{\text{min}}}{q_{\text{max}} - q_{\text{min}}}\]
  • Zero-Point Calculation:
\[z = -\frac{\mu}{s} + \frac{q_{\text{min}} + q_{\text{max}}}{2}\]
  • Example:

    Suppose we have:

    • Mean \(\mu = 0\)
    • Standard deviation \(\sigma = 0.1\)
    • \(q_{\text{min}} = -128\), \(q_{\text{max}} = 127\) (for signed 8-bit integers)
    • \(k = 3\) (to cover \(\pm 3\sigma\))

    Then:

    • \(A_{\text{min}} = -0.3\), \(A_{\text{max}} = 0.3\)
    • \[s = \frac{0.3 - (-0.3)}{127 - (-128)} = \frac{0.6}{255} \approx 0.00235\]
    • \(z = -\frac{0}{0.00235} + \frac{-128 + 127}{2} = -0.5 \approx 0\) (since the mean is zero; the zero-point is rounded to the nearest integer)
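The same calculation is easy to script. The snippet below is a small sketch under the assumptions of this example (signed 8-bit range, \(k = 3\)); the synthetic data tensor is likewise an assumption.

import torch

def scale_and_zero_point(x, k=3, q_min=-128, q_max=127):
    # Clipping range derived from the mean and standard deviation
    mu, sigma = x.mean(), x.std()
    a_min, a_max = mu - k * sigma, mu + k * sigma
    s = (a_max - a_min) / (q_max - q_min)
    # Align the zero-point with the mean, centered on the integer range
    z = int(round((-mu / s + (q_min + q_max) / 2).item()))
    return s.item(), z

x = torch.randn(10_000) * 0.1   # synthetic data with mean ~ 0 and std ~ 0.1 (assumed)
s, z = scale_and_zero_point(x)
print(s, z)                     # s is close to 0.00235 and z is close to 0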

Conclusion:

  • The scale factor \(s\) is proportional to the standard deviation \(\sigma\), determining the granularity of quantization levels.
  • The zero-point \(z\) is related to the mean \(\mu\), aligning the quantization levels with the data distribution.

2.2 Mean and Variance of Tensors

The mean and variance of tensors are fundamental statistical measures that describe the distribution of tensor values. They are crucial in determining appropriate quantization parameters.

  • Mean (\(\mu\)) of a Tensor:
\[\mu = \frac{1}{N} \sum_{i=1}^{N} x_i\]
  • \(N\): Number of elements in the tensor
  • \(x_i\): Individual elements of the tensor

  • Variance (\(\sigma^2\)) of a Tensor:
\[\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\]

These statistics help in:

  • Choosing Quantization Ranges: Determining the minimum and maximum values for quantization.
  • Reducing Quantization Error: By understanding the distribution, we can minimize the error introduced during quantization.
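As a quick illustration, the snippet below computes these statistics for an example tensor (the shape is an assumption); the population variance (division by \(N\)) matches the formula above.

import torch

x = torch.randn(4, 16, 32, 32)   # example activation tensor (assumed shape)
mu = x.mean()                    # mean over all N elements
var = x.var(unbiased=False)      # population variance (divides by N, matching the formula)
print(mu.item(), var.item())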

2.3 Role of Histograms in Quantization

Histograms provide a visual representation of the distribution of tensor values, showing how frequently each value or range of values occurs.

In Quantization:

  • Range Determination: Histograms help identify the range within which most tensor values lie.
  • Outlier Detection: They reveal outliers that may disproportionately affect quantization if included in the range.
  • Optimal Clipping: By analyzing histograms, we can decide on clipping thresholds to exclude extreme values, reducing quantization error.

Mathematical Formulation:

  • Histogram Binning:

    The tensor values are divided into \(B\) bins over the range \([x_{\text{min}}, x_{\text{max}}]\). The bin width \(w\) is:

\[w = \frac{x_{\text{max}} - x_{\text{min}}}{B}\]
  • Frequency Count:

    For each bin \(b\), count the number of tensor values \(n_b\) that fall into that bin.

Using Histograms to Compute Quantization Parameters:

  • Optimal Scale (\(s\)) and Zero-Point (\(z\)) Selection:

    By minimizing the quantization error \(E\):

\[E = \sum_{i=1}^{N} (x_i - \text{Dequantize}(\text{Quantize}(x_i, s, z), s, z))^2\]

We can adjust \(s\) and \(z\) to minimize \(E\) based on the histogram data.
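One simple way to act on this is to sweep candidate clipping ranges and keep the one that minimizes the reconstruction error \(E\). The sketch below is a brute-force illustration of that idea rather than a production calibration algorithm; the symmetric candidate ranges and the grid of 50 candidates are assumptions.

import torch

def quant_dequant(x, a_min, a_max, q_min=-128, q_max=127):
    # Affine quantize-then-dequantize using the clipping range [a_min, a_max]
    s = (a_max - a_min) / (q_max - q_min)
    z = q_min - a_min / s
    q = torch.clamp(torch.round(x / s) + z, q_min, q_max)
    return s * (q - z)

def best_clipping_range(x, num_candidates=50):
    best_err, best_range = float('inf'), None
    max_abs = x.abs().max()
    for frac in torch.linspace(0.5, 1.0, num_candidates):
        a_max = frac * max_abs
        err = ((x - quant_dequant(x, -a_max, a_max)) ** 2).sum().item()
        if err < best_err:
            best_err, best_range = err, (-a_max.item(), a_max.item())
    return best_range, best_err

x = torch.randn(100_000) * 0.1   # example tensor (assumed)
print(best_clipping_range(x))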


3. Quantization Techniques

3.1 Quantizing Weights

Weights in neural networks are the parameters learned during training. Quantizing weights reduces model size and speeds up inference.

Methods:

  • Per-Tensor Quantization: A single scale and zero-point for the entire weight tensor.
  • Per-Channel Quantization: Different scales and zero-points for each output channel, which can significantly improve accuracy, especially in convolutional layers.

Per-Channel Quantization Formula:

For each output channel \(k\):

\[q_{W_k} = \text{Quantize}(W_k, s_k, z_k)\]
  • \(W_k\): Weights for channel \(k\)
  • \(s_k\), \(z_k\): Scale and zero-point for channel \(k\)

Benefits:

  • Reduced Quantization Error: By tailoring quantization parameters to each channel.
  • Better Representation: Accounts for the varying distributions across channels.
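Below is a hedged sketch of per-channel weight quantization using a symmetric scheme (zero-point fixed at 0), which is the common choice for weights; the example weight tensor shape is an assumption.

import torch

def quantize_per_channel(w, q_min=-128, q_max=127):
    # One symmetric scale per output channel; w has shape [out_channels, ...]
    out_channels = w.shape[0]
    w_flat = w.reshape(out_channels, -1)
    max_abs = w_flat.abs().max(dim=1).values           # per-channel absolute maximum
    scales = max_abs / q_max                            # symmetric scale, zero-point = 0
    q = torch.clamp(torch.round(w_flat / scales.unsqueeze(1)), q_min, q_max)
    return q.reshape(w.shape).to(torch.int8), scales

w = torch.randn(16, 3, 3, 3)    # example conv weight [out, in, kH, kW] (assumed)
q_w, scales = quantize_per_channel(w)
w_hat = q_w.float() * scales.reshape(-1, 1, 1, 1)      # dequantize for comparison
print((w - w_hat).abs().max())                          # per-channel quantization error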

3.2 Quantizing Activations

Activations are the outputs of layers during inference. Quantizing activations reduces memory bandwidth and computational cost.

Challenges:

  • Activations are data-dependent and vary with each input.
  • Need to ensure that the quantization parameters cover the dynamic range of activations.

Static vs. Dynamic Quantization:

  • Static Quantization:

    • Quantization parameters are determined during a calibration phase using a representative dataset.
    • Advantage: Consistent performance since parameters are fixed.
    • Disadvantage: Requires a calibration step.
  • Dynamic Quantization:

    • Quantization parameters are calculated on-the-fly during inference.
    • Advantage: No calibration needed.
    • Disadvantage: Increased computational overhead and potential variability in performance.

Activation Quantization Formula:

\[q_A = \text{Quantize}(A, s_A, z_A)\]
  • \(A\): Activation tensor
  • \(s_A\), \(z_A\): Scale and zero-point for activations

Determining Activation Quantization Parameters:

  • Calibration Dataset: Run a set of inputs through the model and collect statistics.
  • Histogram Collection: Build histograms of activations to determine optimal ranges.
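Here is a sketch of how such statistics can be collected with forward hooks during calibration; the choice of layer types and the calibration loader are assumptions.

import torch
import torch.nn as nn

def collect_activation_ranges(model, data_loader, max_batches=10):
    stats = {}   # module name -> (min, max) observed over the calibration data
    def make_hook(name):
        def hook(_module, _inputs, output):
            lo, hi = output.min().item(), output.max().item()
            cur = stats.get(name, (float('inf'), float('-inf')))
            stats[name] = (min(cur[0], lo), max(cur[1], hi))
        return hook
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, (nn.Conv2d, nn.Linear, nn.ReLU))]
    model.eval()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            if i >= max_batches:
                break
            model(images)
    for h in handles:
        h.remove()
    return stats

# Usage (assuming a model and a calibration data loader exist):
# activation_ranges = collect_activation_ranges(model, calib_loader)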

4. Post-Training Quantization (PTQ)

4.1 Overview of PTQ

Post-Training Quantization converts a pre-trained floating-point model into a quantized model without additional training. It’s an effective method when retraining is not feasible.

Process:

  1. Model Conversion: Replace floating-point operations with quantized equivalents.
  2. Calibration: Use a calibration dataset to collect activation statistics. Weights can be quantized directly from their own minimum and maximum values (per tensor or per channel), but activation ranges depend on the input data, so a calibration step is needed: representative inputs are fed through the model to record the minimum and maximum activation values.
  3. Parameter Assignment: Compute and assign quantization parameters based on collected statistics.

4.2 Mathematical Formulation

Calibration Data Collection:

  • For each layer’s activations \(A\), collect min and max values:
\[A_{\text{min}} = \min(A)\] \[A_{\text{max}} = \max(A)\]
  • Compute the scale (\(s_A\)) and zero-point (\(z_A\)):
\[s_A = \frac{A_{\text{max}} - A_{\text{min}}}{q_{\text{max}} - q_{\text{min}}}\] \[z_A = q_{\text{min}} - \frac{A_{\text{min}}}{s_A}\]

Quantization of Weights and Activations:

  • Use the quantization function with computed \(s\) and \(z\) to quantize weights and activations.

4.3 Implementing PTQ in PyTorch

Let’s implement PTQ on a simple convolutional neural network.

Step 1: Define the Model

import torch
import torch.nn as nn
import torch.quantization as quantization

class PTQModel(nn.Module):
    def __init__(self):
        super(PTQModel, self).__init__()
        self.quant = quantization.QuantStub()    # quantizes the float input for the quantized backend
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)  # [batch, 3, H, W] -> [batch, 16, H, W]
        self.bn1 = nn.BatchNorm2d(16)
        self.relu1 = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # [batch, 16, H, W] -> [batch, 16, H/2, W/2]
        self.fc = nn.Linear(16 * 16 * 16, 10)  # Assuming input images are 32x32
        self.dequant = quantization.DeQuantStub()  # dequantizes the output back to float

    def forward(self, x):
        x = self.quant(x)             # [batch, 3, 32, 32]
        x = self.conv1(x)             # [batch, 16, 32, 32]
        x = self.bn1(x)               # [batch, 16, 32, 32]
        x = self.relu1(x)             # [batch, 16, 32, 32]
        x = self.pool(x)              # [batch, 16, 16, 16]
        x = x.view(x.size(0), -1)     # [batch, 16*16*16]
        x = self.fc(x)                # [batch, 10]
        x = self.dequant(x)           # [batch, 10]
        return x

model_fp32 = PTQModel()

Step 2: Prepare the Model for Quantization

from torch.quantization import fuse_modules

model_fp32.eval()

# Fuse modules
fuse_modules(model_fp32, [['conv1', 'bn1', 'relu1']], inplace=True)

Explanation:

  • Module Fusion: Combines conv1, bn1, and relu1 into a single operation for efficiency and better quantization.

Step 3: Specify Quantization Configuration

import torch.quantization as quantization

model_fp32.qconfig = quantization.default_qconfig  # Use default quantization configuration

# Prepare the model for static quantization
quantization.prepare(model_fp32, inplace=True)

Step 4: Calibration

# Define a calibration function
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for images, _ in data_loader:
            model(images)  # Run images through the model to collect activation statistics

# Assume data_loader is defined and provides calibration data
# calibrate(model_fp32, data_loader)

Step 5: Convert to Quantized Model

# Convert the model to quantized version
model_int8 = quantization.convert(model_fp32)

Step 6: Evaluate the Quantized Model

# Evaluate the quantized model on a test dataset
def evaluate(model, data_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in data_loader:
            outputs = model(images)  # Outputs are quantized
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print('Accuracy of the quantized model: %d %%' % (100 * correct / total))

# Assume test_loader is defined
# evaluate(model_int8, test_loader)

5. Detailed Explanation of Quantization-Aware Training (QAT)

Introduction

Quantization-Aware Training (QAT) is a technique used to simulate the effects of quantization during the training of a neural network. The goal is to prepare the model to better handle the reduced precision of quantized representations (such as 8-bit integers) when deployed on hardware that supports lower-precision arithmetic. By incorporating quantization effects into the training process, QAT helps mitigate the accuracy loss that typically accompanies post-training quantization methods.


What is Quantization-Aware Training?

Quantization-Aware Training integrates quantization operations into the training graph of a neural network. It does this by:

  • Inserting Fake Quantization Nodes into the computation graph during training.
  • Simulating Quantization Effects on weights and activations without actually changing their underlying data types (they remain in floating-point).
  • Allowing Gradients to Flow through these fake quantization nodes during backpropagation.

The result is a model that has learned to be robust to the quantization errors that will be present during inference when the model is fully quantized.


Understanding Fake Quantization

What is Fake Quantization?

Fake Quantization refers to the practice of simulating quantization operations within the training graph without changing the data types of the tensors involved. Essentially, the values are quantized and dequantized back to floating-point within the forward pass, but the underlying data type remains the same (typically 32-bit floating-point). This allows for:

  • Simulating Quantization Errors: The model experiences the quantization effects during training.
  • Maintaining Floating-Point Precision for Updates: The parameters are updated using high-precision floating-point values during backpropagation.

How Fake Quantization Works

In practice, fake quantization involves:

  • Quantizing a Tensor: Mapping the floating-point values to a lower-precision representation (e.g., 8-bit integers).
  • Dequantizing Back to Floating-Point: Converting the quantized values back to floating-point numbers before further processing.

Mathematically:

Given a floating-point tensor \(x\), fake quantization is applied as:

  1. Quantization:

    \[q = \text{Quantize}(x) = \left\lfloor \frac{x}{s} \right\rceil + z\]
  2. Dequantization:

    \[\hat{x} = \text{Dequantize}(q) = s (q - z)\]
  3. Result:

    • \(\hat{x}\) is the fake-quantized tensor used in subsequent computations.
    • \(x\) and \(\hat{x}\) are both in floating-point format, but \(\hat{x}\) simulates the quantization error.

Variables:

  • \(s\): Scale factor.
  • \(z\): Zero-point.
  • \(\left\lfloor \cdot \right\rceil\): Rounding to the nearest integer.
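In code, a fake-quantize step is simply a quantize followed immediately by a dequantize, with everything staying in floating point. A minimal sketch (the scale, zero-point, and input values are assumptions):

import torch

def fake_quantize(x, s, z, q_min=-128, q_max=127):
    # Quantize: round, shift by the zero-point, and clip to the integer range
    q = torch.clamp(torch.round(x / s) + z, q_min, q_max)
    # Dequantize immediately: the result stays a float tensor but carries quantization error
    return s * (q - z)

x = torch.tensor([1.53, -2.31])
x_hat = fake_quantize(x, s=0.2, z=0)
print(x, x_hat)   # compare x_hat with x to see the simulated quantization error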

Computational Graphs in QAT

How Many Graphs are Used?

In QAT, the computational graph consists of:

  1. Forward Graph:

    • Contains fake quantization nodes.
    • Simulates quantization during the forward pass.
  2. Backward Graph:

    • Allows gradients to flow through fake quantization nodes.
    • Uses the Straight-Through Estimator (STE) to approximate gradients through non-differentiable quantization operations.

Note: Although we often think of separate forward and backward graphs, they are part of the same computational graph in frameworks like TensorFlow and PyTorch, with automatic differentiation handling the backward pass.


Detailed Explanation of the Forward and Backward Passes in QAT

Forward Pass with Fake Quantization

  1. Input Quantization:

    • Inputs to layers are fake-quantized.
    • Simulates the quantization of activations.
  2. Weight Quantization:

    • Weights are fake-quantized before convolution or matrix multiplication.
    • Simulates quantization of weights.
  3. Layer Computations:

    • Operations are performed using the fake-quantized inputs and weights.
    • Outputs are fake-quantized for subsequent layers.

Illustration:

For a linear layer:

  • Input: \(x\)
  • Weights: \(W\)
  • Biases: \(b\)

The computation is:

\[\begin{aligned} \hat{x} &= \text{FakeQuantize}(x) \\ \hat{W} &= \text{FakeQuantize}(W) \\ y &= \hat{x} \cdot \hat{W} + b \\ \hat{y} &= \text{FakeQuantize}(y) \end{aligned}\]

Backward Pass with Straight-Through Estimator (STE)

Challenge:

  • Quantization involves rounding operations, which are non-differentiable.
  • Direct computation of gradients through these operations is not possible.

Solution: Straight-Through Estimator (STE)

  • STE approximates the gradient of the quantization function as if it were an identity function in the backward pass.

Mathematical Representation:

Given the quantization function:

\[q = \left\lfloor \frac{x}{s} \right\rceil + z\]

The derivative \(\frac{\partial q}{\partial x}\) is zero almost everywhere due to the rounding operation. Using STE:

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial x} \approx \frac{\partial L}{\partial q}\]
  • Assumption: \(\frac{\partial q}{\partial x} \approx 1\) within the quantization range.
  • Result: Allows gradients to pass through the quantization nodes unchanged during backpropagation.
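To see how STE can be expressed explicitly, here is a conceptual sketch using a custom autograd function: the forward pass rounds, while the backward pass lets the gradient through unchanged. PyTorch's built-in QAT modules handle this internally, so this is only an illustration.

import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)      # non-differentiable rounding in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output         # STE: pretend d(round)/dx = 1

def fake_quantize_ste(x, s, z, q_min=-128, q_max=127):
    q = torch.clamp(RoundSTE.apply(x / s) + z, q_min, q_max)
    return s * (q - z)

x = torch.tensor([0.37, -1.42], requires_grad=True)
y = fake_quantize_ste(x, s=0.1, z=0)
y.sum().backward()
print(x.grad)                      # gradients flow through as if quantization were the identity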

Intuition Behind STE

  • Idea: Treat the quantization operation as if it were the identity function during backpropagation.
  • Purpose: Enables the model to learn how to adjust weights despite the non-differentiable quantization steps.
  • Effect: The model parameters are updated as if there were no quantization during training, but the forward pass still simulates quantization errors.

Detailed Steps in Quantization-Aware Training

1. Model Preparation

  • Insert Fake Quantization Nodes:

    • At appropriate points in the model (after activations, before weights).
    • Use special layers or functions provided by the deep learning framework.
  • Configure Quantization Parameters:

    • Define scale factors and zero-points for weights and activations.
    • Can be learned during training or set based on calibration data.

2. Training Loop

  • Forward Pass:

    • Compute outputs using fake-quantized weights and activations.
    • Simulate quantization effects throughout the model.
  • Loss Computation:

    • Calculate the loss using the outputs from the fake-quantized model.
    • The loss reflects the model’s performance under quantization.
  • Backward Pass:

    • Compute gradients using STE through fake quantization nodes.
    • Update model parameters using an optimizer.

3. Convergence and Fine-Tuning

  • Convergence:

    • Training continues until the model converges or meets performance criteria.
    • The model learns to compensate for quantization errors.
  • Fine-Tuning:

    • Optional phase to further adjust quantization parameters.
    • Can involve additional calibration or adjustment of scale factors.

4. Model Conversion for Inference

  • Replace Fake Quantization Nodes:

    • Convert fake quantization nodes to actual quantization operations.
    • Change data types from floating-point to integers where appropriate.
  • Export Quantized Model:

    • Save the model in a format suitable for deployment on target hardware.
    • Ensure compatibility with hardware-specific quantization support.

Example Implementation in PyTorch

Note: This example assumes familiarity with PyTorch’s quantization APIs.

Step 1: Define the Model with Quantization Stubs

import torch
import torch.nn as nn
import torch.quantization as quantization

class QATModel(nn.Module):
    def __init__(self):
        super(QATModel, self).__init__()
        # Define layers
        self.quant = quantization.QuantStub()     # Quantization stub
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(16)
        self.relu1 = nn.ReLU()
        self.fc = nn.Linear(16 * 32 * 32, 10)
        self.dequant = quantization.DeQuantStub() # Dequantization stub

    def forward(self, x):
        x = self.quant(x)             # Fake quantize input
        x = self.conv1(x)             # Convolution
        x = self.bn1(x)               # BatchNorm
        x = self.relu1(x)             # ReLU activation
        x = x.view(x.size(0), -1)     # Flatten
        x = self.fc(x)                # Fully connected layer
        x = self.dequant(x)           # Dequantize output
        return x

model = QATModel()

Step 2: Prepare the Model for QAT

  • Fuse Modules:

    # Fuse Conv, BatchNorm, and ReLU before inserting fake-quantization observers
    model.train()
    quantization.fuse_modules(model, [['conv1', 'bn1', 'relu1']], inplace=True)
    # Note: newer PyTorch versions provide fuse_modules_qat for models in training mode
    
  • Set QAT Configuration:

    # Set QAT configuration
    model.qconfig = quantization.get_default_qat_qconfig('fbgemm')  # For x86 CPUs
    # Prepare for QAT
    quantization.prepare_qat(model, inplace=True)
    

Step 3: Train the Model

  • Training Loop:

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    
    model.train()                            # QAT runs in training mode
    num_epochs = 5
    for epoch in range(num_epochs):
        for inputs, labels in train_loader:  # assume train_loader is defined
            optimizer.zero_grad()
            outputs = model(inputs)          # Forward pass with fake quantization
            loss = criterion(outputs, labels)
            loss.backward()                  # Backward pass with STE
            optimizer.step()
    

Step 4: Convert to a Quantized Model

  • Convert for Inference:

    model.eval()
    model_int8 = quantization.convert(model)
    
  • Explanation:

    • The convert function replaces fake quantization nodes with actual quantized operations and changes data types where possible.

Understanding Fake Quantization Through a Numerical Example

Introduction

To help you understand how fake quantization during training simulates the quantized inference model, I’ll provide a detailed numerical example. We’ll use the same numerical data for both training and inference. This example will demonstrate:

  • How fake quantization introduces quantization errors during training.
  • How these errors simulate the actual quantization effects during inference.
  • How the model learns to compensate for these errors.

Example Setup

We’ll consider a simple neural network with a single linear layer (fully connected layer) without any activation function for simplicity. The network performs the following computation:

\[y = W x + b\]

Where:

  • \(x\) is the input vector.
  • \(W\) is the weight matrix.
  • \(b\) is the bias vector.
  • \(y\) is the output vector.

Given Numerical Data

  • Input vector \(x = \begin{bmatrix} 1.5 \\ -2.3 \end{bmatrix}\)
  • Weight matrix \(W = \begin{bmatrix} 0.8 & -0.5 \\ 1.2 & 0.3 \end{bmatrix}\)
  • Bias vector \(b = \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix}\)

Quantization Parameters

  • Scale factor \(s = 0.2\)
  • Zero-point \(z = 0\)
  • Quantization bit-width: 8 bits (values range from -128 to 127 for signed integers)

Part 1: Training with Fake Quantization

During training, we simulate quantization effects using fake quantization. The computations remain in floating-point precision, but quantization errors are introduced.

Step 1: Fake Quantize the Input Activations

Quantize \(x\):

\[q_x = \left\lfloor \dfrac{x}{s} \right\rceil + z = \left\lfloor \dfrac{\begin{bmatrix} 1.5 \\ -2.3 \end{bmatrix}}{0.2} \right\rceil + 0 = \left\lfloor \begin{bmatrix} 7.5 \\ -11.5 \end{bmatrix} \right\rceil = \begin{bmatrix} 8 \\ -12 \end{bmatrix}\]

Dequantize \(q_x\) to get \(\hat{x}\):

\[\hat{x} = s (q_x - z) = 0.2 \times \begin{bmatrix} 8 \\ -12 \end{bmatrix} = \begin{bmatrix} 1.6 \\ -2.4 \end{bmatrix}\]

Quantization Error in \(x\):

\[\Delta x = \hat{x} - x = \begin{bmatrix} 1.6 - 1.5 \\ -2.4 - (-2.3) \end{bmatrix} = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix}\]

Step 2: Fake Quantize the Weights

Quantize \(W\):

\[q_W = \left\lfloor \dfrac{W}{s} \right\rceil + z = \left\lfloor \dfrac{\begin{bmatrix} 0.8 & -0.5 \\ 1.2 & 0.3 \end{bmatrix}}{0.2} \right\rceil + 0 = \left\lfloor \begin{bmatrix} 4 & -2.5 \\ 6 & 1.5 \end{bmatrix} \right\rceil = \begin{bmatrix} 4 & -2 \\ 6 & 2 \end{bmatrix}\]

Dequantize \(q_W\) to get \(\hat{W}\):

\[\hat{W} = s (q_W - z) = 0.2 \times \begin{bmatrix} 4 & -2 \\ 6 & 2 \end{bmatrix} = \begin{bmatrix} 0.8 & -0.4 \\ 1.2 & 0.4 \end{bmatrix}\]

Quantization Error in \(W\):

\[\Delta W = \hat{W} - W = \begin{bmatrix} 0.8 - 0.8 & -0.4 - (-0.5) \\ 1.2 - 1.2 & 0.4 - 0.3 \end{bmatrix} = \begin{bmatrix} 0 & 0.1 \\ 0 & 0.1 \end{bmatrix}\]

Step 3: Compute Output Using Fake Quantized Values

Compute \(\hat{y} = \hat{W} \hat{x} + b\):

  1. Matrix Multiplication:

    \[\hat{W} \hat{x} = \begin{bmatrix} 0.8 & -0.4 \\ 1.2 & 0.4 \end{bmatrix} \begin{bmatrix} 1.6 \\ -2.4 \end{bmatrix} = \begin{bmatrix} (0.8)(1.6) + (-0.4)(-2.4) \\ (1.2)(1.6) + (0.4)(-2.4) \end{bmatrix} = \begin{bmatrix} 1.28 + 0.96 \\ 1.92 - 0.96 \end{bmatrix} = \begin{bmatrix} 2.24 \\ 0.96 \end{bmatrix}\]
  2. Add Bias:

    \[\hat{y} = \hat{W} \hat{x} + b = \begin{bmatrix} 2.24 \\ 0.96 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix} = \begin{bmatrix} 2.34 \\ 0.76 \end{bmatrix}\]

Step 4: Compute Original Output Without Quantization

Compute \(y = W x + b\):

  1. Matrix Multiplication:

    \[W x = \begin{bmatrix} 0.8 & -0.5 \\ 1.2 & 0.3 \end{bmatrix} \begin{bmatrix} 1.5 \\ -2.3 \end{bmatrix} = \begin{bmatrix} (0.8)(1.5) + (-0.5)(-2.3) \\ (1.2)(1.5) + (0.3)(-2.3) \end{bmatrix} = \begin{bmatrix} 1.2 + 1.15 \\ 1.8 - 0.69 \end{bmatrix} = \begin{bmatrix} 2.35 \\ 1.11 \end{bmatrix}\]
  2. Add Bias:

    \[y = W x + b = \begin{bmatrix} 2.35 \\ 1.11 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix} = \begin{bmatrix} 2.45 \\ 0.91 \end{bmatrix}\]

Step 5: Quantization Error in Output

Quantization Error in \(y\):

\[\Delta y = \hat{y} - y = \begin{bmatrix} 2.34 - 2.45 \\ 0.76 - 0.91 \end{bmatrix} = \begin{bmatrix} -0.11 \\ -0.15 \end{bmatrix}\]

Interpretation During Training:

  • The model experiences quantization errors in inputs, weights, and outputs.
  • During backpropagation, gradients flow through the fake quantization nodes using the Straight-Through Estimator (STE).
  • The model adjusts its weights and biases to minimize the loss, learning to compensate for quantization errors.

Part 2: Inference with Actual Quantization

During inference, the model uses quantized weights and activations, and computations are performed using integer arithmetic where possible.

Step 1: Quantize the Weights (Already Done)

  • Quantized Weights: \(q_W = \begin{bmatrix} 4 & -2 \\ 6 & 2 \end{bmatrix}\) (stored as integers)

Step 2: Quantize the Input Activations

Assuming the same input:

\[x_{\text{inference}} = \begin{bmatrix} 1.5 \\ -2.3 \end{bmatrix}\]

Quantize \(x_{\text{inference}}\):

\[q_x = \left\lfloor \dfrac{x_{\text{inference}}}{s} \right\rceil + z = \left\lfloor \dfrac{\begin{bmatrix} 1.5 \\ -2.3 \end{bmatrix}}{0.2} \right\rceil + 0 = \begin{bmatrix} 8 \\ -12 \end{bmatrix}\]

Step 3: Compute Output Using Quantized Values

Compute \(q_y = q_W q_x\) Using Integer Arithmetic:

\[q_y = q_W q_x = \begin{bmatrix} 4 & -2 \\ 6 & 2 \end{bmatrix} \begin{bmatrix} 8 \\ -12 \end{bmatrix} = \begin{bmatrix} (4)(8) + (-2)(-12) \\ (6)(8) + (2)(-12) \end{bmatrix} = \begin{bmatrix} 32 + 24 \\ 48 - 24 \end{bmatrix} = \begin{bmatrix} 56 \\ 24 \end{bmatrix}\]

Compute the Output Scale Factor \(s_y\):

\[s_y = s_W \times s_x = 0.2 \times 0.2 = 0.04\]

Dequantize \(q_y\) to Get \(\hat{y}\):

\[\hat{y} = s_y q_y + b = 0.04 \times \begin{bmatrix} 56 \\ 24 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.2 \end{bmatrix} = \begin{bmatrix} 2.24 + 0.1 \\ 0.96 - 0.2 \end{bmatrix} = \begin{bmatrix} 2.34 \\ 0.76 \end{bmatrix}\]

Step 4: Quantization Error in Output

Compare with Original Output \(y\):

\[y = \begin{bmatrix} 2.45 \\ 0.91 \end{bmatrix}\]

Quantization Error in \(y\):

\[\Delta y_{\text{inference}} = \hat{y} - y = \begin{bmatrix} 2.34 - 2.45 \\ 0.76 - 0.91 \end{bmatrix} = \begin{bmatrix} -0.11 \\ -0.15 \end{bmatrix}\]

Observation:

  • The quantization errors during inference are identical to those experienced during training with fake quantization.
  • This shows that fake quantization during training accurately simulates the quantization effects during inference.
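The script below is an illustrative sketch that reproduces both paths with the values from this example and verifies that the two outputs agree. Note that ratios such as 1.5 / 0.2 are not exactly representable in floating point, so the rounded integers may land one step away from the hand calculation, but the training and inference paths still produce identical results.

import torch

s = 0.2                                       # shared scale for weights and activations
x = torch.tensor([1.5, -2.3])
W = torch.tensor([[0.8, -0.5], [1.2, 0.3]])
b = torch.tensor([0.1, -0.2])

# Training path: fake-quantize inputs and weights, then compute in floating point
q_x = torch.round(x / s)                      # zero-point is 0 in this example
q_W = torch.round(W / s)
y_train = (s * q_W) @ (s * q_x) + b

# Inference path: integer-valued matmul, rescale by s_W * s_x, then add the float bias
q_y = q_W @ q_x
y_infer = (s * s) * q_y + b

print(y_train, y_infer)
print(torch.allclose(y_train, y_infer))       # True: both paths see the same quantization error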

Understanding the Simulation

How Fake Quantization Simulates Quantized Inference

  1. Introduction of Quantization Errors:

    • During training, fake quantization quantizes and dequantizes the activations and weights, introducing quantization errors.
    • These errors affect the outputs, just like actual quantization during inference.
  2. Learning to Compensate:

    • The model learns to adjust its parameters to minimize the loss, despite the presence of quantization errors.
    • This results in a model that is robust to quantization.
  3. Consistency Between Training and Inference:

    • Since the quantization errors are the same, the model’s performance during inference closely matches its performance during training.
    • This minimizes the accuracy drop typically seen when quantizing a model post-training without QAT.

Key Points

  • Fake Quantization:

    • Simulates quantization effects during training without changing the data types.
    • Introduces quantization errors by quantizing and dequantizing tensors.
  • Quantization Errors:

    • Errors introduced due to the limited precision of quantized representations.
    • Affect the outputs of computations.
  • Straight-Through Estimator (STE):

    • Used during backpropagation to approximate gradients through non-differentiable quantization operations.
    • Assumes the derivative of the quantization function is 1 within the quantization range.
  • Model Adaptation:

    • The model adjusts its weights and biases to minimize the loss in the presence of quantization errors.
    • This leads to better performance when the model is quantized for inference.

Additional Notes

  1. In Quantization Aware Training (QAT), bias terms are typically not quantized and are instead maintained in higher precision, such as floating-point (e.g., FP32). This practice ensures that the addition of biases during the forward pass remains accurate, as quantizing biases can introduce significant errors that may degrade the model’s performance. By keeping biases in higher precision, QAT effectively balances the benefits of reduced precision for weights and activations with the need for precise bias calculations, thereby maintaining overall model accuracy while still leveraging the efficiency gains from quantization.
  2. In Quantization Aware Training (QAT), QuantStub and DeQuantStub are generally placed at the very beginning and end of the model, respectively, to quantize the input tensors and dequantize the output tensors. However, there are exceptions, especially in complex architectures with multiple branches or residual connections. In such cases, additional QuantStub and DeQuantStub instances may be needed within the network to ensure that intermediate tensors are correctly quantized and dequantized.
import torch
import torch.nn as nn
import torch.quantization

class MultiBranchModel(nn.Module):
    def __init__(self):
        super(MultiBranchModel, self).__init__()
        # Quantization stubs
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        
        # Branch 1
        self.branch1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1)
        )
        
        # Branch 2
        self.branch2 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, padding=2)
        )
        
        # Fusion layer
        self.fuse = nn.Conv2d(32, 32, kernel_size=1)
        self.relu = nn.ReLU()
        
        # Output layer
        self.output = nn.Linear(32 * 32 * 32, 10)  # Assuming input images are 32x32

        # Additional Quant/DeQuant stubs for branches
        self.quant_branch1 = torch.quantization.QuantStub()
        self.dequant_branch1 = torch.quantization.DeQuantStub()
        self.quant_branch2 = torch.quantization.QuantStub()
        self.dequant_branch2 = torch.quantization.DeQuantStub()

    def forward(self, x):
        # Quantize the input
        x = self.quant(x)
        
        # Branch 1
        b1 = self.quant_branch1(x)
        b1 = self.branch1(b1)
        b1 = self.dequant_branch1(b1)
        
        # Branch 2
        b2 = self.quant_branch2(x)
        b2 = self.branch2(b2)
        b2 = self.dequant_branch2(b2)
        
        # Concatenate branches
        fused = torch.cat([b1, b2], dim=1)
        fused = self.fuse(fused)
        fused = self.relu(fused)
        
        # Flatten and output
        fused = fused.view(fused.size(0), -1)
        out = self.output(fused)
        
        # Dequantize the output
        out = self.dequant(out)
        return out

# Instantiate the model
model = MultiBranchModel()

# Specify quantization configuration
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# Fuse modules where applicable (only Conv + ReLU pairs; ReLU followed by Conv is not a fusable pattern)
fuse_list = [
    ['branch1.0', 'branch1.1'],
    ['branch2.0', 'branch2.1']
]

# Apply fusion (newer PyTorch versions provide fuse_modules_qat for models in training mode)
torch.quantization.fuse_modules(model, fuse_list, inplace=True)

# Prepare the model for QAT
torch.quantization.prepare_qat(model, inplace=True)

# The model is now ready for QAT training

Intuitions Behind Quantization-Aware Training

1. Learning to Compensate for Quantization Errors

  • By exposing the model to quantization effects during training, it can adjust its parameters to minimize the impact of quantization.
  • Weights and biases are optimized to produce accurate results even when quantized.

2. Smooth Transition from Training to Inference

  • Since the model experiences quantization during training, the transition to the quantized inference model is smoother.
  • Reduces the discrepancy between training and inference behaviors.

3. Preserving Model Capacity

  • Unlike aggressive post-training quantization, QAT allows the model to retain more of its representational capacity.
  • The model learns representations that are inherently more robust to reduced precision.

Backpropagation in Quantization-Aware Training

Handling Non-Differentiable Operations

  • Problem: Quantization functions involve rounding, which is non-differentiable.
  • Solution: Use the Straight-Through Estimator (STE) to approximate gradients.

Mathematical Formulation

  • Forward Pass:

    \[\hat{x} = s \cdot \left( \left\lfloor \frac{x}{s} \right\rceil + z \right) - s \cdot z\]
    • \(x\): Input tensor.
    • \(s\): Scale factor.
    • \(z\): Zero-point.
    • \(\hat{x}\): Fake-quantized tensor.
  • Backward Pass with STE:

    \[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \hat{x}} \cdot \frac{\partial \hat{x}}{\partial x} \approx \frac{\partial L}{\partial \hat{x}} \cdot 1\]
    • Assumption: \(\frac{\partial \hat{x}}{\partial x} \approx 1\) within the quantization range.

Practical Implications

  • Gradient Flow: Allows gradients to pass through fake quantization nodes as if they were identity functions.
  • Weight Updates: Parameters are updated based on the loss computed with quantization effects included.

Benefits of Quantization-Aware Training

  1. Improved Accuracy:

    • Models trained with QAT typically achieve higher accuracy compared to models quantized post-training.
  2. Better Generalization:

    • By learning with quantization noise, the model may generalize better to new data.
  3. Compatibility with Hardware:

    • Prepares the model for deployment on hardware that requires or benefits from quantized computations.

Considerations and Best Practices

Calibration

  • Importance: Accurate scale factors and zero-points are critical.
  • Method: Use a representative calibration dataset to collect statistics.

Learning Quantization Parameters

  • Some frameworks allow the scale factors to be learned during training.
  • This can lead to better alignment between quantization parameters and data distribution.

Choosing Quantization Bits

  • Common Practice: Use 8-bit quantization for a balance between efficiency and accuracy.
  • Lower Bit-Widths: More aggressive quantization (e.g., 4-bit) may require more careful tuning and can benefit more from QAT.

Layer Fusion

  • Purpose: Combining layers (e.g., Conv + BatchNorm + ReLU) reduces quantization errors between layers.
  • Impact: Simplifies the computation graph and improves efficiency.

Avoiding Saturation and Clipping

  • Issue: Activations may saturate the quantization range, leading to information loss.
  • Solution: Adjust the quantization ranges or use techniques like quantile-based scaling.

6. Comparing PTQ and QAT

6.1 Pros and Cons

Post-Training Quantization (PTQ):

  • Pros:
    • Simplicity: Easy to implement without modifying the training pipeline.
    • No Additional Training: Does not require retraining the model.
    • Fast Deployment: Quick way to reduce model size and latency.
  • Cons:
    • Accuracy Loss: May suffer significant accuracy degradation, especially for complex models or low-bit quantization.
    • Limited Optimization: Cannot compensate for quantization errors through training.

Quantization-Aware Training (QAT):

  • Pros:
    • Higher Accuracy: Retains model performance closer to the original floating-point model.
    • Robustness to Quantization Errors: Model learns to adjust weights to minimize quantization impact.
  • Cons:
    • Training Overhead: Requires additional training time and computational resources.
    • Complexity: Involves modifying the training pipeline and managing quantization configurations.

6.2 When to Use Which Method

  • PTQ is suitable when:
    • The model is small and not sensitive to quantization.
    • Quick deployment is needed without access to training infrastructure.
    • Minor accuracy loss is acceptable.
  • QAT is suitable when:
    • The model is large or complex, and accuracy is critical.
    • You have access to training data and computational resources.
    • You aim for the best possible performance in a quantized model.

7. Mixed Precision Quantization

7.1 Rationale Behind Mixed Precision

Mixed Precision Quantization involves using different bit-widths for different parts of the model. This approach aims to balance the trade-off between model size, computational efficiency, and accuracy.

Reasons to Use Mixed Precision:

  • Layer Sensitivity: Some layers are more sensitive to quantization than others (e.g., first and last layers).
  • Performance Optimization: Using higher precision where necessary and lower precision elsewhere can optimize performance without significant accuracy loss.
  • Hardware Constraints: Some hardware accelerators support mixed precision operations.

7.2 Implementing Mixed Integer Precision Quantization

Step 1: Define Custom Quantization Configurations

import torch
import torch.quantization
from torch.quantization.observer import MinMaxObserver
from torch.quantization.fake_quantize import FakeQuantize

# Custom observer for 4-bit quantization
def custom_4bit_observer():
    return MinMaxObserver(quant_min=0, quant_max=15, dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# Custom observer for 8-bit quantization
def custom_8bit_observer():
    return MinMaxObserver(quant_min=0, quant_max=255, dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# Custom qconfig for 4-bit activations and weights
qconfig_4bit = torch.quantization.QConfig(
    activation=FakeQuantize.with_args(observer=custom_4bit_observer, quant_min=0, quant_max=15),
    weight=FakeQuantize.with_args(observer=custom_4bit_observer, quant_min=0, quant_max=15)
)

# Default qconfig for 8-bit quantization
qconfig_8bit = torch.quantization.get_default_qat_qconfig('fbgemm')

Step 2: Assign QConfig to Model Layers

# Apply 8-bit quantization to layers that are sensitive
model.conv1.qconfig = qconfig_8bit
model.fc.qconfig = qconfig_8bit

# Apply 4-bit quantization to less sensitive layers
# Assuming the model has other layers like conv2, conv3
# model.conv2.qconfig = qconfig_4bit
# model.conv3.qconfig = qconfig_4bit

Step 3: Prepare the Model for QAT

# Prepare the model for quantization-aware training
torch.quantization.prepare_qat(model, inplace=True)

Step 4: Train the Model

  • Training proceeds as before, but with mixed precision configurations applied.

Step 5: Convert to Quantized Model

# Convert the model to quantized version
model.eval()
model_int8 = torch.quantization.convert(model)

Considerations:

  • Hardware Support: Ensure that the target deployment hardware supports the desired precisions.
  • Accuracy Testing: Validate the model’s performance, as lower bit-widths can introduce more quantization error.

8. Layer Fusion in Quantization

8.1 Fusing Different Layers

Layer Fusion combines multiple sequential operations into a single operation. This is particularly important for quantized models to improve efficiency and reduce quantization error.

Common Fusions:

  • Convolution + BatchNorm
  • Convolution + BatchNorm + ReLU
  • Linear + ReLU

Benefits:

  • Reduced Memory Accesses: Fewer separate operations mean less data movement.
  • Improved Performance: Single fused operation can be optimized better.
  • Better Quantization: Minimizes the impact of quantization between layers.

8.2 Impact on Model Performance

Mathematical Formulation:

  • Fusing Convolution and BatchNorm:

    Given a convolution output:

\[y = W * x + b\]

And batch normalization:

\[y_{\text{bn}} = \gamma \frac{y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

The fused weights and biases are:

\[W_{\text{fused}} = W \cdot \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\] \[b_{\text{fused}} = \left( b - \mu \right) \cdot \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} + \beta\]

Impact:

  • Efficiency: Reduces the number of operations during inference.
  • Quantization Error Reduction: By fusing layers before quantization, we avoid quantizing intermediate results, which can accumulate errors.
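As a concrete illustration of the folding formulas, here is a hedged sketch that folds a BatchNorm2d into a preceding Conv2d by rewriting the convolution's weights and bias; the layer shapes are assumptions, and the check at the end compares the folded layer against the original pair in eval mode. In practice, PyTorch's fuse_modules (shown next) performs this fusion for you.

import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # W_fused = W * gamma / sqrt(var + eps);  b_fused = (b - mu) * gamma / sqrt(var + eps) + beta
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias
    return fused

# Quick check (eval mode so BatchNorm uses its running statistics)
conv, bn = nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16)
conv.eval(); bn.eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))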

Implementation in PyTorch:

from torch.quantization import fuse_modules

# Fuse Conv, BatchNorm, and ReLU layers
fuse_modules(model, [['conv1', 'bn1', 'relu1']], inplace=True)

9. Batch Normalization in Quantization

9.1 Behavior During Training and Inference

Batch Normalization (BatchNorm) normalizes the input of a layer using the mean and variance calculated over the mini-batch.

During Training:

  • Uses Batch Statistics: Mean and variance are computed over the current batch.
  • Updates Running Estimates: Maintains running estimates of mean and variance for inference.

During Inference:

  • Uses Running Estimates: Fixed mean and variance computed during training.

Impact on Quantization:

  • Quantization Discrepancies: Quantizing BatchNorm layers can introduce errors due to the mismatch between training and inference behaviors.
  • Layer Fusion Importance: Fusing BatchNorm with preceding layers mitigates these issues.

9.2 Handling BatchNorm Issues

Strategies:

  • Fold BatchNorm into Previous Layers:

    • By folding, we eliminate the BatchNorm layer, incorporating its effect into the weights and biases of the preceding layer.
  • Disable BatchNorm during Quantization:

    • Freeze BatchNorm layers to use the running estimates and prevent further updates.
  • Use GroupNorm or LayerNorm:

    • As alternatives that do not depend on batch statistics.

Implementation Example:

# Before folding
conv_output = conv_layer(input)
bn_output = batchnorm_layer(conv_output)

# After folding
# Adjust weights and biases of conv_layer

9.3 Sync-BatchNorm

Sync-BatchNorm synchronizes the computation of batch statistics across multiple devices during distributed training.

Benefits:

  • Consistent Statistics: Ensures that all devices use the same mean and variance.
  • Improved Accuracy: Especially important when the batch size per device is small.

Implementation in PyTorch:

# Convert BatchNorm layers to SyncBatchNorm
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Considerations:

  • Distributed Training: Requires torch.distributed to be initialized.
  • Communication Overhead: May introduce additional communication between devices.

10. Common Issues and Debugging Techniques

10.1 Common Quantization Issues

  1. Significant Accuracy Drop:

    • Cause: Quantization introduces errors that the model cannot compensate for.
    • Symptoms: Model accuracy significantly lower than the floating-point version.
  2. Activation Clipping:

    • Cause: Activation ranges not properly calibrated, leading to clipping of values.
    • Symptoms: Degraded performance due to loss of information.
  3. Unsupported Operations:

    • Cause: Certain operations or layers are not compatible with quantization.
    • Symptoms: Errors during model conversion or inference.
  4. Mismatch Between Training and Inference:

    • Cause: Differences in BatchNorm behavior or other training-specific behaviors.
    • Symptoms: Model performs well during training but poorly during inference.

10.2 Debugging Strategies

1. Layer-wise Evaluation:

  • Approach: Evaluate the model layer by layer to identify where accuracy drops occur.
  • Tools: Use hooks or custom modules to inspect outputs.

2. Activation Range Analysis:

  • Approach: Examine the activation ranges and histograms to ensure proper quantization parameters.
  • Tools: Collect activation statistics during calibration.

3. Increase Bit-Width:

  • Approach: Temporarily increase bit-width for sensitive layers to see if accuracy improves.
  • Action: Adjust quantization configurations.

4. Modify Model Architecture:

  • Approach: Replace or modify layers that are not quantization-friendly.
  • Example: Replace unsupported operations with alternatives.

5. Retrain with QAT:

  • Approach: Use Quantization-Aware Training to allow the model to adapt to quantization errors.
  • Action: Incorporate fake quantization nodes during training.

6. Verify Calibration Data Quality:

  • Approach: Ensure that the calibration dataset is representative of the inference data.
  • Action: Use a diverse and sufficient calibration dataset.

7. Use Per-Channel Quantization:

  • Approach: Switch from per-tensor to per-channel quantization for weights.
  • Benefit: Reduces quantization error for convolutional layers.

8. Check for Numerical Stability:

  • Approach: Ensure that numerical computations are stable and not causing overflows or underflows.
  • Action: Adjust quantization ranges or use higher precision where necessary.

11. Quantizing Weights vs. Activations: Impact on Model Size Reduction

In the context of quantizing machine learning models, it’s important to understand how quantizing weights and activations individually contribute to reducing the overall model size.

11.1 Quantizing Weights

Weights are the learned parameters of a neural network that are stored persistently and define the model’s architecture and behavior. Quantizing weights directly reduces the storage size of the model for the following reasons:

  • Direct Impact on Model Storage:
    • Weights are saved as part of the model’s state dictionary (state_dict in PyTorch).
    • Quantizing weights from 32-bit floating-point numbers (float32) to lower-bit integers (e.g., 8-bit integers) reduces the size of each weight parameter by a factor proportional to the reduction in bit-width.
    • For example, moving from 32-bit to 8-bit representation reduces the storage requirement for weights by 75%.
  • Model Size Reduction Calculation:
    • Original Model Size: \(\text{Size}_{\text{float32}} = N_{\text{weights}} \times 32\) bits
    • Quantized Model Size: \(\text{Size}_{\text{int8}} = N_{\text{weights}} \times 8\) bits
    • Size Reduction Percentage: \(\left(1 - \frac{8}{32}\right) \times 100\% = 75\%\)
  • Example:
    • A model with 100 million parameters:
      • Original Size: \(100 \times 10^6 \times 32 \text{ bits} = 400 \text{ MB}\)
      • Quantized Size: \(100 \times 10^6 \times 8 \text{ bits} = 100 \text{ MB}\)
      • Reduction: 300 MB saved.
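To complement the back-of-the-envelope arithmetic, the sketch below compares on-disk sizes before and after quantizing the weights of a small example model with dynamic quantization; the layer sizes and temporary file names are assumptions.

import os
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model_int8 = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)

def size_on_disk_mb(model, path):
    # Save the state dict, measure the file size, then clean up
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return size_mb

print('fp32: %.2f MB' % size_on_disk_mb(model_fp32, 'tmp_fp32.pth'))
print('int8: %.2f MB' % size_on_disk_mb(model_int8, 'tmp_int8.pth'))   # roughly 4x smaller weights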

11.2 Quantizing Activations

Activations are the outputs of layers during the forward pass of the model. They are transient and not stored as part of the model’s parameters.

  • Memory Usage During Inference:
    • Quantizing activations reduces the memory footprint during inference since activations consume less memory.
    • This can lead to faster inference times due to reduced memory bandwidth requirements.
  • No Direct Impact on Model Storage:
    • Activations are not saved when the model is stored (e.g., when saving the model to disk).
    • Quantizing activations does not reduce the model file size.

Comparative Analysis

| Aspect | Quantizing Weights | Quantizing Activations |
| --- | --- | --- |
| Reduces model file size | Yes (direct impact) | No |
| Reduces in-memory size | Yes | Yes (during inference) |
| Improves inference speed | Yes | Yes |
| Reduces memory bandwidth | Yes | Yes (significant impact) |
| Contributes to latency reduction | Yes | Yes |
| Impact on training | Requires adjustment or retraining | Impacts inference only |

11.3 Summary

  • Model Size Reduction: Quantizing weights is the primary contributor to shrinking the model’s storage size. It directly reduces the size of the saved model on disk and the memory required to load the model parameters during inference.

  • Inference Efficiency: While quantizing activations does not affect the model’s storage size, it plays a crucial role in reducing the memory footprint and computational load during inference. This can lead to faster inference times and lower energy consumption, which is particularly important for deployment on edge devices.

Therefore, if the goal is to reduce the stored model size (e.g., for deployment where storage space is limited), quantizing weights has a more significant impact. Quantizing activations complements this by improving runtime efficiency but does not contribute to the reduction of the model’s file size.


11.4 Additional Considerations

  • Combined Quantization: For optimal performance, both weights and activations are typically quantized together. This not only reduces the model size but also maximizes inference speed and efficiency.

  • Model Accuracy: Quantizing weights and activations can affect model accuracy. Quantization-aware training (QAT) is recommended to mitigate accuracy loss, especially when both weights and activations are quantized to low bit-widths.

  • Hardware Support: The benefits of quantizing activations are fully realized when the deployment hardware supports lower-precision arithmetic operations.


11.5 Practical Example in PyTorch

Here’s how you might quantize only the weights of a model using PyTorch’s dynamic quantization, which is particularly useful for reducing model size:

import torch
import torch.nn as nn

# Assume we have a pre-trained floating-point model
model_fp32 = ...  # Your pre-trained model

# Apply dynamic quantization to weights (activations remain in float)
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,  # the model to quantize
    {nn.Linear},  # layers to quantize (e.g., nn.Linear layers)
    dtype=torch.qint8  # dtype for weights
)

# Save the quantized model
torch.save(model_int8.state_dict(), 'quantized_model.pth')
  • Explanation:
    • quantize_dynamic replaces specified layers (e.g., nn.Linear) with dynamically quantized versions where the weights are quantized to int8, but activations remain in floating-point.
    • This approach reduces the model size and can improve inference speed, especially for models with many linear layers like transformers.

11.6 Static vs. Dynamic Quantization

This section takes a closer look at the difference between static and dynamic quantization and when each approach makes sense.

As a recap, quantization reduces the precision of the numbers representing the model parameters, which can lead to smaller model sizes and faster inference times. The two main approaches for inference-time quantization are static and dynamic quantization, each with its own set of advantages and use cases.

Static Quantization

Static quantization converts the weights and activations of a neural network to lower precision (e.g., from 32-bit floating-point to 8-bit integers) ahead of inference, typically after training with the help of a calibration step. Here is a more detailed breakdown of static quantization:

  1. Calibration Phase
    • A calibration step is performed where the model runs on a representative dataset. This step is important as it helps to gather the distribution statistics of the activations, which are then used to determine the optimal scaling factors (quantization parameters) for each layer.
  2. Quantization Parameters
    • In this step, the model weights are quantized to a lower precision format (e.g., int8). The scale and zero-point for each layer are computed based on the calibration data and are fixed during inference.
  3. Inference
    • During inference, both the weights and activations are quantized to int8. Since the quantization parameters are fixed, the model uses these pre-determined scales and zero-points to perform fast, integer-only computations.
  4. Performance
    • Static quantization typically results in more efficient execution compared to dynamic quantization because all the computations can be done using integer arithmetic, which is faster on many hardware platforms. Its accuracy depends on how representative the calibration data is, since the quantization parameters are fixed after calibration.

Use Cases of Static Quantization

Static quantization is well-suited for scenarios where the input data distribution is known and can be captured accurately during the calibration phase. It’s commonly used in deploying models on edge devices where computational resources are limited.

Here’s a code sample demonstrating static quantization using PyTorch:

import torch
import torchvision
import torch.quantization as quant

# Load a pre-trained model
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Define the quantization configuration
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")

# Prepare the model for static quantization
model_prepared = torch.quantization.prepare(model, inplace=False)

# Calibrate the model with representative data
# Here we just run a few samples through the model
for _ in range(10):
    input_tensor = torch.randn(1, 3, 224, 224)
    model_prepared(input_tensor)

# Convert the model to quantized version
model_quantized = torch.quantization.convert(model_prepared, inplace=False)

# Save the quantized model
torch.save(model_quantized.state_dict(), "quantized_model.pth")

Dynamic Quantization

Dynamic quantization quantizes only the weights to a lower precision ahead of time and leaves the activations in floating-point, quantizing them on the fly during the model's runtime. Here is a deeper look at dynamic quantization:

  1. No Calibration Needed
    • Dynamic quantization does not require a separate calibration phase. The quantization parameters are determined on-the-fly during inference. This makes it more straightforward to apply since it eliminates the need for a representative calibration dataset.
  2. Quantization Parameters
    • Model weights are quantized to a lower-precision format (e.g., int8) before inference. During inference, activations are dynamically quantized, which means their scale and zero-point are computed for each batch or layer during execution.
  3. Inference
    • Weights are stored and computed in int8, but activations remain in floating-point until they are used in computations. This allows the model to adapt to the variability in input data at runtime by recalculating the quantization parameters dynamically.
  4. Performance
    • Dynamic quantization typically loses less accuracy than static quantization since it can adapt to changes in the input data distribution on the fly. However, it may not achieve the same level of inference speedup as static quantization because part of the computation still involves floating-point operations.

Use Cases: Dynamic quantization is particularly useful in scenarios where the input data distribution may vary and cannot be easily captured by a single representative dataset. It is often used in server-side deployments where computational resources are less constrained compared to edge devices. Here is sample code for dynamic quantization with PyTorch:

import torch
import torchvision

# Load a pre-trained model
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# Apply dynamic quantization
model_quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the quantized model
torch.save(model_quantized.state_dict(), 'dynamic_quantized_model.pth')

Static Quantization Workflow

  • Model Training: Train your model normally.
  • Calibration: Run the model on a representative dataset to determine quantization parameters.
  • Quantization: Convert model weights and activations to lower precision using fixed quantization parameters.
  • Inference: Perform fast, integer-only inference.

Dynamic Quantization Workflow

  • Model Training: Train your model normally.
  • Quantization: Convert model weights to lower precision.
  • Inference: Dynamically quantize activations during inference, allowing for adaptable performance based on input data.

Summary

Both static and dynamic quantization offer ways to reduce the model size and improve inference efficiency but cater to different use cases and trade-offs:

  • Static Quantization requires a calibration step, uses fixed quantization parameters, offers faster inference with purely integer arithmetic, and is ideal for scenarios with known and stable input data distributions.
  • Dynamic Quantization skips the calibration step, computes quantization parameters during inference, handles variability in the input data more gracefully, and is simpler to apply at the cost of somewhat lower inference efficiency.

Choosing between static and dynamic quantization depends on the specific requirements of the deployment environment, such as the stability of the input data distribution, the available computational resources, and the acceptable trade-offs between inference speed and model accuracy.

12. Conclusion

Quantization is a powerful technique for optimizing machine learning models for deployment on resource-constrained devices. By reducing the precision of weights and activations, we can achieve significant reductions in model size and computational requirements.

In this tutorial, we’ve explored:

  • The mathematical foundations of quantization, including the role of mean, variance, and histograms.
  • Techniques for quantizing weights and activations.
  • Implementation of Post-Training Quantization and Quantization-Aware Training in PyTorch.
  • Advanced topics like mixed precision quantization and layer fusion.
  • Handling BatchNorm layers and using Sync-BatchNorm in quantization.
  • Common issues encountered during quantization and strategies for debugging.
  • Why quantizing weights is more effective for reducing the model's storage size, while quantizing activations primarily benefits runtime memory usage and computational efficiency during inference.

By understanding these concepts and applying them effectively, you can optimize your models for efficient deployment without significant loss in performance.


13. References

Note: When implementing quantization in practice, always consider the specific requirements and constraints of your deployment environment. Test your quantized models thoroughly to ensure they meet the desired performance criteria.