Efficient ML Techniques in Transformers
Vision Transformers (ViTs) have become a popular choice for image recognition and related tasks, but they can be computationally expensive and memory-heavy. Below is a list of common (and often complementary) techniques to optimize Transformers—including ViTs—for more efficient training and inference. Alongside each category, I’ve mentioned some influential or representative papers.
1. Efficient Attention Mechanisms
Key Idea: Replace the standard O(N²) self-attention (where N is the number of tokens) with more efficient variants, typically by imposing low-rank structure, using kernel approximations, or applying random feature mappings.
- Linformer
  - Paper: Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)
  - Idea: Projects the sequence-length dimension to a lower dimension, reducing the complexity from O(N²) to O(N) (see the sketch after this list).
- Performer
  - Paper: Rethinking Attention with Performers (Choromanski et al., 2021)
  - Idea: Uses random feature maps to approximate the softmax attention, enabling linear-time attention.
- Reformer
  - Paper: Reformer: The Efficient Transformer (Kitaev et al., 2020)
  - Idea: Uses locality-sensitive hashing (LSH) to reduce the complexity of attention, and reversible residual layers to save memory.
- Big Bird
  - Paper: Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
  - Idea: Combines random, global, and local attention to handle very long sequences efficiently.
- Axial Attention
  - Paper: Axial Attention in Multidimensional Transformers (Ho et al., 2019)
  - Idea: Uses combinations of 1D attentions along each dimension instead of one large 2D attention.
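To make the first idea concrete, here is a minimal PyTorch sketch of Linformer-style attention, where learned projections compress the key/value sequence from length N to a fixed k. The module name, initialization, and head handling are simplifying assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Linformer-style attention: compress keys/values from length N to k << N,
    so the attention map is N x k instead of N x N (minimal illustrative sketch)."""
    def __init__(self, dim, seq_len, k=64, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learned projections that compress the sequence (token) dimension.
        self.proj_k = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)
        self.proj_v = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, N, dim), N == seq_len
        b, n, d, h = *x.shape, self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Compress keys/values along the sequence dimension: (b, N, d) -> (b, k, d).
        k = torch.einsum('bnd,nk->bkd', k, self.proj_k)
        v = torch.einsum('bnd,nk->bkd', v, self.proj_v)
        # Split heads.
        q = q.view(b, n, h, d // h).transpose(1, 2)            # (b, h, N, d/h)
        k = k.view(b, -1, h, d // h).transpose(1, 2)           # (b, h, k, d/h)
        v = v.view(b, -1, h, d // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (b, h, N, k)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

The cost of the attention map is now linear in N for a fixed k, at the price of tying the module to a fixed sequence length.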
2. Model Compression Techniques
2.1 Pruning
Key Idea: Remove weights or tokens deemed unimportant.
- Structured Pruning (e.g., heads, entire tokens): Removing entire attention heads or tokens (in the case of Vision Transformers) that contribute less to the final prediction.
- Movement Pruning
  - Paper: Movement Pruning: Adaptive Sparsity by Fine-Tuning (Sanh et al., 2020)
  - Idea: Learns which weights to remove during fine-tuning, guided by the movement of weights during training.
- Token Pruning / Early Exiting for ViT: Prune unimportant tokens dynamically or terminate computation early if predictions are sufficiently confident (see the sketch after this list).
  - Example approach: Dynamic Token Pruning for Vision Transformers.
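Below is a minimal sketch of attention-score-based token pruning for a ViT. The helper name `prune_tokens` and the use of raw [CLS] attention as the importance score are illustrative assumptions; published token-pruning methods differ in how they score tokens and schedule the pruning.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the most important patch tokens, scored by the [CLS] token's attention.
    tokens:   (batch, 1 + N, dim)  -- [CLS] token followed by N patch tokens
    cls_attn: (batch, N)           -- attention weights from [CLS] to each patch
    (hypothetical helper for illustration)"""
    b, n_plus_1, d = tokens.shape
    n_keep = max(1, int((n_plus_1 - 1) * keep_ratio))
    # Indices of the top-k patches per image, by [CLS] attention score.
    idx = cls_attn.topk(n_keep, dim=1).indices              # (b, n_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, d)                # (b, n_keep, dim)
    patches = tokens[:, 1:].gather(1, idx)                   # keep the top patches
    return torch.cat([tokens[:, :1], patches], dim=1)        # re-attach [CLS]

# Example: drop half of the 196 patch tokens of a ViT-Base-sized sequence.
x = torch.randn(2, 197, 768)
scores = torch.rand(2, 196)
print(prune_tokens(x, scores).shape)   # torch.Size([2, 99, 768])
```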
2.2 Quantization
Key Idea: Use fewer bits (e.g., 8-bit or even lower precision) to represent weights and/or activations without significantly degrading accuracy.
- Related Early Work: Model Compression via Distillation and Quantization (Polino et al., 2018) combined quantization with the distillation framework introduced by Hinton et al. (2015).
- Applied to Transformers: 8-bit (and lower) quantization of Transformer weights and activations, e.g., Q8BERT: Quantized 8Bit BERT (Zafrir et al., 2019), among others.
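As a quick example, post-training dynamic quantization of the Linear layers (which hold most ViT parameters) takes only a few lines in PyTorch. This is a hedged sketch: the timm backbone is used purely as an example model, and dynamic quantization stores weights in int8 while activation scales are computed on the fly.

```python
import torch
import torch.nn as nn
import timm  # assumption: a timm-provided ViT is used only as an example backbone

# Post-training dynamic quantization of all nn.Linear modules to int8.
# Static quantization or quantization-aware training would need calibration data.
model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # nn.Linear modules are replaced by DynamicQuantizedLinear
```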
2.3 Low-Rank Factorization
Key Idea: Approximate large weight matrices with products of lower-rank matrices.
- Representative Work: Tensorizing Neural Networks (Novikov et al., 2015) shows how to reshape weights into tensors and factor them efficiently.
- Vision Transformer context: Factoring projection matrices or MLP weights in Transformers can reduce parameters and computation.
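A minimal sketch of this idea for a single layer: factor a trained nn.Linear into two smaller layers via truncated SVD. The helper `factorize_linear` is a hypothetical utility, and in practice the factors are usually fine-tuned afterwards to recover accuracy.

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a Linear layer with two smaller ones via a truncated SVD of its
    weight matrix (illustrative sketch; factors are typically fine-tuned)."""
    W = linear.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out, rank)
    V_r = Vh[:rank, :]                           # (rank, in)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)

# A 768 -> 3072 MLP projection at rank 128: ~2.36M weights become ~0.49M.
mlp_fc = nn.Linear(768, 3072)
print(factorize_linear(mlp_fc, rank=128))
```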
2.4 Knowledge Distillation
Key Idea: Train a smaller “student” model (or same-sized model but more efficient architecture) to match outputs of a larger “teacher” model.
- DeiT
  - Paper: Training Data-Efficient Image Transformers & Distillation through Attention (Touvron et al., 2021)
  - Idea: Shows that Vision Transformers can be trained effectively with less data when distilled from a CNN or a larger Transformer teacher (see the loss sketch after this list).
- TinyBERT, MobileBERT, etc. (more general, from NLP, but the idea is the same)
  - Papers: TinyBERT, MobileBERT.
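For reference, the classic soft-label distillation loss (Hinton et al., 2015) fits in a few lines of PyTorch. DeiT itself adds a dedicated distillation token and a hard-label variant, so treat this as a generic sketch rather than the DeiT recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual cross-entropy with a KL term between
    temperature-softened student and teacher distributions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients to match the CE term
    return alpha * ce + (1 - alpha) * kl
```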
3. Parameter Efficient Fine-Tuning
Key Idea: Instead of fine-tuning all parameters, only update (or add) a small subset of the parameters. This is especially relevant when you deal with large-scale ViTs.
- Adapters / LoRA (Low-Rank Adaptation)
  - Paper: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
  - Idea: Insert small trainable low-rank matrices into Transformer layers to handle new tasks, reducing the fine-tuning overhead (see the sketch after this list).
- Prefix Tuning / Prompt Tuning (originating in NLP, but also adaptable to ViTs).
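A minimal sketch of the LoRA idea mentioned above: wrap a frozen linear projection with a trainable low-rank update. The class name and initialization constants are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen Linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x  (minimal sketch of the LoRA idea)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # start at zero delta
        self.scaling = alpha / r

    def forward(self, x):
        # Low-rank path adds a delta on top of the frozen projection.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Only about (in + out) * r parameters per adapted layer are trained.
layer = LoRALinear(nn.Linear(768, 768), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 12288
```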
4. Mixture-of-Experts (MoE)
Key Idea: Scale model capacity by having multiple “expert” layers, but activate only a subset for each input, reducing total computation.
- Representative Work: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., 2021)
- Though MoE approaches are most common in large-scale language models, they have also been applied to ViTs (e.g., V-MoE, Riquelme et al., 2021).
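A toy top-1-routing MoE layer in PyTorch, to make the idea concrete; real Switch-style implementations add load-balancing losses, capacity limits, and expert parallelism, none of which are shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Top-1 routing over expert MLPs: each token is processed by a single
    expert chosen by a learned gate (minimal illustrative sketch)."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)      # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():                      # only routed tokens touch expert e
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Total parameter count grows with the number of experts, but each token only pays the compute cost of one expert MLP.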
5. Architectural Re-design
Sometimes, simply rethinking the architecture yields a more efficient design:
- Hybrid CNN-Transformer Architectures: Use convolutions in the early stages for low-level feature extraction, then apply Transformers on higher-level tokens (reducing the total sequence length).
  - Paper Example: LocalViT: Bringing Locality to Vision Transformers
- Pyramid Transformers / Swin Transformers
  - Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Liu et al., 2021)
  - Idea: Reduce sequence length by using patch hierarchies and windowed self-attention.
- Patch Merging / Pooling: Combine patches progressively so that later layers have fewer tokens (see the sketch after this list).
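The patch-merging step mentioned above can be sketched as follows, modeled on Swin-style 2x2 merging; the exact projection sizes and normalization placement vary between implementations.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Group each 2x2 neighborhood of tokens, concatenate their features, and
    project 4*dim -> 2*dim, so the token count drops 4x (minimal sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                 # x: (batch, H*W, dim)
        b, _, c = x.shape
        x = x.view(b, H, W, c)
        # Gather the four tokens of every 2x2 window and stack along channels.
        x = torch.cat(
            [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
            dim=-1,
        )                                       # (b, H/2, W/2, 4*dim)
        x = x.view(b, -1, 4 * c)
        return self.reduction(self.norm(x))     # (b, H*W/4, 2*dim)

# 56x56 tokens at dim 96 become 28x28 tokens at dim 192.
print(PatchMerging(96)(torch.randn(1, 56 * 56, 96), 56, 56).shape)
```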
6. Dynamic Computation / Early Exiting
Key Idea: Not all inputs require the same amount of computation. For some inputs, we can exit early or skip certain layers/tokens.
- Representative Idea (in NLP): LayerDrop, introduced in Reducing Transformer Depth on Demand with Structured Dropout (Fan et al., 2019).
- Applied to Vision: Dynamic token pruning / partial inference once the model is sufficiently confident.
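A minimal sketch of confidence-based early exiting for a ViT-style encoder; the class, the single intermediate exit head, and the batch-size-1 inference assumption are all simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitEncoder(nn.Module):
    """Encoder with one intermediate classifier head: if the early prediction is
    confident enough, the remaining blocks are skipped (illustrative sketch)."""
    def __init__(self, blocks: nn.ModuleList, exit_head: nn.Linear,
                 final_head: nn.Linear, exit_after: int, threshold: float = 0.9):
        super().__init__()
        self.blocks, self.exit_head, self.final_head = blocks, exit_head, final_head
        self.exit_after, self.threshold = exit_after, threshold

    @torch.no_grad()
    def forward(self, x):                         # x: (1, N, dim)
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i + 1 == self.exit_after:
                logits = self.exit_head(x[:, 0])  # classify from the [CLS] token
                if F.softmax(logits, dim=-1).max() >= self.threshold:
                    return logits                 # confident: exit early
        return self.final_head(x[:, 0])           # otherwise run the full depth
```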
Summary of Techniques & Seminal References
- Efficient Attention
  - Linformer (Wang et al., 2020)
  - Performer (Choromanski et al., 2021)
  - Reformer (Kitaev et al., 2020)
  - Big Bird (Zaheer et al., 2020)
  - Axial Attention (Ho et al., 2019)
- Model Compression
  - Pruning: Movement Pruning (Sanh et al., 2020)
  - Quantization: Model Compression via Distillation and Quantization (Polino et al., 2018) and follow-ups such as Q8BERT (Zafrir et al., 2019)
  - Low-Rank Factorization: Tensorizing Neural Networks (Novikov et al., 2015)
  - Distillation: DeiT (Touvron et al., 2021), Hinton et al. (2015)
- Parameter-Efficient Fine-Tuning
  - LoRA (Hu et al., 2021), Adapter-BERT approaches
- Mixture-of-Experts
  - Switch Transformer (Fedus et al., 2021)
- Architectural Tweaks
  - Swin Transformer (Liu et al., 2021)
  - Hybrid CNN + Transformer
  - Pyramid ViTs
- Dynamic Computation / Early Exiting
  - LayerDrop (Fan et al., 2019)
  - Dynamic Token Pruning (various recent works)
Final Note
In practice, combining several of these strategies often yields the best trade-off between accuracy and efficiency. For Vision Transformers, a common recipe might be:
- Use an efficient attention scheme (like local/windowed attention).
- Add architectural innovations (pyramidal design, patch merging).
- Apply knowledge distillation for further accuracy boosts with fewer parameters.
- Optionally prune or quantize the final model for edge or latency-sensitive deployments.
Each of these categories has a rich body of research. If you’re aiming to build an efficient ViT from scratch, you could start with a well-known efficient ViT backbone (e.g., Swin Transformer, MobileViT, etc.) and then apply compression or distillation on top.