Efficient ML Techniques in Transformers

Vision Transformers (ViTs) have become a popular choice for image recognition and related tasks, but they can be computationally expensive and memory-heavy. Below is a list of common (and often complementary) techniques to optimize Transformers—including ViTs—for more efficient training and inference. Alongside each category, I’ve mentioned some influential or representative papers.


1. Efficient Attention Mechanisms

Key Idea: Replace the standard O(N²) self-attention (where N is the number of tokens) with more efficient variants, typically by imposing low-rank structure, sparse or local attention patterns, or kernel/random-feature approximations.
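As a concrete illustration, here is a minimal single-head sketch of the Linformer-style idea in PyTorch: keys and values are projected from N tokens down to a fixed k, so the attention map is N×k instead of N×N. The class name LowRankSelfAttention, the single-head simplification, and the choice k=64 are illustrative assumptions, not any paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSelfAttention(nn.Module):
    """Single-head self-attention with a Linformer-style sequence projection.

    Keys and values are compressed from N tokens to k "landmark" tokens,
    so the attention matrix is N x k instead of N x N.
    """
    def __init__(self, dim, seq_len, k=64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Learned projections along the token (sequence) dimension: N -> k
        self.proj_k = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)
        self.proj_v = nn.Parameter(torch.randn(seq_len, k) / seq_len ** 0.5)

    def forward(self, x):                      # x: (batch, N, dim)
        q, k_, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Compress keys/values over the token dimension: (batch, k, dim)
        k_ = torch.einsum("bnd,nk->bkd", k_, self.proj_k)
        v = torch.einsum("bnd,nk->bkd", v, self.proj_v)
        attn = F.softmax(q @ k_.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, k)
        return attn @ v                        # (batch, N, dim)

x = torch.randn(2, 196, 256)                   # e.g., 14x14 ViT patch tokens
out = LowRankSelfAttention(dim=256, seq_len=196, k=64)(x)
print(out.shape)                               # torch.Size([2, 196, 256])
```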


2. Model Compression Techniques

2.1 Pruning

Key Idea: Remove weights or tokens deemed unimportant.

  • Structured Pruning (e.g., heads, entire tokens):
    Removing entire attention heads or tokens (in the case of Vision Transformers) that contribute less to the final prediction.

  • Movement Pruning:
    - Paper: Movement Pruning: Adaptive Sparsity by Fine-Tuning (Sanh et al., 2020)
    - Idea: Learns which weights to remove during fine-tuning, guided by the movement of weights during training.

  • Token Pruning / Early Exiting for ViT:
    Prune unimportant tokens dynamically or terminate computation early if predictions are sufficiently confident.
    - Example approach: Dynamic Token Pruning for Vision Transformers.
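As a rough illustration of the token-pruning idea, the sketch below keeps only the patch tokens that receive the most attention from the CLS token. The function name prune_tokens, the keep_ratio parameter, and the CLS-attention scoring rule are illustrative assumptions rather than a specific published method.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the patch tokens that receive the most attention from the CLS token.

    tokens:   (batch, 1 + N, dim)  -- CLS token followed by N patch tokens
    cls_attn: (batch, N)           -- attention from CLS to each patch token
    """
    batch, n_patches = cls_attn.shape
    n_keep = max(1, int(n_patches * keep_ratio))
    # Indices of the most-attended patch tokens
    keep_idx = cls_attn.topk(n_keep, dim=-1).indices            # (batch, n_keep)
    # Shift by 1 to skip the CLS token, then gather the surviving tokens
    keep_idx = keep_idx + 1
    gathered = torch.gather(
        tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    )
    return torch.cat([tokens[:, :1], gathered], dim=1)          # (batch, 1 + n_keep, dim)

tokens = torch.randn(2, 197, 256)             # CLS + 196 patch tokens
cls_attn = torch.rand(2, 196)                 # e.g., averaged over attention heads
print(prune_tokens(tokens, cls_attn).shape)   # torch.Size([2, 99, 256])
```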

2.2 Quantization

Key Idea: Use fewer bits (e.g., 8-bit or even lower precision) to represent weights and/or activations without significantly degrading accuracy.
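As a minimal example, PyTorch's post-training dynamic quantization can store the weights of Linear layers (which dominate Transformer parameter counts) in int8 while quantizing activations on the fly at inference time. The toy two-layer MLP below merely stands in for a trained model.

```python
import torch
import torch.nn as nn

# A stand-in for a Transformer block's MLP; in practice you would load a trained ViT.
model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).eval()

# Post-training dynamic quantization: nn.Linear weights are stored in int8,
# activations are quantized dynamically at inference time (CPU).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
print(quantized(x).shape)   # torch.Size([1, 256])
print(quantized)            # the nn.Linear layers are replaced by dynamically quantized versions
```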

2.3 Low-Rank Factorization

Key Idea: Approximate large weight matrices with products of lower-rank matrices.

  • Representative Work: Tensorizing Neural Networks (Novikov et al., 2015) shows how to reshape weights into tensors and factor them efficiently.
  • Vision Transformer context: Factoring projection matrices or MLP weights in Transformers can reduce parameters and computation.
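A minimal sketch of the idea, assuming a plain PyTorch Linear layer and a truncated SVD; the helper factorize_linear and the choice rank=64 are illustrative, and in practice the factored model is usually fine-tuned afterwards to recover accuracy.

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a Linear layer with two smaller ones via truncated SVD.

    W (out x in) is approximated as (U_r * S_r) @ V_r^T, giving
    in*rank + rank*out parameters instead of in*out.
    """
    W = linear.weight.data                                 # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = Vh[:rank]                          # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank]            # (out_features, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return nn.Sequential(first, second)

dense = nn.Linear(1024, 1024)
factored = factorize_linear(dense, rank=64)
x = torch.randn(1, 1024)
print((dense(x) - factored(x)).abs().max())    # approximation error from truncating the SVD
```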

2.4 Knowledge Distillation

Key Idea: Train a smaller “student” model (or a same-sized model with a more efficient architecture) to match the outputs of a larger “teacher” model.
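A minimal sketch of the classic Hinton-style distillation loss, blending a temperature-softened KL term against the teacher with the usual hard-label cross-entropy; the temperature T=4 and weight alpha=0.5 are illustrative defaults, not values from any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: soft loss against the teacher's temperature-softened
    outputs, blended with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients, as in Hinton et al. (2015)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```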


3. Parameter-Efficient Fine-Tuning

Key Idea: Instead of fine-tuning all parameters, update (or add) only a small subset of them. This is especially relevant when working with large-scale ViTs.

  • Adapters / LoRA (Low-Rank Adaptation):
    - Paper: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
    - Idea: Insert small trainable low-rank matrices into Transformer layers to handle new tasks, reducing the fine-tuning overhead (see the sketch after this list).

  • Prefix Tuning / Prompt Tuning (originating in NLP; these can also be adapted for ViTs).
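Below is a minimal sketch of the LoRA idea referenced above: the pretrained projection is frozen and only a low-rank update B·A is trained. The class name LoRALinear and the hyperparameters r=8, alpha=16 are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained Linear plus a trainable low-rank update B @ A (rank r)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Output = frozen projection + scaled low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288 trainable parameters vs. ~590k frozen
```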


4. Mixture-of-Experts (MoE)

Key Idea: Scale model capacity by having multiple “expert” layers, but activate only a subset for each input, reducing total computation.
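A minimal sketch of Switch-style top-1 routing, where each token is processed by exactly one expert MLP; the class name Top1MoE is illustrative, and the sketch omits the load-balancing losses and capacity limits that practical MoE layers need.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Switch-style layer: a router sends each token to exactly one expert MLP,
    so compute per token stays roughly constant while capacity grows with num_experts."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1) # (num_tokens, num_experts)
        weight, expert_idx = gate.max(dim=-1)    # top-1 routing decision per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Only the tokens routed to expert i pass through it
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(196, 256)
print(Top1MoE(dim=256, hidden=1024)(tokens).shape)   # torch.Size([196, 256])
```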


5. Architectural Re-design

Sometimes, simply rethinking the architecture yields a more efficient design: windowed or hierarchical attention (e.g., Swin Transformer), hybrid CNN + Transformer backbones, and pyramidal multi-stage ViTs all trade the global O(N²) attention of a vanilla ViT for cheaper, more structured computation.
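As a small illustration of the locality idea behind windowed attention, the sketch below partitions a feature map into non-overlapping windows so self-attention only needs to be computed within each 7×7 window; shifted windows, relative position biases, and patch merging (as in Swin) are deliberately omitted.

```python
import torch

def window_partition(x, window=7):
    """Split a (batch, H, W, dim) feature map into non-overlapping windows so
    self-attention can be computed within each window (Swin-style locality)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return windows   # (B * num_windows, window*window, C)

feat = torch.randn(2, 14, 14, 256)    # 14x14 ViT feature map
print(window_partition(feat).shape)   # torch.Size([8, 49, 256]); attention is O(49^2) per window
```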


6. Dynamic Computation / Early Exiting

Key Idea: Not all inputs require the same amount of computation. For some inputs, we can exit early or skip certain layers/tokens.
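A minimal sketch of confidence-based early exiting, assuming one lightweight classifier head per Transformer block and a hypothetical confidence threshold; the class name EarlyExitEncoder is illustrative, and real methods typically train the intermediate heads jointly rather than bolting them on at inference time.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Run Transformer blocks sequentially and stop once an intermediate
    classifier head is confident enough (threshold-based early exit)."""
    def __init__(self, blocks, dim, num_classes, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # One lightweight classifier head per block
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in blocks)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):                                 # x: (1, N, dim), batch of one for simplicity
        for depth, (block, head) in enumerate(zip(self.blocks, self.heads)):
            x = block(x)
            probs = head(x.mean(dim=1)).softmax(dim=-1)   # pool tokens, then classify
            if probs.max() >= self.threshold:             # confident enough -> exit early
                return probs, depth + 1
        return probs, len(self.blocks)                    # fell through all blocks

blocks = [nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True) for _ in range(12)]
model = EarlyExitEncoder(blocks, dim=256, num_classes=10).eval()
probs, depth_used = model(torch.randn(1, 196, 256))
print(depth_used)     # with random weights this usually runs all 12 blocks
```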


Summary of Techniques & Seminal References

  1. Efficient Attention
    • Linformer (Wang et al., 2020)
    • Performer (Choromanski et al., 2021)
    • Reformer (Kitaev et al., 2020)
    • Big Bird (Zaheer et al., 2020)
    • Axial Attention (Ho et al., 2019)
  2. Model Compression
    • Pruning: Movement Pruning (Sanh et al., 2020)
    • Quantization: 8-bit Transformer quantization (e.g., Q8BERT, Zafrir et al., 2019) and subsequent follow-ups
    • Low-Rank Factorization: Tensorizing Neural Networks (Novikov et al., 2015)
    • Distillation: DeiT (Touvron et al., 2021), Hinton et al. (2015)
  3. Parameter-Efficient Fine-Tuning
    • LoRA (Hu et al., 2021), Adapter-BERT approaches
  4. Mixture-of-Experts
    • Switch Transformer (Fedus et al., 2021)
  5. Architectural Tweaks
    • Swin Transformer (Liu et al., 2021)
    • Hybrid CNN + Transformer
    • Pyramid ViTs
  6. Dynamic Computation / Early Exiting
    • LayerDrop (Fan et al., 2019)
    • Dynamic Token Pruning (various recent works)

Final Note

In practice, combining multiple of these strategies often yields the best tradeoff between accuracy and efficiency. For Vision Transformers, a common recipe might be:

  1. Use an efficient attention scheme (like local/windowed attention).
  2. Add architectural innovations (pyramidal design, patch merging).
  3. Apply knowledge distillation for further accuracy boosts with fewer parameters.
  4. Optionally prune or quantize the final model for edge or latency-sensitive deployments.

Each of these categories has a rich body of research. If you’re aiming to build an efficient ViT from scratch, you could start with a well-known efficient ViT backbone (e.g., Swin Transformer or MobileViT) and then apply compression or distillation on top.