Mask R-CNN: Extended Technical Deep Dive Tutorial (Fully Corrected)
Tutorial Objectives
This tutorial provides an extensive understanding of the Mask R-CNN architecture by dissecting every component of its pipeline. You will:
- See annotated PyTorch code for every network block (Backbone, FPN, RPN, RoIAlign, Detection & Mask Heads).
- Understand how each tensor transforms through the network, with precise shape annotations.
- Learn about anchor generation, matching, and the similarities/differences between SSD and Mask R-CNN.
- Deep dive into loss function math and logic, especially focusing on segmentation loss choices.
- Get visual and conceptual clarity about how the model is trained, evaluated, and deployed.
1. Backbone: ResNet-50/101
The backbone of Mask R-CNN is a deep residual network, most commonly ResNet-50 or ResNet-101. It acts as a powerful feature extractor. ResNet is composed of stacked residual blocks that mitigate vanishing gradients in very deep networks.
We take intermediate outputs after certain layers to form a feature hierarchy:
- C2 from layer1 (stride = 4)
- C3 from layer2 (stride = 8)
- C4 from layer3 (stride = 16)
- C5 from layer4 (stride = 32)
These are fed into the FPN.
from torchvision.models import resnet50
import torch.nn as nn

class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(pretrained=True)
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # (B, 64, H/4, W/4)
        self.stage2 = net.layer1  # (B, 256, H/4, W/4)
        self.stage3 = net.layer2  # (B, 512, H/8, W/8)
        self.stage4 = net.layer3  # (B, 1024, H/16, W/16)
        self.stage5 = net.layer4  # (B, 2048, H/32, W/32)

    def forward(self, x):
        x = self.stage1(x)
        c2 = self.stage2(x)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c2, c3, c4, c5
These features gain semantic depth as spatial resolution decreases. They are passed to the Feature Pyramid Network.
2. FPN: Feature Pyramid Network
FPN is used to construct a pyramid of features with strong semantics at all levels. It helps with detecting objects of different sizes.
Key Properties:
- Top-down pathway with upsampling
- Lateral connections to maintain spatial structure
- Produces P2, P3, P4, and P5, each with 256 channels
class FPN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lat5 = nn.Conv2d(2048, 256, 1)
        self.lat4 = nn.Conv2d(1024, 256, 1)
        self.lat3 = nn.Conv2d(512, 256, 1)
        self.lat2 = nn.Conv2d(256, 256, 1)
        self.smooth4 = nn.Conv2d(256, 256, 3, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, 3, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, 3, padding=1)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + nn.functional.interpolate(p5, scale_factor=2)
        p3 = self.lat3(c3) + nn.functional.interpolate(p4, scale_factor=2)
        p2 = self.lat2(c2) + nn.functional.interpolate(p3, scale_factor=2)
        # 3x3 smoothing convs reduce upsampling aliasing on the merged maps
        return [self.smooth2(p2), self.smooth3(p3), self.smooth4(p4), p5]
Output Shapes:
Assuming input = (B, 3, 800, 800):
- P2 = (B, 256, 200, 200)
- P3 = (B, 256, 100, 100)
- P4 = (B, 256, 50, 50)
- P5 = (B, 256, 25, 25)
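A quick sanity check of these shapes (a minimal sketch, assuming the ResNetBackbone and FPN classes defined above are in scope):

import torch

backbone, fpn = ResNetBackbone().eval(), FPN().eval()
with torch.no_grad():
    feats = fpn(*backbone(torch.randn(1, 3, 800, 800)))
for level, p in zip([2, 3, 4, 5], feats):
    print(f"P{level}: {tuple(p.shape)}")  # P2: (1, 256, 200, 200) ... P5: (1, 256, 25, 25)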
3. Region Proposal Network (RPN)
The RPN is a fully convolutional network that proposes candidate object bounding boxes.
Architecture:
- Shared 3×3 conv → ReLU
- Branch 1: 1×1 conv to predict objectness scores
- Branch 2: 1×1 conv to predict bounding box regression deltas
class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.cls_logits = nn.Conv2d(256, num_anchors * 2, 1)
        self.bbox_pred = nn.Conv2d(256, num_anchors * 4, 1)

    def forward(self, feats):
        cls_logits, bbox_preds = [], []
        for feat in feats:
            t = nn.functional.relu(self.shared(feat))
            cls_logits.append(self.cls_logits(t))
            bbox_preds.append(self.bbox_pred(t))
        return cls_logits, bbox_preds
Anchor Matching and Labeling:
- Anchor boxes (default boxes) are created for each pixel location (see the sketch below).
- Positive: IoU ≥ 0.7 with a GT box
- Negative: IoU ≤ 0.3 with all GT boxes
- Ignore: otherwise
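A minimal sketch of anchor generation and IoU-based labeling. generate_anchors and label_anchors are hypothetical helpers (not the torchvision implementation); boxes are assumed to be (x1, y1, x2, y2), and the ratio is taken as w/h:

import torch
from torchvision.ops import box_iou

def generate_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride  # anchor center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)    # area s^2, ratio r = w/h
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)  # (H*W*9, 4)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    iou = box_iou(anchors, gt_boxes)                            # (A, G)
    max_iou, matched_gt = iou.max(dim=1)                        # best GT per anchor
    labels = torch.full((len(anchors),), -1, dtype=torch.long)  # -1 = ignore
    labels[max_iou <= neg_thr] = 0                              # negative
    labels[max_iou >= pos_thr] = 1                              # positive
    labels[iou.argmax(dim=0)] = 1                               # each GT's best anchor is positive too
    return labels, matched_gt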
Similarity to SSD:
Both use anchors, but:
- SSD has multi-class classification per anchor
- RPN only classifies object vs background
Loss:
\[L_{\text{RPN}} = \frac{1}{N_{cls}} \sum_i \text{BCE}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \, \text{SmoothL1}(t_i, t_i^*)\]

Hard Negative Mining:
SSD selects the highest-loss negatives to maintain a fixed ratio (e.g., 1:3 pos:neg). The RPN instead randomly samples a fixed mini-batch of anchors (e.g., 256 per image at up to 1:1 pos:neg).
4. RoIAlign
RoIAlign improves over RoIPool by removing quantization. It precisely extracts fixed-size (7×7) feature maps from arbitrary input regions.
Key Idea:
- Divide the RoI into bins (7×7)
- For each bin, sample at 4 regularly spaced positions
- Apply bilinear interpolation on the feature map
Output shape: (256, 7, 7) per RoI, stacked into (N, 256, 7, 7) across all RoIs.
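In practice you rarely implement this by hand; torchvision ships the op. A minimal sketch (the box is given in input-image coordinates, so spatial_scale is 1/stride of the chosen FPN level):

import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 200, 200)                  # P2 feature map, stride 4
rois = torch.tensor([[0, 32.0, 48.0, 160.0, 240.0]])  # (batch_idx, x1, y1, x2, y2)
pooled = roi_align(feat, rois, output_size=(7, 7),
                   spatial_scale=1.0 / 4, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])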
5. Heads
Box Head (Classification + Regression)
class BoxHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(256 * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls = nn.Linear(1024, 81)       # 80 COCO classes + background
        self.bbox = nn.Linear(1024, 81 * 4)  # per-class box deltas

    def forward(self, x):
        x = x.flatten(1)
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        return self.cls(x), self.bbox(x)
Mask Head (Per-class Binary Masks)
class MaskHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
        )
        # doubles spatial size (14x14 -> 28x28 when the mask branch uses 14x14 RoIAlign)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask = nn.Conv2d(256, 81, 1)  # K = 81

    def forward(self, x):
        x = self.convs(x)
        x = nn.functional.relu(self.upsample(x))
        return self.mask(x)  # (B, 81, 28, 28)
Let's walk through the complete flow from the RPN to the final detection and classification heads in Mask R-CNN, step by step, with mathematical formulations and label-generation logic.
Flow: From RPN → RoIAlign → Detection Head
1. RPN (Region Proposal Network)
Input
- Multi-scale FPN feature maps: [P2, P3, P4, P5] (shape: B×256×H×W per level)
Anchor Generation
- At each location in each FPN level, generate k anchors (e.g. 3 scales × 3 aspect ratios = k = 9).
- Total number of anchors: $A = \sum_l H_l \cdot W_l \cdot k$
Predictions
For each anchor $a_i$, the RPN outputs:
- Objectness score $p_i \in [0,1]$
- Box deltas \(t_i = (\hat{t}_{x_i}, \hat{t}_{y_i}, \hat{t}_{w_i}, \hat{t}_{h_i})\)
Training Labels for RPN
Each anchor $a_i$ is:
- Assigned a label $p_i^* \in \{0, 1, -1\}$
  - 1 → positive (IoU ≥ 0.7 with a GT box)
  - 0 → negative (IoU ≤ 0.3 with all GT boxes)
  - -1 → ignore (between thresholds)
- Assigned a GT box $g_i = (x_{gt}, y_{gt}, w_{gt}, h_{gt})$
- Used to compute regression target deltas, with $a_i = (x_a, y_a, w_a, h_a)$:

\[t_{x_i}^* = \frac{x_{gt} - x_a}{w_a},\quad t_{y_i}^* = \frac{y_{gt} - y_a}{h_a},\quad t_{w_i}^* = \log\left(\frac{w_{gt}}{w_a}\right),\quad t_{h_i}^* = \log\left(\frac{h_{gt}}{h_a}\right)\]
RPN Loss
Let $N_{cls}$ and $N_{reg}$ be the number of samples:
\[L_{RPN} = \frac{1}{N_{cls}} \sum_i \text{BCE}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \cdot \text{SmoothL1}(t_i, t_i^*)\]

Only positive anchors contribute to the regression loss.
2. RoI Generation (Region Proposals)
After RPN:
- Apply the predicted deltas to the anchors to get proposal boxes: $x = x_a + \hat{t}_x w_a,\; y = y_a + \hat{t}_y h_a,\; w = w_a e^{\hat{t}_w},\; h = h_a e^{\hat{t}_h}$ (see the sketch after this list)
- Apply NMS (e.g. IoU threshold 0.7)
- Keep top-N proposals (e.g. 1000 during training, 300 during test)
- These boxes are the RoIs
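A minimal sketch of this proposal step for one feature level, using the decode formula above and torchvision's nms. decode_and_select is a hypothetical helper, not the reference implementation:

import torch
from torchvision.ops import nms

def decode_and_select(anchors, deltas, scores, iou_thr=0.7, top_n=1000):
    # anchors: (A, 4) as (x1, y1, x2, y2); deltas: (A, 4); scores: (A,)
    wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
    x = xa + deltas[:, 0] * wa
    y = ya + deltas[:, 1] * ha
    w = wa * deltas[:, 2].exp()
    h = ha * deltas[:, 3].exp()
    boxes = torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)
    keep = nms(boxes, scores, iou_thr)[:top_n]  # NMS returns indices sorted by score
    return boxes[keep], scores[keep]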
3. RoIAlign
For each RoI:
- Map the proposal box to its corresponding feature map level $P_l$, usually via
\[l = \lfloor 4 + \log_2(\sqrt{wh} / 224) \rfloor\]
- Crop that level's feature map using bilinear interpolation into a fixed-size grid

Each RoI is pooled from the feature maps into a 256 × 7 × 7 feature map; with e.g. 1000 RoIs per image, this is a heavily batched operation. A small sketch of the level-assignment rule follows below.
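The level-assignment rule as code (a small sketch of the formula above, clamped to the available levels P2 to P5):

import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)  # clamp to P2..P5

print(fpn_level(224, 224))  # 4 -> P4
print(fpn_level(112, 112))  # 3 -> P3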
Output:
Tensor of shape (N, 256, 7, 7), where N is the total number of RoIs across all images in the batch
4. Detection Head (Classification + BBox Regression)
Inputs:
- RoI-aligned features: $R_i \in \mathbb{R}^{256 \times 7 \times 7}$
Network:
- Two FC layers → feature vector $f_i \in \mathbb{R}^{1024}$
- Heads:
  - Classification: $\hat{p}_i \in \mathbb{R}^{K+1}$ → softmax over classes
  - Regression: $\hat{t}_i \in \mathbb{R}^{(K+1) \times 4}$ → per-class bbox deltas
5. Training Labels for Detection Head
For each RoI $r_i$:
1. Match to GT box using IoU
   - If IoU ≥ 0.5 → positive
     - Assign class label $c_i \in [1, K]$
     - Assign matched GT box $g_i$
   - If IoU < 0.5 → background
     - Assign label $c_i = 0$
     - No regression target
2. Regression Target Deltas
For positive RoIs $r_i = (x_r, y_r, w_r, h_r)$:
\(t_{xi}^* = \frac{x_{gt} - x_r}{w_r},\quad t_{yi}^* = \frac{y_{gt} - y_r}{h_r}, \quad t_{wi}^* = \log\left(\frac{w_{gt}}{w_r}\right),\quad t_{hi}^* = \log\left(\frac{h_{gt}}{h_r}\right)\)

Note: these are class-specific; only the GT class channel is trained. (A small encoding sketch follows.)
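A sketch of this target encoding, assuming boxes in (cx, cy, w, h) form; encode_deltas is a hypothetical helper:

import torch

def encode_deltas(rois, gts):
    # rois, gts: (N, 4) as (cx, cy, w, h)
    tx = (gts[:, 0] - rois[:, 0]) / rois[:, 2]
    ty = (gts[:, 1] - rois[:, 1]) / rois[:, 3]
    tw = torch.log(gts[:, 2] / rois[:, 2])
    th = torch.log(gts[:, 3] / rois[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)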
6. Detection Loss
Let:
- $\hat{p}_i$: predicted class scores
- $c_i$: GT class
- $\hat{t}_{c_i}$: predicted deltas for class $c_i$
- $t_i^*$: GT deltas for RoI $i$
Then the total loss:
\[L = \frac{1}{N_{cls}} \sum_i \text{CE}(\hat{p}_i, c_i) + \frac{1}{N_{reg}} \sum_i \mathbb{1}_{[c_i > 0]} \cdot \text{SmoothL1}(\hat{t}_{c_i}, t_i^*)\]

- Classification: Cross-Entropy over K+1 classes
- Regression: Smooth L1 for positive RoIs only
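Putting the two terms together, a sketch of the detection loss with class-specific delta selection (hypothetical shapes: cls_scores (N, K+1), bbox_preds (N, (K+1)*4), labels (N,), reg_targets (N, 4)):

import torch
import torch.nn.functional as F

def detection_loss(cls_scores, bbox_preds, labels, reg_targets):
    loss_cls = F.cross_entropy(cls_scores, labels)  # over K+1 classes
    pos = labels > 0                                # positives only
    if not pos.any():
        return loss_cls
    # pick the 4 deltas belonging to each positive RoI's GT class
    deltas = bbox_preds.view(len(labels), -1, 4)[pos, labels[pos]]
    loss_reg = F.smooth_l1_loss(deltas, reg_targets[pos])
    return loss_cls + loss_reg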
Summary Diagram
Anchors → RPN (cls + bbox) → Deltas + Scores → Proposals (RoIs)
→ RoIAlign (7×7×256)
→ FC → Classification Head → Class probs
→ FC → BBox Head → K-class regression
6. Losses
6.1 Detection Classification:
A natural question is how detection is performed in Mask R-CNN and whether anchors are involved at that stage.
The detection head does not use anchors. Anchors are only used in the RPN (Region Proposal Network). The detection head operates on refined RoIs (region proposals) generated from RPN outputs.
1. Anchors Are Used Only in RPN
- Anchors are generated at each location of FPN feature maps.
- For each anchor, the RPN predicts an objectness score and bbox regression offsets.
- Top-scoring boxes (after NMS) are selected as region proposals (RoIs).

These RoIs are then passed to the next stage: the detection head.
2. Detection Head Receives Aligned RoIs
- RoIs are refined bounding boxes, not anchor templates.
- They are extracted via RoIAlign into fixed-size features (e.g., 7×7×256).
- These features are passed to the detection head.
Detection Head
Inputs:
- RoI-aligned features of shape (N, 256, 7, 7)
Architecture:
- 2 Fully Connected (FC) layers
- Output:
  - Classification logits over K+1 classes (81 for COCO, including background)
  - Class-specific bbox deltas (81 × 4 values for COCO)
Post-processing:
- Apply bbox deltas to RoIs to get refined boxes.
- Run softmax over classification logits.
- Apply NMS per class to suppress redundant detections.
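Per-class NMS can be done in one call with torchvision's batched_nms, which uses the class index as a group id so boxes of different classes never suppress each other (a sketch with random boxes):

import torch
from torchvision.ops import batched_nms

boxes = torch.rand(300, 4) * 800   # refined boxes as (x1, y1, x2, y2)
boxes[:, 2:] += boxes[:, :2]       # ensure x2 > x1 and y2 > y1
scores = torch.rand(300)
classes = torch.randint(0, 80, (300,))
keep = batched_nms(boxes, scores, classes, iou_threshold=0.5)  # kept indices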
Stage | Uses Anchors? | Description |
---|---|---|
RPN | Yes | Anchors + regression → Proposals |
Detection Head | No | Operates on RPN outputs (RoIs); classifies and refines boxes |
So detection is not anchor-based but proposal-based; the proposals themselves are refined anchors produced by the RPN.
\[L_{cls} = - \sum_{i} y_i \log(p_i) \quad \text{(softmax cross-entropy)}\]

6.2 BBox Regression:
\[L_{box} = \text{SmoothL1}(t_i, t_i^*)\]

6.3 Mask Segmentation:
- Only 1 out of K channels is trained per RoI (the GT class)
- So a softmax over channels is invalid
\(L_{mask} = \frac{1}{m^2} \sum_{i,j} \text{BCE}(M_k[i,j], M_k^*[i,j])\)
Why Mask R-CNN Uses BCE Instead of Softmax + Categorical Cross-Entropy (CCE)
Mask R-CNN uses Binary Cross-Entropy (BCE) for its mask head loss, not softmax with categorical cross-entropy (CCE). Here's why:
Architecture Choice: One Binary Mask per Class
- The mask head outputs K channels (e.g., K=81 for COCO), each of size 28×28.
- Each channel represents a binary mask for one class.
- During training, only the channel corresponding to the ground-truth class is supervised (see the sketch below).
  - For example, if a given RoI is labeled as "person" (class 1), only the "person" channel is trained against the binary "person" mask.
  - The other 80 channels are ignored.
- This is because RoIs can overlap, so masks for different RoIs may cover the same image pixels; the task is not a single categorical segmentation but effectively multi-label.
- The mask loss is computed separately for each RoI.
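A sketch of this selective supervision: gather each RoI's GT-class channel, then apply per-pixel BCE (hypothetical shapes; mask_loss is not the reference implementation):

import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    # mask_logits: (N, K, 28, 28); gt_classes: (N,); gt_masks: (N, 28, 28) binary
    idx = torch.arange(mask_logits.shape[0])
    logits = mask_logits[idx, gt_classes]  # (N, 28, 28): GT-class channel only
    return F.binary_cross_entropy_with_logits(logits, gt_masks.float())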
Why Softmax + CCE Would Be Invalid
Softmax + categorical cross-entropy assumes:
- All output channels represent mutually exclusive class probabilities.
- A softmax across channels produces a normalized distribution over all K classes at each pixel.
- You supervise every pixel to predict exactly one of the K classes.
But in Mask R-CNN:
- You do not train all K channels; only the one for the GT class is used.
- The other channels are not supervised, so the softmax is ill-defined.
- This violates the assumption required for softmax + CCE to work (i.e., supervision for all classes per pixel).
Class-Agnostic Segmentation Head
When the number of classes is very large (e.g., $K = 1000$), predicting $K$ mask channels makes the model too heavy. One solution is a class-agnostic mask head.
What Happens with a Class-Agnostic Mask Head?
Output Shape
If your mask head is class-agnostic, it outputs:

\[(N, 1, 28, 28)\]

Where:
- $N$ = number of RoIs (say 100 or 1000)
- Each RoI is associated with one object instance in the scene
So this means:
- You're predicting N masks at once, one per RoI
- No for-loop is needed: it's all one batched tensor
Why It's Still Scalable and Fast
- Mask prediction is a batched operation.
- The input to the mask head is the RoIAlign output, of shape (N, 256, 14, 14) (or 7×7).
- It goes through conv layers → upsample → output of shape (N, 1, 28, 28), one mask per RoI.
This is similar to how object detection works in parallel for all RoIs.
Difference vs. Per-Class Mask (Standard Mask R-CNN)
Head Type | Output Shape | Loss Computed On | Notes |
---|---|---|---|
Class-specific | (N, K, 28, 28) | Only GT class mask | One-hot channel mask (train only one) |
Class-agnostic | (N, 1, 28, 28) | Single mask per RoI | No per-class modeling |
So for 100 RoIs, you still predict 100 masks in parallel:
- Just one channel per mask instead of K channels
- Memory: N × 1 × 28 × 28 vs N × K × 28 × 28
Common Concern: Multiple Objects in the Scene
"If we have 100 objects in the scene, can we predict 100 masks?"
Yes: each RoI corresponds to one object, and Mask R-CNN assigns each RoI to a GT box during training.
- So if there are 100 objects, and the RPN proposes 100 good regions, youโll have 100 RoIs.
- Each RoI gets one mask; all are predicted simultaneously.
Concern | Reality |
---|---|
"One mask output means only 1 object?" | No: it's one mask per RoI, and RoIs correspond to objects |
"Need a for-loop to process RoIs?" | No: it's fully batched; all RoIs are processed in parallel |
"Why is it faster?" | Smaller output tensor, no class-specific branching |
Why Mask R-CNN Uses K Channels in the Segmentation Head
The original Mask R-CNN design by He et al. uses K binary masks per RoI, where:
- $K$ = number of foreground classes (e.g., 80 in COCO)
- Each RoI outputs one binary mask for each class → shape:

\[(N, K, 28, 28)\]
But here's the crucial point:
At training time, only the mask corresponding to the GT class is supervised. At inference time, only the mask corresponding to the predicted class is used.
Why This Class-Specific Design Was Chosen
1. Mask shape is often class-specific.
   - The shapes of a person, a car, and a cat differ significantly.
   - Having one mask channel per class allows the model to specialize shape priors.
2. Shared computation, but specialized outputs.
   - The same conv layers are used, but the final conv output has $K$ channels.
   - Only one of them is used per RoI, but the learning capacity is still valuable.
So Do We Need K Channels?
No, not strictly. You can absolutely modify Mask R-CNN to use:
Class-Agnostic Mask Head:

\[(N, 1, 28, 28)\]

And predict just one mask per RoI, regardless of class (a sketch follows the list below).
This is:
- Lighter
- Faster
- Often almost as accurate
- Especially useful when $K$ is large (e.g., 1000+ classes)
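As a sketch, the class-agnostic variant is a one-line change to the MaskHead defined earlier (assuming that class is in scope):

import torch.nn as nn

class AgnosticMaskHead(MaskHead):
    """Same convs and upsampling, but a single class-agnostic output channel."""
    def __init__(self):
        super().__init__()
        self.mask = nn.Conv2d(256, 1, 1)  # (N, 1, 28, 28) instead of (N, K, 28, 28)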
Trade-off Table
Design | Output Shape | Memory | Accuracy | Class-specific shape? |
---|---|---|---|---|
Per-class (original) | (N, K, 28, 28) | High | Better | Yes |
Class-agnostic | (N, 1, 28, 28) | Low | Slightly lower | No |
In Practice
- Detectron2, MMDetection, and other frameworks offer a class-agnostic mode.
- Empirically, class-agnostic masks work well in many settings, especially with:
  - Few training samples per class
  - Lightweight models
  - Edge deployment
Conclusion
Mask R-CNN does not fundamentally need K channels in the mask output.
It was a design choice for better accuracy and class-specific modeling. Class-agnostic heads are a valid, simpler, and often preferable alternative when scalability matters.
Why BCE Works for Mask R-CNN
- BCE is applied per-pixel, per-class, independently.
- It models: "Is this pixel foreground for this class?" (binary yes/no).
- Since we train only the GT class channel, BCE perfectly fits this logic.
- It allows us to treat each mask as a separate binary segmentation task.
What Happens If You Use Softmax + CCE Anyway?
If you incorrectly apply softmax across channels:
- It would force the network to produce a probability distribution over all classes per pixel.
- Since only one class is supervised (the GT class), the model has no ground truth for the other K-1 classes.
- This leads to unstable training, false gradients, and degraded performance.
In short: Softmax + CCE is semantically wrong for per-RoI binary masks trained only for GT class. BCE is correct because it models the actual training behavior: 1 mask per GT class per RoI.
To redesign Mask R-CNN to use softmax over mask outputs with categorical cross-entropy (CCE), you must change how masks are predicted and supervised. This fundamentally alters the mask head architecture and training logic.
Objective: Softmax Mask Prediction with CCE
We want each pixel in the predicted mask to output a single class label (as in semantic segmentation). This requires:
Requirement | Current Mask R-CNN | Modified Design (Softmax) |
---|---|---|
Per-pixel prediction | Binary (1 per class) | Categorical (1-of-K) |
Channels in mask head output | K binary masks (B×K×28×28) | 1 categorical mask (B×K×28×28) |
Supervised channels | Only GT class mask | All pixels with class label |
Loss | BCE on GT class | CCE over softmax(K) |
Design Changes Required
1. Mask Head Output Remains (B, K, 28, 28)
No change to the output shape: we still predict K channels per RoI.
2. Supervision Must Change
You must now provide a categorical mask label for each pixel in the RoI crop. That is, instead of training a binary mask only for the GT class, you train a full pixel-wise class map with values in [0, ..., K-1].
This is hard because:
- Each RoI corresponds to only one object, which is of a single class.
- There is no semantic reason for pixels within an RoI to have multiple class labels.
To make this work:
- Youโd need to merge overlapping masks and assign per-pixel class labels (like semantic segmentation).
- This makes it no longer an instance segmentation problem, but semantic segmentation.
3. Loss Function
Replace BCE with softmax + cross-entropy:
# logits: (B, K, 28, 28)
# targets: (B, 28, 28) with values in [0, K-1]
loss = nn.CrossEntropyLoss()(logits, targets)
4. Mask Target Construction
Instead of a binary mask per RoI:
- You would need to project instance masks into a shared canvas per RoI.
- And resolve overlaps into a single class label per pixel.
This contradicts the core instance-level logic of Mask R-CNN.
Why This Is Usually Not Done
- RoIs are defined per instance, not for the whole image.
- There is no natural multi-class pixel label per RoI.
- The model should only answer "Where is the mask for this object?", not "Which class does each pixel belong to?"; the latter is a semantic segmentation task, not instance segmentation.
Summary
To support softmax + CCE for segmentation masks in Mask R-CNN:
Step | Change Required |
---|---|
1 | Mask targets must be class labels per pixel |
2 | Mask loss becomes CCE instead of BCE |
3 | Every mask channel must be supervised jointly |
4 | Overlapping instances must be resolved by pixel class |
But this destroys instance separation, which is the entire point of Mask R-CNN. This change essentially turns the model into a semantic segmentation network, not an instance segmentation one.
Overall Summary
- RPN has its own cls + reg heads like SSD
- Anchors matched using IoU; mini-batches balanced by sampling (SSD balances with hard negative mining instead)
- RoIAlign allows precise fixed-size feature extraction
- Only the GT class mask is supervised, so BCE is correct
- Softmax + CCE would require all K masks to compete, which is not the design here