Mask R-CNN: Extended Technical Deep Dive Tutorial (Fully Corrected)


🎯 Tutorial Objectives

This tutorial provides an extensive understanding of the Mask R-CNN architecture by dissecting every component of its pipeline. You will:

  • See annotated PyTorch code for every network block (Backbone, FPN, RPN, RoIAlign, Detection & Mask Heads).
  • Understand how each tensor transforms through the network, with precise shape annotations.
  • Learn about anchor generation, matching, and the similarities/differences between SSD and Mask R-CNN.
  • Deep dive into loss function math and logic, especially focusing on segmentation loss choices.
  • Get visual and conceptual clarity about how the model is trained, evaluated, and deployed.

🧱 1. Backbone – ResNet50/101

The backbone of Mask R-CNN is a deep residual network, most commonly ResNet-50 or ResNet-101. It acts as a powerful feature extractor. ResNet is composed of stacked residual blocks that mitigate vanishing gradients in very deep networks.

We take intermediate outputs after certain layers to form a feature hierarchy:

  • C2 from layer1 (stride = 4)
  • C3 from layer2 (stride = 8)
  • C4 from layer3 (stride = 16)
  • C5 from layer4 (stride = 32)

These are fed into the FPN.

from torchvision.models import resnet50, ResNet50_Weights
import torch.nn as nn

class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # ImageNet-pretrained
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # (B, 64, H/4, W/4)
        self.stage2 = net.layer1  # (B, 256, H/4, W/4)
        self.stage3 = net.layer2  # (B, 512, H/8, W/8)
        self.stage4 = net.layer3  # (B, 1024, H/16, W/16)
        self.stage5 = net.layer4  # (B, 2048, H/32, W/32)

    def forward(self, x):
        x = self.stage1(x)
        c2 = self.stage2(x)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c2, c3, c4, c5

These features gain semantic depth as spatial resolution decreases. They are passed to the Feature Pyramid Network.
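As a quick sanity check, here is a minimal usage sketch (assuming the ResNetBackbone class above; shapes correspond to an 800×800 input):

import torch

backbone = ResNetBackbone().eval()
with torch.no_grad():
    c2, c3, c4, c5 = backbone(torch.randn(1, 3, 800, 800))
for name, f in zip(["C2", "C3", "C4", "C5"], [c2, c3, c4, c5]):
    print(name, tuple(f.shape))
# C2 (1, 256, 200, 200)   C3 (1, 512, 100, 100)
# C4 (1, 1024, 50, 50)    C5 (1, 2048, 25, 25)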


🧭 2. FPN – Feature Pyramid Network

FPN is used to construct a pyramid of features with strong semantics at all levels. It helps with detecting objects of different sizes.

Key Properties:

  • Top-down pathway with upsampling
  • Lateral connections to maintain spatial structure
  • Produces P2, P3, P4, and P5, each with 256 channels

class FPN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lat5 = nn.Conv2d(2048, 256, 1)
        self.lat4 = nn.Conv2d(1024, 256, 1)
        self.lat3 = nn.Conv2d(512, 256, 1)
        self.lat2 = nn.Conv2d(256, 256, 1)

        self.smooth4 = nn.Conv2d(256, 256, 3, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, 3, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, 3, padding=1)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lat5(c5)
        # Top-down pathway: upsample, then merge with the lateral 1x1 outputs.
        p4 = self.lat4(c4) + nn.functional.interpolate(p5, size=c4.shape[-2:])
        p3 = self.lat3(c3) + nn.functional.interpolate(p4, size=c3.shape[-2:])
        p2 = self.lat2(c2) + nn.functional.interpolate(p3, size=c2.shape[-2:])
        # 3x3 smoothing on every merged map reduces upsampling aliasing.
        return [self.smooth2(p2), self.smooth3(p3), self.smooth4(p4), p5]

Output Shapes:

Assuming input = (B, 3, 800, 800):

  • P2 = (B, 256, 200, 200)
  • P3 = (B, 256, 100, 100)
  • P4 = (B, 256, 50, 50)
  • P5 = (B, 256, 25, 25)
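These numbers can be verified by chaining the two modules above; a minimal sketch, reusing the c2..c5 tensors from the backbone check:

fpn = FPN()
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)
print(tuple(p2.shape), tuple(p5.shape))  # (1, 256, 200, 200) (1, 256, 25, 25)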

๐Ÿ›ฐ๏ธ 3. Region Proposal Network (RPN)

The RPN is a fully convolutional network that proposes candidate object bounding boxes.

Architecture:

  • Shared 3×3 conv → ReLU
  • Branch 1: 1×1 conv to predict objectness scores
  • Branch 2: 1×1 conv to predict bounding box regression deltas

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.cls_logits = nn.Conv2d(256, num_anchors, 1)  # one objectness logit per anchor (sigmoid + BCE, matching the loss below)
        self.bbox_pred = nn.Conv2d(256, num_anchors * 4, 1)

    def forward(self, feats):
        cls_logits, bbox_preds = [], []
        for feat in feats:
            t = nn.functional.relu(self.shared(feat))
            cls_logits.append(self.cls_logits(t))
            bbox_preds.append(self.bbox_pred(t))
        return cls_logits, bbox_preds

Anchor Matching and Labeling:

  • Anchor boxes (default boxes) are created at each feature-map location (a minimal generation-and-matching sketch follows this list).
  • Positive: IoU ≥ 0.7 with a GT box
  • Negative: IoU ≤ 0.3 with all GT boxes
  • Ignore: otherwise
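Here is a minimal sketch of how generation and matching could look for a single feature level; make_anchors and label_anchors are illustrative names, not library functions, while torchvision.ops.box_iou is a real utility:

import torch
from torchvision.ops import box_iou

def make_anchors(fh, fw, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Centers of each feature-map cell, in image coordinates.
    ys = (torch.arange(fh) + 0.5) * stride
    xs = (torch.arange(fw) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    anchors = []
    for s in scales:
        for r in ratios:                      # r = w / h
            w, h = s * (r ** 0.5), s / (r ** 0.5)
            anchors.append(torch.stack(
                [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1))
    return torch.cat([a.reshape(-1, 4) for a in anchors])  # (fh*fw*9, 4)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    iou = box_iou(anchors, gt_boxes)          # (A, G) pairwise IoU
    max_iou, matched_gt = iou.max(dim=1)
    labels = torch.full((len(anchors),), -1, dtype=torch.long)  # -1 = ignore
    labels[max_iou <= neg_thr] = 0            # negative
    labels[max_iou >= pos_thr] = 1            # positive
    labels[iou.argmax(dim=0)] = 1             # best anchor per GT is also positive
    return labels, matched_gt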

Similarity to SSD:

Both use anchors, but:

  • SSD has multi-class classification per anchor
  • RPN only classifies object vs background

Loss:

\[L_{\text{RPN}} = \frac{1}{N_{cls}} \sum \text{BCE}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum p_i^* \text{SmoothL1}(t_i, t_i^*)\]

Sampling and Hard Negative Mining:

Faster/Mask R-CNN typically samples a fixed minibatch of anchors per image (e.g. 256) with up to a 1:1 pos:neg ratio; SSD instead uses hard negative mining, keeping the highest-loss negatives at a 1:3 pos:neg ratio. A sampling sketch follows.
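A minimal sampling sketch under those assumptions (the function name and defaults are illustrative):

import torch

def sample_rpn_minibatch(labels, batch_size=256, pos_fraction=0.5):
    # labels: (A,) with 1 = positive, 0 = negative, -1 = ignore.
    pos, neg = torch.where(labels == 1)[0], torch.where(labels == 0)[0]
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    sampled = torch.full_like(labels, -1)     # everything ignored by default
    sampled[pos[torch.randperm(len(pos))[:num_pos]]] = 1
    sampled[neg[torch.randperm(len(neg))[:num_neg]]] = 0
    return sampled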


🧾 4. RoIAlign

RoIAlign improves over RoIPool by removing quantization. It precisely extracts fixed-size (7×7) feature maps from input regions.

Key Idea:

  • Divide RoI into bins (7×7)
  • For each bin, sample at 4 positions
  • Apply bilinear interpolation on feature map

Output shape: (N, 256, 7, 7) for N RoIs
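This operation ships with torchvision as torchvision.ops.roi_align; a minimal sketch on a P2-like feature map (the box coordinates are made up for illustration):

import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 200, 200)                    # e.g. P2 at stride 4
rois = torch.tensor([[0., 32.7, 54.1, 191.3, 240.9]])   # (batch_idx, x1, y1, x2, y2)
out = roi_align(feat, rois, output_size=(7, 7),
                spatial_scale=1 / 4,                    # image coords -> P2 coords
                sampling_ratio=2)                       # 2x2 samples per bin
print(out.shape)                                        # torch.Size([1, 256, 7, 7])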


🧠 5. Heads

Box Head (Classification + Regression)

class BoxHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(256*7*7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls = nn.Linear(1024, 81)       # 80 COCO classes + background
        self.bbox = nn.Linear(1024, 81 * 4)  # class-specific box deltas

    def forward(self, x):
        x = x.flatten(1)
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        return self.cls(x), self.bbox(x)

Mask Head (Per-class Binary Masks)

class MaskHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU()
        )
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask = nn.Conv2d(256, 81, 1)  # K = 81 (80 COCO classes + background)

    def forward(self, x):
        x = self.convs(x)
        x = nn.functional.relu(self.upsample(x))
        return self.mask(x)  # (B, 81, 28, 28)
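A quick shape check wiring both heads to dummy RoI feature batches (the 14×14 mask-branch input reflects the common choice of a larger RoIAlign output for masks):

import torch

roi_feats = torch.randn(100, 256, 7, 7)      # 100 RoIs after RoIAlign
cls_logits, bbox_deltas = BoxHead()(roi_feats)
print(cls_logits.shape, bbox_deltas.shape)   # (100, 81) (100, 324)

mask_logits = MaskHead()(torch.randn(100, 256, 14, 14))
print(mask_logits.shape)                     # (100, 81, 28, 28)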

Let's walk through the complete flow from RPN to the final detection and classification heads in Mask R-CNN, step by step, with mathematical formulations and label generation logic.


📊 Flow: From RPN → RoIAlign → Detection Head


๐Ÿ›ฐ๏ธ 1. RPN (Region Proposal Network)

🔹 Input

  • Multi-scale FPN feature maps: [P2, P3, P4, P5] (shape: B×256×H×W per level)

🔹 Anchor Generation

  • At each location in each FPN level, generate k anchors.

    • E.g. 3 scales × 3 aspect ratios → k = 9
  • Total number of anchors: $A = \sum_l H_l \cdot W_l \cdot k$

🔹 Predictions

For each anchor $a_i$, the RPN outputs:

  • Objectness score $p_i \in [0,1]$
  • Box deltas \(t_i = (\hat{t}_{x_i}, \hat{t}_{y_i}, \hat{t}_{w_i}, \hat{t}_{h_i})\)

🔹 Training Labels for RPN

Each anchor $a_i$ is:

  • Assigned a label $p_i^* \in \{1, 0, -1\}$

    • 1 → positive (IoU ≥ 0.7 with a GT box)
    • 0 → negative (IoU ≤ 0.3 with all GT boxes)
    • -1 → ignore (between thresholds)
  • Assigned a GT box $g_i = (x_{gt}, y_{gt}, w_{gt}, h_{gt})$
  • Compute regression target deltas:
\[t_{xi}^* = \frac{x_{gt} - x_{a}}{w_a},\quad t_{yi}^* = \frac{y_{gt} - y_{a}}{h_a},\quad t_{wi}^* = \log\left(\frac{w_{gt}}{w_a}\right),\quad t_{hi}^* = \log\left(\frac{h_{gt}}{h_a}\right)\]
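A minimal sketch of this target encoding for corner-format (x1, y1, x2, y2) boxes (encode_deltas is an illustrative name):

import torch

def encode_deltas(anchors, gt_boxes):
    # Convert corners to (center, size), then apply the formulas above.
    wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
    wg, hg = gt_boxes[:, 2] - gt_boxes[:, 0], gt_boxes[:, 3] - gt_boxes[:, 1]
    xg, yg = gt_boxes[:, 0] + wg / 2, gt_boxes[:, 1] + hg / 2
    return torch.stack([(xg - xa) / wa, (yg - ya) / ha,
                        torch.log(wg / wa), torch.log(hg / ha)], dim=1)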

🔹 RPN Loss

Let $N_{cls}$ and $N_{reg}$ be normalization constants (in the paper, the sampled minibatch size and the number of anchor locations, respectively):

\[L_{RPN} = \frac{1}{N_{cls}} \sum_i \text{BCE}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \cdot \text{SmoothL1}(t_i, t_i^*)\]

Only positive anchors contribute to the regression loss.


📦 2. RoI Generation (Region Proposals)

After RPN:

  • Apply predicted deltas to anchors to get proposal boxes:
\[x_p = \hat{t}_x \cdot w_a + x_a,\quad y_p = \hat{t}_y \cdot h_a + y_a,\quad w_p = w_a \cdot e^{\hat{t}_w},\quad h_p = h_a \cdot e^{\hat{t}_h}\]
  • Apply NMS (e.g. IoU threshold 0.7)
  • Keep top-N proposals (e.g. 1000 during training, 300 during test)
  • These boxes are the RoIs (a decode-and-NMS sketch follows this list)
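Here is a minimal decode-and-filter sketch of those steps; torchvision.ops.nms is a real utility, while the function name and defaults are illustrative:

import torch
from torchvision.ops import nms

def propose(anchors, deltas, scores, iou_thr=0.7, top_n=1000):
    wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
    # Invert the encoding: deltas -> absolute box centers and sizes.
    xp, yp = deltas[:, 0] * wa + xa, deltas[:, 1] * ha + ya
    wp, hp = wa * deltas[:, 2].exp(), ha * deltas[:, 3].exp()
    boxes = torch.stack([xp - wp / 2, yp - hp / 2,
                         xp + wp / 2, yp + hp / 2], dim=1)
    keep = nms(boxes, scores, iou_thr)[:top_n]   # NMS returns score-sorted indices
    return boxes[keep], scores[keep]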

🧾 3. RoIAlign

For each RoI:

  • Map the proposal box to its corresponding feature map level $P_l$

    • Usually done using:

      \[l = \lfloor 4 + \log_2(\sqrt{wh} / 224) \rfloor\]
  • Crop the feature map using bilinear interpolation into a fixed 7 × 7 grid

    Each RoI is pooled from its assigned feature map into a 256 × 7 × 7 tensor; with e.g. 1000 RoIs per image, this yields 1000 such maps.

    Output:

Tensor of shape (N, 256, 7, 7), where N = total number of RoIs across all images in the batch
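The level-assignment formula above translates directly into code; a minimal sketch (assign_fpn_level is an illustrative name, with levels clamped to the available range P2..P5):

import torch

def assign_fpn_level(boxes, k0=4, canonical=224, k_min=2, k_max=5):
    # boxes: (N, 4) in (x1, y1, x2, y2) image coordinates.
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    k = torch.floor(k0 + torch.log2((w * h).sqrt() / canonical))
    return k.clamp(k_min, k_max).long()   # index of P_l for each RoI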


🎯 4. Detection Head (Classification + BBox Regression)

Inputs:

  • RoI-aligned features: $R_i \in \mathbb{R}^{256 \times 7 \times 7}$

Network:

  • Two FC layers → feature vector $f_i \in \mathbb{R}^{1024}$
  • Heads:

    • Classification: $\hat{p}_i \in \mathbb{R}^{K+1}$ → softmax over classes
    • Regression: $\hat{t}_i \in \mathbb{R}^{(K+1) \times 4}$ → per-class bbox deltas

๐Ÿท๏ธ 5. Training Labels for Detection Head

For each RoI $r_i$:

1. Match to GT box using IoU

  • If IoU ≥ 0.5 → positive

    • Assign class label $c_i \in [1, K]$
    • Assign matched GT box $g_i$
  • If IoU < 0.5 → background

    • Assign label $c_i = 0$
    • No regression target

2. Regression Target Deltas

For positive RoIs $r_i = (x_r, y_r, w_r, h_r)$:

\[t_{xi}^* = \frac{x_{gt} - x_r}{w_r},\quad t_{yi}^* = \frac{y_{gt} - y_r}{h_r},\quad t_{wi}^* = \log\left(\frac{w_{gt}}{w_r}\right),\quad t_{hi}^* = \log\left(\frac{h_{gt}}{h_r}\right)\]

Note: these targets are class-specific; only the GT class channel is trained.


🧮 6. Detection Loss

Let:

  • $\hat{p}_i$: predicted class scores
  • $c_i$: GT class
  • $\hat{t}_{c_i}$: predicted deltas for class $c_i$
  • $t_i^*$: GT deltas for RoI $i$

Then the total loss:

\[L = \frac{1}{N_{cls}} \sum_i \text{CE}(\hat{p}_i, c_i) + \frac{1}{N_{reg}} \sum_i \mathbb{1}_{[c_i > 0]} \cdot \text{SmoothL1}(\hat{t}_{c_i}, t_i^*)\]
  • Classification: Cross-Entropy over K+1 classes
  • Regression: Smooth L1 for positive RoIs only
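A minimal sketch of this two-term loss (detection_loss is an illustrative name; the per-class delta selection uses advanced indexing):

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, bbox_deltas, labels, reg_targets):
    # cls_logits: (N, K+1); bbox_deltas: (N, (K+1)*4);
    # labels: (N,) in [0, K]; reg_targets: (N, 4)
    loss_cls = F.cross_entropy(cls_logits, labels)
    pos = labels > 0
    if not pos.any():
        return loss_cls
    deltas = bbox_deltas.view(-1, cls_logits.shape[1], 4)
    deltas_pos = deltas[pos, labels[pos]]    # deltas of each RoI's GT class
    return loss_cls + F.smooth_l1_loss(deltas_pos, reg_targets[pos])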

✅ Summary Diagram

Anchors → RPN (cls + bbox) → Deltas + Scores → Proposals (RoIs)
     → RoIAlign (7×7×256)
         → FC → Classification Head → Class probs
         → FC → BBox Head → K-class regression

📉 6. Losses

6.1 Detection Classification:

Before writing out the losses, it is worth clarifying how detection is performed in Mask R-CNN and whether anchors are involved at that stage.


The detection head does not use anchors. Anchors are only used in the RPN (Region Proposal Network). The detection head operates on refined RoIs (region proposals) generated from RPN outputs.

1. Anchors Are Used Only in RPN

  • Anchors are generated at each location of FPN feature maps.
  • For each anchor:

    • RPN predicts: objectness score + bbox regression offsets.
  • Top scoring boxes (after NMS) are selected as region proposals (RoIs).

These RoIs are then passed to the next stage: the detection head.


2. Detection Head Receives Aligned RoIs

  • RoIs are refined bounding boxes, not anchor templates.
  • They are extracted via RoIAlign into fixed-size features (e.g., 7×7×256).
  • These features are passed to the detection head.

📦 Detection Head

Inputs:

  • RoI-aligned features of shape (B, 256, 7, 7)

Architecture:

  • 2 Fully Connected (FC) layers
  • Output:

    • Classification logits over the K + 1 classes (80 COCO classes + background)
    • Class-specific bbox deltas (81 × 4 values in COCO)

Post-processing:

  • Apply bbox deltas to RoIs to get refined boxes.
  • Run softmax over classification logits.
  • Apply NMS per class to suppress redundant detections.

Stage            Uses Anchors?   Description
RPN              ✅ Yes          Anchors + regression → Proposals
Detection Head   ❌ No           Operates on RPN outputs (RoIs), classifies and refines boxes

So detection is not anchor-based but proposal-based; the proposals themselves are refined anchor regressions.


\[L_{cls} = - \sum_{i} y_i \log(p_i) \quad \text{(softmax cross-entropy)}\]

6.2 BBox Regression:

\[L_{box} = \text{SmoothL1}(t_i, t_i^*)\]

6.3 Mask Segmentation:

  • Only 1 out of K channels is trained per RoI (the GT class)
  • So a softmax over channels is invalid

\(L_{mask} = \frac{1}{m^2} \sum_{i,j} \text{BCE}(M_k[i,j], M_k^*[i,j])\)
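A minimal sketch of this per-RoI, GT-class-only BCE (mask_loss is an illustrative name; it assumes mask targets are already cropped and resized to 28×28):

import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, labels):
    # mask_logits: (N, K, 28, 28); gt_masks: (N, 28, 28) binary float masks;
    # labels: (N,) GT class index per positive RoI.
    idx = torch.arange(mask_logits.shape[0])
    gt_class_logits = mask_logits[idx, labels]   # (N, 28, 28): one channel per RoI
    return F.binary_cross_entropy_with_logits(gt_class_logits, gt_masks)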

Why Mask R-CNN Uses BCE Instead of Softmax + Categorical Cross-Entropy (CCE)

Mask R-CNN uses Binary Cross-Entropy (BCE) for its mask head loss, not softmax with categorical cross-entropy (CCE). Here's why:


🧩 Architecture Choice: One Binary Mask per Class

  • The mask head outputs K channels (e.g., K=81 for COCO), each of size 28x28.
  • Each channel represents a binary mask for one class.
  • During training, only the channel corresponding to the ground-truth class is supervised.

    • For example, if a given RoI is labeled "person" (class 1), only the 2nd channel is trained, using the binary mask for "person".
    • The other 80 channels are ignored.
    • RoIs can overlap, and so can the masks predicted for different RoIs; per-pixel predictions are therefore not mutually exclusive across classes (multi-label rather than categorical).
    • The mask loss is computed separately for each RoI.

🚫 Why Softmax + CCE Would Be Invalid

Softmax + categorical cross-entropy assumes:

  1. All output channels represent mutually exclusive class probabilities.
  2. A softmax across channels produces a normalized distribution over all K classes at each pixel.
  3. You supervise every pixel to predict exactly one of the K classes.

But in Mask R-CNN:

  • You do not train all K channels. Only the one for the GT class is used.
  • The other channels are not supervised, so the softmax is ill-defined.
  • This violates the assumption required for softmax + CCE to work (i.e., supervision for all classes per pixel).

Class-Agnostic Segmentation Head

When the number of classes is very large (e.g. $K = 1000$), predicting $K$ mask channels per RoI makes the model too heavy. One solution is a class-agnostic mask head.

๐Ÿ” What Happens with a Class-Agnostic Mask Head?

✅ Output Shape

If your mask head is class-agnostic, it outputs:

\[\text{(N, 1, 28, 28)}\]

Where:

  • $N$ = number of RoIs (say 100 or 1000)
  • Each RoI is associated with one object instance in the scene

So this means:

  • You're predicting N masks at once, one per RoI
  • No for-loop is needed: it's all in one batched tensor

💡 Why It's Still Scalable and Fast

  • Mask prediction is a batched operation
  • The input to the mask head is:

    \[\text{RoIAlign output} \rightarrow (N, 256, 14, 14) \text{ or } (N, 256, 7, 7)\]
  • It goes through conv layers → upsample → output:

    \[(N, 1, 28, 28) \quad \text{(one mask per RoI)}\]

This is similar to how object detection works in parallel for all RoIs.


📌 Difference vs. Per-Class Mask (Standard Mask R-CNN)

Head Type        Output Shape     Loss Computed On      Notes
Class-specific   (N, K, 28, 28)   Only GT class mask    One-hot channel mask (train only one)
Class-agnostic   (N, 1, 28, 28)   Single mask per RoI   No per-class modeling

So for 100 RoIs, you still predict 100 masks in parallel:

  • Just one channel per mask instead of K channels
  • Memory: N × 1 × 28 × 28 vs N × K × 28 × 28
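As a sketch, the MaskHead from Section 5 becomes class-agnostic by changing only its final layer (a hypothetical variant under the assumptions above, not the reference implementation):

import torch.nn as nn

class ClassAgnosticMaskHead(nn.Module):
    def __init__(self):
        super().__init__()
        convs = []
        for _ in range(4):                    # same trunk as the K-class head
            convs += [nn.Conv2d(256, 256, 3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask = nn.Conv2d(256, 1, 1)      # 1 channel instead of K

    def forward(self, x):
        x = self.convs(x)
        x = nn.functional.relu(self.upsample(x))
        return self.mask(x)                   # (N, 1, 28, 28)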

🤔 A Common Concern: Multiple Objects in the Scene

"If we have 100 objects in the scene, can we predict 100 masks?"

✅ Yes: each RoI corresponds to one object, and Mask R-CNN assigns each RoI to a GT box during training.

  • So if there are 100 objects, and the RPN proposes 100 good regions, you'll have 100 RoIs.
  • Each RoI gets one mask, predicted simultaneously.

Concern                                  Reality
"One mask output means only 1 object?"   ❌ No: one mask per RoI, and RoIs = objects
"Need a for-loop to process RoIs?"       ❌ No: fully batched, all RoIs processed in parallel
"Why is it faster?"                      ✅ Smaller output tensor, no class-specific branching

🎯 Why Mask R-CNN Uses K Channels in the Segmentation Head

The original Mask R-CNN design by He et al. uses K binary masks per RoI, where:

  • $K$ = number of foreground classes (80 in COCO; many implementations add a background channel, giving the 81 used in the code above)
  • Each RoI outputs one binary mask per class → shape:

    \[(N, K, 28, 28)\]

But here's the crucial point:

At training time, only the mask corresponding to the GT class is supervised. At inference time, only the mask corresponding to the predicted class is used.


🧠 Why This Class-Specific Design Was Chosen

  1. Mask shape is often class-specific

    • The shape of a person, car, and cat differ significantly.
    • Having one mask branch per class allows the model to specialize shape priors.
  2. Shared computation, but specialized outputs

    • The same conv layers are used, but the final conv output has $K$ channels.
    • Only one of them is used, but the learning capacity is still valuable.

✅ So Do We Need K Channels?

No, not strictly. You can absolutely modify Mask R-CNN to use:

🔹 Class-Agnostic Mask Head:

\[(N, 1, 28, 28)\]

And predict just one mask per RoI, regardless of class.

This is:

  • Lighter
  • Faster
  • Often almost as accurate
  • Especially useful when $K$ is large (e.g., 1000+ classes)

โš–๏ธ Trade-off Table

Design                 Output Shape     Memory   Accuracy            Class-specific shape?
Per-class (original)   (N, K, 28, 28)   High     ✅ better           ✅ yes
Class-agnostic         (N, 1, 28, 28)   Low      ⚠️ slightly lower   ❌ no

๐Ÿ› ๏ธ In Practice

  • Detectron2, MMDetection, and other frameworks offer class-agnostic mode.
  • Empirically, class-agnostic masks work well in many settings, especially with:

    • Few training samples per class
    • Lightweight models
    • Edge deployment

✅ Conclusion

Mask R-CNN does not fundamentally need K channels in the mask output.

It was a design choice for better accuracy and class-specific modeling. Class-agnostic heads are a valid, simpler, and often preferable alternative when scalability matters.


✅ Why BCE Works for Mask R-CNN

  • BCE is applied per-pixel, per-class, independently.
  • It models: "Is this pixel foreground for this class?" (binary yes/no).
  • Since we train only the GT class channel, BCE perfectly fits this logic.
  • It allows us to treat each mask as a separate binary segmentation task.

🚨 What Happens If You Use Softmax + CCE Anyway?

If you incorrectly apply softmax across channels:

  • It would force the network to produce a probability distribution over all classes per pixel.
  • Since only one class is being supervised (GT class), the model has no ground truth to supervise the other K-1 classes.
  • This leads to unstable training, false gradients, and degraded performance.

In short: Softmax + CCE is semantically wrong for per-RoI binary masks trained only for GT class. BCE is correct because it models the actual training behavior: 1 mask per GT class per RoI.


To redesign Mask R-CNN to use softmax over mask outputs with categorical cross-entropy (CCE), you must change how masks are predicted and supervised. This fundamentally alters the mask head architecture and training logic.


✅ Objective: Softmax Mask Prediction with CCE

We want each pixel in the predicted mask to output a single class label (as in semantic segmentation). This requires:

Requirement                 Current Mask R-CNN           Modified Design (Softmax)
Per-pixel prediction        Binary (1 per class)         Categorical (1-of-K)
Mask head output channels   K binary masks (B×K×28×28)   1 categorical map (B×K×28×28)
Supervised channels         Only GT class mask           All pixels with class label
Loss                        BCE on GT class              CCE over softmax(K)

🔧 Design Changes Required

1. Mask Head Output Remains (B, K, 28, 28)

No change to the output shape: we still predict K channels per RoI.

2. Supervision Must Change

You must now provide a categorical mask label for each pixel in the RoI crop. That is, instead of training a binary mask only for the GT class, you now train a full pixel-wise class map with values in [0, ..., K-1].

This is hard because:

  • Each RoI corresponds to only one object, which is of a single class.
  • There is no semantic reason for pixels within an RoI to have multiple class labels.

To make this work:

  • You'd need to merge overlapping masks and assign per-pixel class labels (like semantic segmentation).
  • This makes it no longer an instance segmentation problem, but semantic segmentation.

3. Loss Function

Replace BCE with softmax + cross-entropy:

import torch
import torch.nn as nn

# logits: (B, K, 28, 28); targets: (B, 28, 28) with values in [0, K-1]
logits, targets = torch.randn(4, 81, 28, 28), torch.randint(0, 81, (4, 28, 28))
loss = nn.CrossEntropyLoss()(logits, targets)  # softmax over the channel dim

4. Mask Target Construction

Instead of a binary mask per RoI:

  • You would need to project instance masks into a shared canvas per RoI.
  • And resolve overlaps into a single class label per pixel.

This contradicts the core instance-level logic of Mask R-CNN.


🚫 Why This is Usually Not Done

  • RoIs are defined per instance, not for the whole image.
  • There is no natural multi-class pixel label per RoI.
  • The model should only answer "Where is the mask for this object?", not "Which class does each pixel belong to?"; the latter is a semantic segmentation task, not instance segmentation.

✅ Summary

To support softmax + CCE for segmentation masks in Mask R-CNN:

Step   Change Required
1      Mask targets must be class labels per pixel
2      Mask loss becomes CCE instead of BCE
3      Every mask channel must be supervised jointly
4      Overlapping instances must be resolved by pixel class

But this destroys instance separation, which is the entire point of Mask R-CNN. This change essentially turns the model into a semantic segmentation network, not an instance segmentation one.


✅ Summary

  • RPN has its own cls + reg heads like SSD
  • Anchors matched using IoU, balanced by minibatch sampling (SSD uses hard negative mining)
  • RoIAlign allows precise fixed-size feature extraction
  • Only the GT class mask is supervised, so BCE is correct
  • Softmax + CCE would require all K masks to compete, which is not the design here