Motion Estimation and Compensation in Video Coding: A Comprehensive Guide
Introduction
Motion estimation and motion compensation are critical components in video compression algorithms. They exploit temporal redundancies between consecutive frames in a video sequence to reduce the amount of data required for efficient storage and transmission. By predicting the motion of objects from one frame to another, we can represent a video more compactly without significantly compromising visual quality.
This guide provides an in-depth exploration of motion estimation and compensation, covering:
- Mathematical foundations
- Detailed algorithms with step-by-step explanations
- Code implementations with inline comments
- Flowcharts illustrating the algorithms
- Application of motion estimation in video coding
Table of Contents
- Mathematical Background
- Motion Estimation Algorithms
- Implementation of Algorithms
- Flowcharts of Algorithms
- Motion Estimation in Video Coding
- Conclusion
- References
1. Mathematical Background
Block-Based Motion Estimation
In video sequences, motion estimation is typically performed on a block-by-block basis. An image frame is divided into non-overlapping blocks of pixels (e.g., 16x16, 8x8). The goal is to find, for each block in the current frame, the best matching block within a search window in a reference frame (usually the previous frame).
Motion Vector (MV): The displacement between the position of the block in the current frame and the position of the best matching block in the reference frame.
Error Metrics
To determine the best match, an error metric quantifies the similarity between blocks. Commonly used error metrics include:
- Sum of Absolute Differences (SAD):
\[\text{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} |C(i, j) - R(i + \delta x, j + \delta y)|\]
- Mean Squared Error (MSE):
\[\text{MSE} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} [C(i, j) - R(i + \delta x, j + \delta y)]^2\]
- Peak Signal-to-Noise Ratio (PSNR):
\[\text{PSNR} = 10 \log_{10}\left( \frac{L^2}{\text{MSE}} \right)\]
where \(L\) is the maximum possible pixel value (e.g., 255 for 8-bit images).
Variables:
- \(C(i, j)\): Pixel value at position \((i, j)\) in the current block.
- \(R(i + \delta x, j + \delta y)\): Pixel value at the corresponding position in the reference block shifted by motion vector \((\delta x, \delta y)\).
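These metrics translate directly into a few lines of NumPy. Below is a minimal sketch; the helper names sad, mse, and psnr are our own, and the parameter peak plays the role of \(L\):
import numpy as np

def sad(current_block, reference_block):
    # Sum of Absolute Differences between two equally sized blocks
    diff = current_block.astype(float) - reference_block.astype(float)
    return np.sum(np.abs(diff))

def mse(current_block, reference_block):
    # Mean Squared Error between two equally sized blocks
    diff = current_block.astype(float) - reference_block.astype(float)
    return np.mean(diff ** 2)

def psnr(current_block, reference_block, peak=255.0):
    # Peak Signal-to-Noise Ratio in dB; peak is L, e.g., 255 for 8-bit images
    error = mse(current_block, reference_block)
    return float('inf') if error == 0 else 10 * np.log10(peak ** 2 / error)
SAD avoids multiplications, which makes it popular in practice; the implementations below use MSE for consistency with the explanations.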
2. Motion Estimation Algorithms
Motion estimation algorithms aim to efficiently find the best matching block in the reference frame. We’ll discuss several algorithms:
- Exhaustive Search (Full Search)
- Three-Step Search
- Diamond Search
- Hierarchical (Multi-Resolution) Search
2.1. Exhaustive Search (Full Search) Algorithm
Description:
- The most straightforward method.
- Searches every possible position within the search window.
- Guarantees the optimal match but is computationally expensive.
Algorithm Steps:
- For each block in the current frame:
- Initialize minimum error to infinity.
- For each candidate block within the search window in the reference frame:
- Compute the error metric (e.g., SAD).
- If the computed error is less than the minimum error:
- Update the minimum error.
- Store the motion vector corresponding to the candidate block.
2.2. Three-Step Search Algorithm
Description:
- Reduces computational complexity compared to exhaustive search.
- Uses a coarse-to-fine search strategy.
Algorithm Steps:
- Initialization:
- Set the initial step size \(s = 4\) (more generally, about half the maximum search range, rounded to a power of two).
- Start at the center of the search window.
- First Step:
- Evaluate the nine points around the center (including the center) with displacements \(\{(-s, -s), (-s, 0), (-s, s), (0, -s), (0, 0), (0, s), (s, -s), (s, 0), (s, s)\}\).
- Find the point with the minimum error.
- Second Step:
- Reduce the step size \(s = s / 2\).
- Center the search around the best point from the previous step.
- Evaluate the nine points as before.
- Third Step:
- Repeat the halving and nine-point evaluation; the search terminates after the pass with \(s = 1\).
2.3. Diamond Search Algorithm
Description:
- Further reduces computations compared to three-step search.
- Uses a diamond-shaped pattern for searching.
Algorithm Steps:
- Large Diamond Search Pattern (LDSP):
- Use a diamond-shaped pattern to search around the current point.
- Search Process:
- Start at the center.
- Apply LDSP.
- If the minimum error point is at the center:
- Switch to Small Diamond Search Pattern (SDSP).
- Else:
- Move the center to the minimum error point and repeat LDSP.
- Termination:
- When SDSP is applied, and the minimum error point is at the center, the search ends.
2.4. Hierarchical (Multi-Resolution) Search Algorithm
Description:
- Uses multiple resolutions of the frames (pyramids).
- Coarse motion is estimated at lower resolutions.
- Refinement is done at higher resolutions.
Algorithm Steps:
- Construct Image Pyramids:
- Create a Gaussian pyramid for both current and reference frames.
- Coarse Level Search:
- At the lowest resolution, perform motion estimation using a simple algorithm.
- Refinement:
- Use the estimated motion vectors as a starting point for the next higher resolution.
- Refine the motion vectors at each level.
3. Implementation of Algorithms
Let’s implement each algorithm using Python and NumPy. We will use grayscale images for simplicity.
Common Setup
import numpy as np
import matplotlib.pyplot as plt
from skimage import io, color
# Load frames
frame1 = io.imread('frame1.png') # Reference frame
frame2 = io.imread('frame2.png') # Current frame
# Convert to grayscale
frame1_gray = color.rgb2gray(frame1)
frame2_gray = color.rgb2gray(frame2)
# Ensure frames are the same size
assert frame1_gray.shape == frame2_gray.shape, "Frames must be the same size"
# Parameters
block_size = 16
search_range = 7 # Maximum displacement in pixels
height, width = frame1_gray.shape
# Initialize motion vector field
mv_field = np.zeros((height // block_size, width // block_size, 2), dtype=int)
3.1. Exhaustive Search Implementation
# Exhaustive Search Motion Estimation
for i in range(0, height - block_size + 1, block_size):
    for j in range(0, width - block_size + 1, block_size):
        # Current block
        current_block = frame2_gray[i:i+block_size, j:j+block_size]
        min_mse = float('inf')
        best_mv = (0, 0)
        # Search window boundaries in the reference frame
        i_min = max(i - search_range, 0)
        i_max = min(i + search_range, height - block_size)
        j_min = max(j - search_range, 0)
        j_max = min(j + search_range, width - block_size)
        # Exhaustive search over the search window
        for m in range(i_min, i_max + 1):
            for n in range(j_min, j_max + 1):
                reference_block = frame1_gray[m:m+block_size, n:n+block_size]
                mse = np.mean((current_block - reference_block) ** 2)
                if mse < min_mse:
                    min_mse = mse
                    best_mv = (m - i, n - j)
        mv_field[i // block_size, j // block_size] = best_mv
Explanation:
- For each block in the current frame, we search all possible positions within the search window in the reference frame.
- We compute the Mean Squared Error (MSE) between the current block and each candidate block.
- We select the motion vector corresponding to the candidate block with the minimum MSE.
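As a quick sanity check, one can assemble the motion-compensated prediction of the current frame from mv_field and measure its quality. A minimal sketch follows; the names predicted, pred_mse, and pred_psnr are our own, and frame dimensions are assumed to be multiples of block_size, as in the setup above:
# Build the motion-compensated prediction from the estimated vectors
predicted = np.zeros_like(frame2_gray)
for i in range(0, height - block_size + 1, block_size):
    for j in range(0, width - block_size + 1, block_size):
        dm, dn = mv_field[i // block_size, j // block_size]
        # Copy the matched block from the reference frame
        predicted[i:i+block_size, j:j+block_size] = frame1_gray[i+dm:i+dm+block_size, j+dn:j+dn+block_size]
# rgb2gray returns values in [0, 1], so the peak value L is 1.0
pred_mse = np.mean((frame2_gray - predicted) ** 2)
pred_psnr = 10 * np.log10(1.0 / pred_mse)
print(f"Prediction PSNR: {pred_psnr:.2f} dB")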
3.2. Three-Step Search Implementation
# Three-Step Search Motion Estimation
for i in range(0, height - block_size + 1, block_size):
    for j in range(0, width - block_size + 1, block_size):
        # Current block
        current_block = frame2_gray[i:i+block_size, j:j+block_size]
        center_m, center_n = i, j
        # Initial step size: about half the search range, rounded to a power of two
        step_size = max(2 ** (int(np.ceil(np.log2(search_range))) - 1), 1)
        while step_size >= 1:
            min_mse = float('inf')
            best_m, best_n = center_m, center_n
            # Evaluate the nine points around the current center
            for m in range(-step_size, step_size + 1, step_size):
                for n in range(-step_size, step_size + 1, step_size):
                    ref_m = center_m + m
                    ref_n = center_n + n
                    if (0 <= ref_m <= height - block_size) and (0 <= ref_n <= width - block_size):
                        reference_block = frame1_gray[ref_m:ref_m+block_size, ref_n:ref_n+block_size]
                        mse = np.mean((current_block - reference_block) ** 2)
                        if mse < min_mse:
                            min_mse = mse
                            best_m, best_n = ref_m, ref_n
            center_m, center_n = best_m, best_n
            step_size = step_size // 2
        mv_field[i // block_size, j // block_size] = (center_m - i, center_n - j)
Explanation:
- Starts with a large step size and progressively reduces it.
- At each step, searches the 9 points around the current center.
- Updates the center to the point with the minimum MSE.
- Halves the step size after each pass and stops once the pass with step size 1 has completed.
3.3. Diamond Search Implementation
# Diamond Search Motion Estimation
# Large Diamond Search Pattern: center plus eight points at distance 2
LDSP = [(0, -2), (-1, -1), (1, -1), (-2, 0), (0, 0), (2, 0), (-1, 1), (1, 1), (0, 2)]
# Small Diamond Search Pattern: center plus four points at distance 1
SDSP = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]
for i in range(0, height - block_size + 1, block_size):
    for j in range(0, width - block_size + 1, block_size):
        # Current block
        current_block = frame2_gray[i:i+block_size, j:j+block_size]
        center_m, center_n = i, j
        pattern = LDSP
        search = True
        while search:
            min_mse = float('inf')
            best_m, best_n = center_m, center_n
            for (dm, dn) in pattern:
                ref_m = center_m + dm
                ref_n = center_n + dn
                if (0 <= ref_m <= height - block_size) and (0 <= ref_n <= width - block_size):
                    reference_block = frame1_gray[ref_m:ref_m+block_size, ref_n:ref_n+block_size]
                    mse = np.mean((current_block - reference_block) ** 2)
                    if mse < min_mse:
                        min_mse = mse
                        best_m, best_n = ref_m, ref_n
            if (best_m == center_m) and (best_n == center_n):
                if pattern is LDSP:
                    # Best match at the center: refine with the small pattern
                    pattern = SDSP
                else:
                    search = False
            else:
                center_m, center_n = best_m, best_n
        mv_field[i // block_size, j // block_size] = (center_m - i, center_n - j)
Explanation:
- Uses two patterns: Large Diamond Search Pattern (LDSP) and Small Diamond Search Pattern (SDSP).
- Begins with LDSP; if the best match is at the center, switches to SDSP.
- Continues until the center point is the best match in SDSP.
3.4. Hierarchical Search Implementation
from skimage.transform import pyramid_gaussian
# Create Gaussian pyramids (level 0 is full resolution)
pyramid_levels = 3
pyramid_frame1 = tuple(pyramid_gaussian(frame1_gray, max_layer=pyramid_levels - 1, downscale=2))
pyramid_frame2 = tuple(pyramid_gaussian(frame2_gray, max_layer=pyramid_levels - 1, downscale=2))
# One motion vector field per pyramid level, filled from coarsest to finest
mv_field_pyramid = [None] * pyramid_levels
for level in reversed(range(pyramid_levels)):
    frame1_level = pyramid_frame1[level]
    frame2_level = pyramid_frame2[level]
    height_level, width_level = frame1_level.shape
    mv_field_level = np.zeros((height_level // block_size, width_level // block_size, 2), dtype=int)
    for i in range(0, height_level - block_size + 1, block_size):
        for j in range(0, width_level - block_size + 1, block_size):
            # Current block
            current_block = frame2_level[i:i+block_size, j:j+block_size]
            if level == pyramid_levels - 1:
                # Coarsest level: exhaustive search over a reduced window
                min_mse = float('inf')
                best_mv = (0, 0)
                search_range_level = max(search_range // (2 ** level), 1)
                i_min = max(i - search_range_level, 0)
                i_max = min(i + search_range_level, height_level - block_size)
                j_min = max(j - search_range_level, 0)
                j_max = min(j + search_range_level, width_level - block_size)
                for m in range(i_min, i_max + 1):
                    for n in range(j_min, j_max + 1):
                        reference_block = frame1_level[m:m+block_size, n:n+block_size]
                        mse = np.mean((current_block - reference_block) ** 2)
                        if mse < min_mse:
                            min_mse = mse
                            best_mv = (m - i, n - j)
                mv_field_level[i // block_size, j // block_size] = best_mv
            else:
                # Finer levels: scale up the coarser-level vector and refine around it
                prev_field = mv_field_pyramid[level + 1]
                pi = min(i // (2 * block_size), prev_field.shape[0] - 1)
                pj = min(j // (2 * block_size), prev_field.shape[1] - 1)
                prev_mv = prev_field[pi, pj] * 2
                center_m = i + prev_mv[0]
                center_n = j + prev_mv[1]
                min_mse = float('inf')
                best_mv = (0, 0)
                search_range_level = 1
                for dm in range(-search_range_level, search_range_level + 1):
                    for dn in range(-search_range_level, search_range_level + 1):
                        ref_m = center_m + dm
                        ref_n = center_n + dn
                        if (0 <= ref_m <= height_level - block_size) and (0 <= ref_n <= width_level - block_size):
                            reference_block = frame1_level[ref_m:ref_m+block_size, ref_n:ref_n+block_size]
                            mse = np.mean((current_block - reference_block) ** 2)
                            if mse < min_mse:
                                min_mse = mse
                                best_mv = (ref_m - i, ref_n - j)
                mv_field_level[i // block_size, j // block_size] = best_mv
    mv_field_pyramid[level] = mv_field_level
# The final motion vector field is the one at level 0 (full resolution)
mv_field = mv_field_pyramid[0]
Explanation:
- Constructs Gaussian pyramids of the frames.
- Performs motion estimation starting from the coarsest level.
- At each finer level, refines the motion vectors obtained from the previous level.
- Motion vectors are scaled appropriately when moving between levels.
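Whichever algorithm produced it, the motion vector field can be inspected visually by overlaying arrows on the current frame; each arrow points from a block's position in the current frame toward its matched position in the reference frame. A minimal sketch using the matplotlib import from the common setup:
# Arrow origins at block centers
ys, xs = np.mgrid[0:height // block_size, 0:width // block_size]
centers_x = xs * block_size + block_size // 2
centers_y = ys * block_size + block_size // 2
plt.imshow(frame2_gray, cmap='gray')
# mv_field[..., 0] is the vertical displacement, mv_field[..., 1] the horizontal one
plt.quiver(centers_x, centers_y, mv_field[..., 1], mv_field[..., 0],
           color='red', angles='xy', scale_units='xy', scale=1)
plt.title('Block motion vectors')
plt.axis('off')
plt.show()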
4. Flowcharts of Algorithms
4.1. Exhaustive Search Flowchart
Start
|
For each block in current frame
|
For each candidate position in search window
|
Compute error metric (e.g., SAD)
|
If error < min_error
|
Update min_error and best_mv
|
Store best_mv for the block
|
End
4.2. Three-Step Search Flowchart
Start
|
For each block in current frame
|
Initialize step_size = max_search_range / 2
|
While step_size >= 1
|
Evaluate 9 points around center
|
Find point with min_error
|
Update center to best point
|
step_size = step_size / 2
|
Store best_mv for the block
|
End
4.3. Diamond Search Flowchart
Start
|
For each block in current frame
|
Initialize center position
|
Set pattern = LDSP
|
While searching
|
Apply pattern around center
|
Find point with min_error
|
If center is best point
|
If pattern == LDSP
|
Switch to SDSP
|
Else
|
Search = False
|
Else
|
Update center to best point
|
Store best_mv for the block
|
End
4.4. Hierarchical Search Flowchart
Start
|
Construct image pyramids
|
For level from coarsest to finest
|
If coarsest level
|
Perform motion estimation (e.g., exhaustive)
|
Else
|
Use previous level's mv as starting point
|
Refine motion vectors
|
Store mv for the level
|
End
5. Motion Estimation in Video Coding
Motion estimation is integral to video coding standards like MPEG, H.264/AVC, HEVC, and beyond. It enables efficient inter-frame compression by predicting frames based on motion information from reference frames.
5.1. Steps in Video Encoding with Motion Estimation
- Frame Partitioning:
- Divide the current frame into macroblocks (e.g., 16x16 pixels).
- Motion Estimation:
- For each macroblock, perform motion estimation to find the best matching block in the reference frame.
- Determine the motion vector.
- Motion Compensation:
- Use the motion vector to predict the macroblock from the reference frame.
- Compute the residual (difference) between the actual macroblock and the predicted macroblock. (Motion compensation, the residual, and the transform/quantization steps are sketched in code after this list.)
- Transform and Quantization:
- Apply a transform (e.g., DCT) to the residual.
- Quantize the transform coefficients to reduce precision and achieve compression.
- Entropy Coding:
- Encode the quantized coefficients and motion vectors using entropy coding techniques (e.g., Huffman coding, arithmetic coding).
- Reconstruction for Reference Frames:
- In the decoder, reconstruct the macroblock by inverse quantization and inverse transform.
- Add the reconstructed residual to the predicted macroblock.
- Store the reconstructed frame for future predictions.
- Output Bitstream:
- Combine encoded motion vectors and residual data into a compressed bitstream.
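To make the motion compensation, residual, and transform/quantization steps concrete, here is a minimal sketch that wires the mv_field from Section 3 into a toy encode/decode loop for a single frame. It is illustrative only: q_step is an arbitrary uniform quantizer step chosen for this example, real codecs use integer approximations of the DCT with standard-specific quantization, and entropy coding is omitted. The 2-D DCT comes from scipy.fft.
from scipy.fft import dctn, idctn

q_step = 0.02  # uniform quantizer step (arbitrary illustrative value)
reconstructed = np.zeros_like(frame2_gray)
for i in range(0, height - block_size + 1, block_size):
    for j in range(0, width - block_size + 1, block_size):
        dm, dn = mv_field[i // block_size, j // block_size]
        # Motion compensation: predict the block from the reference frame
        predicted_block = frame1_gray[i+dm:i+dm+block_size, j+dn:j+dn+block_size]
        # Residual between the actual and the predicted block
        residual = frame2_gray[i:i+block_size, j:j+block_size] - predicted_block
        # Transform and quantize; these coefficients, together with the motion
        # vector, are what an encoder would entropy-code into the bitstream
        coeffs = dctn(residual, norm='ortho')
        quantized = np.round(coeffs / q_step)
        # Decoder side: dequantize, inverse transform, add the prediction
        rec_residual = idctn(quantized * q_step, norm='ortho')
        reconstructed[i:i+block_size, j:j+block_size] = predicted_block + rec_residual
The reconstructed frame is what the decoder (and the encoder's internal decoding loop) stores as a reference for predicting subsequent frames.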
Flowchart of Video Encoding with Motion Estimation:
Input Frame
|
[Frame Partitioning]
|
Divide into Macroblocks
|
For each Macroblock:
|
[Motion Estimation]
|
Determine Motion Vector
|
[Motion Compensation]
|
Predict Macroblock
|
[Compute Residual]
|
Residual = Actual - Predicted
|
[Transform (e.g., DCT)]
|
[Quantization]
|
[Entropy Coding]
|
Encode Motion Vector and Quantized Coefficients
|
[Reconstruction]
|
Inverse Quantization and Transform
|
Add Predicted Macroblock
|
Store Reconstructed Frame
|
[Output Bitstream]
6. Conclusion
Motion estimation and compensation are fundamental techniques in video compression, enabling efficient encoding by leveraging temporal redundancies. Various algorithms offer trade-offs between computational complexity and estimation accuracy. Understanding these algorithms and their implementations is crucial for developing efficient video coding systems and improving existing standards.
7. References
- Richardson, Iain E. G. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. John Wiley & Sons, 2004.
- Sullivan, Gary J., et al. “Overview of the High Efficiency Video Coding (HEVC) Standard.” IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012): 1649-1668.
- Rao, K. R., D. N. Kim, and J. J. Hwang. Video Coding Standards: AVS China, H.264/MPEG-4 Part 10, HEVC, VP6, DIRAC and VC-1. Springer, 2014.
- Tourapis, Alexis Michael. “Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation.” Visual Communications and Image Processing 2002. Vol. 4671. International Society for Optics and Photonics, 2002.
Note: The code examples provided are for educational purposes and may not be optimized for performance. In practice, optimized algorithms and hardware accelerations are used to achieve real-time video encoding.