Of course! This is a fantastic topic that sits at the heart of machine learning. Let’s break down Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) with clear intuitions, mathematical derivations, and concrete examples from neural networks and computer vision.


The Core Idea: What Are We Estimating?

Imagine you have a model of the world—it could be a simple coin, a line fitting some data, or a complex neural network. This model has parameters that define its behavior. For a coin, the parameter is the probability of getting heads. For a line, it’s the slope and intercept. For a neural network, it’s the millions of weights and biases.

Our goal is to find the best possible values for these parameters based on data we’ve observed. MLE and MAP are two different philosophies for doing this.

  • MLE asks: What parameters make my observed data most probable?
  • MAP asks: Given my observed data and my prior beliefs about what the parameters should be, what are the most probable parameters?

The key difference is the “prior beliefs.” Let’s dive in.


Maximum Likelihood Estimation (MLE)

MLE is the more straightforward of the two. It trusts the data completely and finds the parameters that best explain that data, without any preconceived notions.

Intuition: The Biased Coin 🪙

Imagine you find a strange coin. You don’t know if it’s fair. To figure it out, you flip it 10 times and get the following sequence:

Heads, Tails, Heads, Heads, Heads, Tails, Heads, Heads, Tails, Heads (That’s 7 Heads and 3 Tails)

What is your best guess for the probability of this coin landing on heads, let’s call it $\theta$?

Your intuition probably screams 7/10 or 0.7. This intuitive answer is precisely the Maximum Likelihood Estimate. Why? Because if the true probability of heads were 0.7, the sequence you observed would be more likely than if the probability were 0.5, 0.6, or 0.8. You are choosing the parameter $\theta$ that maximizes the probability (the likelihood) of your observed data.
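As a quick sanity check, here is a minimal NumPy sketch (using the sequence above) that evaluates the Bernoulli likelihood of the whole i.i.d. sequence over a grid of candidate $\theta$ values and confirms the maximum lands at 0.7:

```python
import numpy as np

# Observed flips: 1 = heads, 0 = tails (7 heads, 3 tails)
flips = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 1])

# Candidate values of theta (probability of heads)
thetas = np.linspace(0.01, 0.99, 99)

# Likelihood of the whole i.i.d. sequence for each candidate theta:
# P(D | theta) = theta^(#heads) * (1 - theta)^(#tails)
heads, tails = flips.sum(), len(flips) - flips.sum()
likelihoods = thetas**heads * (1 - thetas)**tails

print(f"MLE on the grid: theta = {thetas[np.argmax(likelihoods)]:.2f}")  # 0.70
```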

Mathematical Formulation

Let’s formalize this.

  1. Define the Likelihood Function: Let our dataset be $D = \{x_1, x_2, \dots, x_N\}$ and our model parameters be $\theta$. The likelihood function $L(\theta \mid D)$ is defined as the probability of observing the data $D$ given the parameters $\theta$.

    \[L(\theta \mid D) = P(D \mid \theta)\]

    If we assume our data points are independent and identically distributed (i.i.d.), we can write this as a product:

    \[L(\theta \mid D) = \prod_{i=1}^{N} P(x_i \mid \theta)\]
  2. Maximize the Likelihood: The goal of MLE is to find the parameters $\hat{\theta}_{MLE}$ that maximize this function.

    \[\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta \mid D)\]
  3. Use the Log-Likelihood: Products of many probabilities are cumbersome to differentiate and prone to numerical underflow. Since the logarithm is a monotonically increasing function, maximizing a function is equivalent to maximizing its logarithm, which turns the product into a much friendlier sum.

    \[\hat{\theta}_{MLE} = \arg\max_{\theta} \log \left( \prod_{i=1}^{N} P(x_i \mid \theta) \right) = \arg\max_{\theta} \sum_{i=1}^{N} \log P(x_i \mid \theta)\]

    This is called the log-likelihood.
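As a side note on numerics, here is a tiny sketch (NumPy, with made-up per-sample probabilities) showing why the log matters in practice: the raw product of many probabilities underflows to zero, while the log-likelihood sum stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 hypothetical per-sample probabilities P(x_i | theta)
probs = rng.uniform(0.1, 0.9, size=10_000)

print(np.prod(probs))         # 0.0 -- the product underflows float64
print(np.sum(np.log(probs)))  # finite (on the order of -8e3), easy to optimize
```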

MLE and Neural Network Loss Functions

This is where the magic happens. Many common loss functions in machine learning are derived directly from the principle of MLE. Training a model is often just maximizing the log-likelihood (or, equivalently, minimizing the negative log-likelihood).

Example 1: Regression and Mean Squared Error (MSE)

Let’s say we’re training a neural network for a regression task (e.g., predicting house prices). Our network takes an input $x_i$ (features of a house) and predicts an output $f(x_i; \theta)$.

  • Assumption: We assume the true target value $y_i$ is our network’s prediction plus some Gaussian (normal) noise: $y_i \sim \mathcal{N}(f(x_i; \theta), \sigma^2)$. This is a very common and reasonable assumption, stating our model is “about right” but there’s some random error.

  • Likelihood: The probability density of observing a single data point $y_i$ is given by the Gaussian PDF:

    \[P(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i; \theta))^2}{2\sigma^2}\right)\]
  • Negative Log-Likelihood: Let’s find the total negative log-likelihood for our dataset $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$.

    \[\begin{aligned} -\log L(\theta \mid D) &= -\log \prod_{i=1}^{N} P(y_i \mid x_i; \theta)\\ &= -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)\\ &= -\sum_{i=1}^{N} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i; \theta))^2}{2\sigma^2}\right) \right)\\ &= -\sum_{i=1}^{N} \left( \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(y_i - f(x_i; \theta))^2}{2\sigma^2} \right)\\ &= \sum_{i=1}^{N} \frac{(y_i - f(x_i; \theta))^2}{2\sigma^2} + \sum_{i=1}^{N} \log\left(\sqrt{2\pi\sigma^2}\right) \end{aligned}\]
  • The Loss Function: We want to find the $\theta$ that minimizes this negative log-likelihood. The factor $\frac{1}{2\sigma^2}$ and the second sum are constants with respect to $\theta$, so we can drop them for the purpose of optimization. We are left with:

    \[\text{Loss}(\theta) \propto \sum_{i=1}^{N} (y_i - f(x_i; \theta))^2\]

    This is precisely the Sum of Squared Errors (SSE), which, when averaged, is the Mean Squared Error (MSE) loss function!

Key Takeaway: Using MSE for regression is equivalent to performing MLE under the assumption of i.i.d. Gaussian noise with a fixed variance.
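A small numerical check of this equivalence, as a sketch assuming NumPy and $\sigma = 1$ (the targets and predictions are made-up values): the Gaussian negative log-likelihood differs from the scaled sum of squared errors only by an additive constant.

```python
import numpy as np

sigma = 1.0
y = np.array([2.0, 0.5, -1.0])      # hypothetical targets
preds = np.array([1.8, 0.9, -0.7])  # hypothetical network outputs f(x_i; theta)

# Gaussian negative log-likelihood, summed term by term
nll = np.sum(0.5 * (y - preds)**2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2))

sse = np.sum((y - preds)**2)
const = len(y) * 0.5 * np.log(2 * np.pi * sigma**2)

print(np.isclose(nll, sse / (2 * sigma**2) + const))  # True
```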


Example 2: Classification and Cross-Entropy Loss

Let’s consider a multi-class classification problem (e.g., ImageNet). Our network’s final layer is a softmax function, which outputs a probability distribution over the classes. For an input $x_i$, the network outputs a vector of probabilities $p_i = [p_{i,1}, p_{i,2}, …, p_{i,C}]$, where $p_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$.

  • Assumption: The model’s output represents the parameters of a Categorical distribution.

  • Likelihood: Let the true label $y_i$ be a one-hot encoded vector (e.g., [0, 0, 1, 0] for class 3). The probability of observing this true label is simply the probability our model assigned to that class.

    \[P(y_i \mid x_i; \theta) = \prod_{c=1}^{C} p_{i,c}^{y_{i,c}}\]

    Since $y_{i,c}$ is 1 for the correct class and 0 for all others, this product just picks out the single correct probability.

  • Negative Log-Likelihood:

    \[\begin{aligned} -\log L(\theta \mid D) &= -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)\\ &= -\sum_{i=1}^{N} \log \left( \prod_{c=1}^{C} p_{i,c}^{y_{i,c}} \right)\\ &= -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}) \end{aligned}\]
  • The Loss Function: This final expression is exactly the definition of the Cross-Entropy Loss used in classification!

Key Takeaway: Minimizing cross-entropy loss is equivalent to performing MLE on your classification model.
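The equivalence is easy to verify numerically; here is a sketch assuming PyTorch (the logits and labels are made-up values):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 2 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2,  0.3]])
labels = torch.tensor([0, 1])  # true class indices

# Manual NLL: -sum_c y_{i,c} log(p_{i,c}) picks out the log-probability
# of the true class for each sample; average over the batch.
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), labels]).mean()

# PyTorch's cross-entropy computes the same quantity from the raw logits.
print(torch.isclose(manual, F.cross_entropy(logits, labels)))  # tensor(True)
```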


Maximum a Posteriori (MAP) Estimation

MAP starts with MLE but adds a crucial ingredient: a prior belief. It doesn’t trust the data blindly, especially when the data is scarce.

Intuition: The Suspicious Coin 🤔

Let’s go back to the coin flip. You flipped it 10 times and got 7 heads. MLE says the probability of heads is 0.7.

But what if I told you the coin came from the national mint, which has extremely strict quality control, and you have a very strong prior belief that the coin is fair ($\theta=0.5$)? Would a mere 10 flips be enough to convince you it’s a trick coin? Probably not. You might think, “Well, 7 out of 10 is a bit skewed, but it’s likely due to random chance. My best guess is still close to 0.5, maybe 0.55.”

This is the MAP mindset. It finds a balance between your prior belief (the coin is fair) and the evidence from the data (7 heads out of 10). The final estimate is a compromise, “pulled” from the MLE estimate toward the prior.

Mathematical Formulation

MAP uses Bayes’ theorem to update our beliefs.

  1. Bayes’ Theorem: It connects the posterior probability (what we want) to the likelihood and the prior.

    \[\underbrace{P(\theta \mid D)}_{\text{Posterior}} = \frac{\overbrace{P(D \mid \theta)}^{\text{Likelihood}} \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(D)}_{\text{Evidence}}}\]
  2. Maximize the Posterior: The goal of MAP is to find the parameters that maximize the posterior probability.

    \[\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta) P(\theta)}{P(D)}\]

    Since the evidence $P(D)$ doesn’t depend on $\theta$, we can drop it from the maximization:

    \[\hat{\theta}_{MAP} = \arg\max_{\theta} P(D \mid \theta) P(\theta)\]
  3. Use the Log-Posterior: Again, we use logs to make the math easier.

    \[\hat{\theta}_{MAP} = \arg\max_{\theta} [\log P(D \mid \theta) + \log P(\theta)]\]

Notice the structure of this formula: the MAP estimate maximizes the log-likelihood (the MLE objective) plus the log of the prior.
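To make this concrete, here is a sketch that applies the log-posterior formula to the suspicious coin, assuming NumPy/SciPy and a hypothetical Beta(50, 50) prior (a strong “the mint makes fair coins” belief):

```python
import numpy as np
from scipy.stats import beta

heads, tails = 7, 3
thetas = np.linspace(0.01, 0.99, 981)  # grid of candidate parameters

# Bernoulli log-likelihood of 7 heads and 3 tails
log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)

# Strong symmetric prior centered at theta = 0.5
log_prior = beta(50, 50).logpdf(thetas)

print(f"MLE: {thetas[np.argmax(log_lik)]:.3f}")              # 0.700
print(f"MAP: {thetas[np.argmax(log_lik + log_prior)]:.3f}")  # ~0.519
```

In closed form, the posterior here is Beta(57, 53), whose mode $\frac{56}{108} \approx 0.519$ matches the grid search: the estimate is pulled from the MLE’s 0.7 toward the prior’s 0.5, exactly the compromise described in the intuition above.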

MAP and Neural Network Loss Functions (Regularization!)

The prior belief term, $\log P(\theta)$, is what introduces regularization into our loss functions. Regularization is a technique used to prevent overfitting by penalizing complex models.

Example 1: Regression and L2 Regularization (Ridge)

Let’s revisit our regression problem.

  • Likelihood: Same as before, leading to the MSE term.
  • Prior Belief: We’ll now add a prior on the network weights $\theta$. A very common choice is a Gaussian prior, centered at zero: $\theta \sim \mathcal{N}(0, \beta^2 I)$. This expresses a belief that most weights should be small and close to zero. This is a great way to encourage a “simpler” model.

  • Log-Prior: The PDF for this prior is $P(\theta) \propto \exp\left(-\frac{\|\theta\|_2^2}{2\beta^2}\right)$, where $\|\theta\|_2^2 = \sum_j \theta_j^2$ is the squared L2-norm. The log of the prior is:

    \[\log P(\theta) = -\frac{\|\theta\|_2^2}{2\beta^2} + \text{constant}\]
  • The Loss Function: To get our final MAP objective, we minimize the negative log-posterior.

    \[\begin{aligned} \text{Loss}(\theta) &= -\log P(D \mid \theta) - \log P(\theta)\\ &\propto \left[ \sum_{i=1}^{N} (y_i - f(x_i; \theta))^2 \right] + \left[ \frac{\sigma^2}{\beta^2} \|\theta\|_2^2 \right] \end{aligned}\]

    If we define a regularization strength $\lambda = \frac{\sigma^2}{\beta^2}$, we get:

    \[\text{Loss}(\theta) = \underbrace{\sum_{i=1}^{N} (y_i - f(x_i; \theta))^2}_{\text{MSE (from Likelihood)}} + \underbrace{\lambda \|\theta\|_2^2}_{\text{L2 Regularization (from Prior)}}\]

    This is the classic loss function for Ridge Regression or a neural network trained with L2 Regularization (also called weight decay).

Key Takeaway: Adding an L2 regularization term to your loss is mathematically equivalent to performing MAP estimation with a Gaussian prior on your model’s weights.
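Here is a sketch of this correspondence in training code, assuming PyTorch (`model`, `lam`, and the batch are placeholders): the MAP objective can be written explicitly as likelihood term plus prior term.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # stand-in for any network f(x; theta)
x, y = torch.randn(8, 4), torch.randn(8, 1)  # hypothetical batch

lam = 0.01  # regularization strength lambda = sigma^2 / beta^2 (made-up value)

# MAP objective: negative log-likelihood (MSE term) + negative log-prior (L2 penalty)
mse = nn.functional.mse_loss(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = mse + lam * l2_penalty
loss.backward()

# For plain SGD, the same gradient effect comes from the optimizer's weight_decay
# argument (note the factor of 2 from differentiating the squared norm):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=2 * lam)
```

One caveat on the optimizer shortcut: for plain SGD, `weight_decay` is equivalent to the explicit L2 penalty, but for adaptive optimizers like Adam the decoupled variant (AdamW) behaves differently from an L2 term added to the loss.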


Computer Vision Example: Overfitting Classifier

Imagine you’re training a Convolutional Neural Network (CNN) to classify dogs vs. cats, but you only have 50 training images.

  • MLE Approach (e.g., Cross-Entropy Loss only): The network might achieve 100% accuracy on your 50 images. It could do this by memorizing them perfectly, learning spurious features like “the dog in image #3 has a red collar” or “the cat in image #8 is on a specific green blanket.” The network’s weights might become very large and highly specialized. When you show it a new image of a dog without a red collar, it will likely fail. This is overfitting.

  • MAP Approach (e.g., Cross-Entropy + L2 Regularization): The L2 regularization term (our Gaussian prior) penalizes large weights. The model is now incentivized to not only classify the training data correctly (the likelihood term) but also to keep its weights small (the prior term). This forces it to find simpler, more general patterns (like “pointy ears,” “whiskers,” “snout shape”) instead of memorizing specific pixels. The resulting model will have lower training accuracy (maybe 96%) but will generalize much better to new, unseen images.
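In training code, the difference between these two mindsets can be as small as one argument; here is a minimal sketch assuming PyTorch/torchvision (the learning rate and decay value are illustrative, not tuned):

```python
import torch
import torchvision

# A standard CNN backbone repurposed for the 2-class dogs-vs-cats problem
model = torchvision.models.resnet18(num_classes=2)
criterion = torch.nn.CrossEntropyLoss()  # the likelihood (MLE) term

# MLE-style training: trust the 50 images completely
mle_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# MAP-style training: same likelihood, plus a Gaussian prior on the weights
map_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)
```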


MLE vs. MAP: Summary

| Feature | Maximum Likelihood Estimation (MLE) | Maximum a Posteriori (MAP) |
| --- | --- | --- |
| Philosophy | Frequentist. Finds parameters that make the data most likely. | Bayesian. Finds the most probable parameters given the data and a prior belief. |
| Formula | $\arg\max_{\theta} P(D \mid \theta)$ | $\arg\max_{\theta} P(D \mid \theta) P(\theta)$ |
| Neural Networks | Corresponds to standard loss functions like MSE and Cross-Entropy. | Corresponds to loss functions with regularization (e.g., L2, L1). |
| Overfitting | More prone to overfitting, especially with small datasets. | Less prone to overfitting, as the prior “regularizes” the solution. |
| Data Size | As data size → ∞, the MAP estimate converges to the MLE estimate. | The prior has the strongest influence when the dataset is small; its influence gets washed out by the overwhelming evidence as the data grows. |

By understanding MLE and MAP, you gain a much deeper insight into why we use the loss functions and regularization techniques that are so fundamental to modern machine learning and computer vision. You’re not just throwing L2(0.01) into your code; you’re applying a Gaussian prior belief to your model’s weights to find a more robust solution.