Deep Learning Chapter 8.1: How Learning Differs from Pure Optimization
Pure Optimization vs. Machine Learning
Pure Optimization
The objective of pure optimization:
\[ \min_{\theta}J(\theta) \]
In pure optimization, the objective function is fully specified and usually has an exact minimizer. For example, \(\min_x f(x)=(x-3)^2\) has the solution \(x=3\).
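As a minimal sketch of solving such a problem numerically, gradient descent on \(f(x)=(x-3)^2\) (gradient \(f'(x)=2(x-3)\)) converges to the known solution; the learning rate and step count here are arbitrary choices:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Iterates contract toward the unique minimizer x = 3.
def minimize(lr=0.1, steps=100):
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)
        x -= lr * grad
    return x

print(round(minimize(), 4))  # -> 3.0
```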
Machine Learning
In machine learning, we optimize the empirical risk:
\[ J(\theta)=\mathbb{E}_{(x,y)\sim \hat{p}_{data}}L(f(x;\theta),y) \tag{8.1} \]
where:
- \(L\) is the per-example loss function
- \(f(x;\theta)\) is the model's prediction for input \(x\)
- \(\hat{p}_{data}\) is the empirical distribution over the training data
In machine learning, our true goal is to minimize the expected loss on the true distribution, but we can only minimize the empirical loss on the training set. Thus, the optimization problem is only an approximation — and it often has no exact solution.
\(\hat{p}_{data}\) approximates the true data distribution \(p_{data}\) by assigning equal probability to each training sample.
Expected Risk
Generally, our ultimate goal is to minimize the expected loss under the true data distribution \(p_{data}\), rather than the empirical distribution \(\hat{p}_{data}\) estimated from the training data.
\[ J^*(\theta)=\mathbb{E}_{(x,y)\sim {p}_{data}}L(f(x;\theta),y) \tag{8.2} \]
Empirical Risk Minimization
The challenge in machine learning is that we only have access to the empirical distribution \(\hat{p}_{data}(x, y)\) derived from the training set, but the true data distribution \(p_{data}(x, y)\) remains unknown.
The simplest way forward is to minimize the empirical risk, the average loss over the \(m\) training examples:
\[ \mathbb{E}_{(x,y)\sim \hat{p}_{data}}[L(f(x;\theta),y)]=\frac{1}{m}\sum_{i=1}^m L(f(x^i;\theta),y^i) \tag{8.3} \]
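The right-hand side of Eq. 8.3 is just a sample mean, which can be sketched directly; the linear model \(f(x;\theta)=\theta x\) and squared-error loss below are illustrative assumptions, not choices made in the text:

```python
# Empirical risk (Eq. 8.3): the average per-example loss over the training
# set, here with an assumed linear model f(x; theta) = theta * x and
# squared-error loss L(f, y) = (f - y)^2.
def empirical_risk(theta, xs, ys):
    losses = [(theta * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data with y = 2x
print(empirical_risk(2.0, xs, ys))  # -> 0.0 at the true parameter
```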
However, some useful loss functions cannot be directly used in deep learning. For example, the 0–1 loss is discrete and non-differentiable, which makes it impossible to optimize using gradient-based methods. In such cases, we replace it with a proxy (or surrogate) loss function.
Surrogate Loss Functions and Early Stopping
Motivation
Some loss functions that truly reflect the model’s performance (such as the 0–1 loss for classification accuracy) cannot be optimized efficiently — they are discrete, non-differentiable, and sometimes even NP-hard to minimize.
Therefore, in practice, we replace them with surrogate (proxy) loss functions that are smooth and differentiable, making them compatible with gradient-based optimization.
Surrogate Loss Functions
A surrogate loss serves as a stand-in objective for the true but intractable loss. It not only makes optimization feasible but often brings useful properties such as smoother gradients or better generalization. A related difference from pure optimization is the stopping criterion: training typically halts when an early stopping condition based on the true underlying loss (e.g., the 0–1 loss on a validation set) is satisfied, often while the surrogate loss still has large derivatives, rather than at a minimum of the surrogate itself.
Generated by ChatGPT:
| True Objective | Surrogate Loss Used | Notes |
|---|---|---|
| 0–1 loss (classification error) | Logistic loss / Cross-Entropy | differentiable, smooth, convex |
| Margin-based classification | Hinge loss (SVM) | encourages large margins |
| Likelihood maximization | Negative log-likelihood | probabilistic interpretation |
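The first row of the table can be made concrete: the 0–1 loss is piecewise constant in the model's score \(s\), so its gradient is zero almost everywhere, while the logistic surrogate is smooth and strictly decreasing in the margin \(ys\). A small illustrative sketch (the score-based formulation is an assumption for illustration):

```python
import math

# The 0-1 loss is flat almost everywhere (gradient 0), so gradient methods
# cannot use it. The logistic surrogate log(1 + exp(-y*s)) is smooth and
# decreases as the margin y*s grows, giving a usable descent direction.
def zero_one_loss(y, s):
    return 0.0 if y * s > 0 else 1.0

def logistic_loss(y, s):
    return math.log1p(math.exp(-y * s))

for s in (-2.0, -0.1, 0.1, 2.0):
    print(s, zero_one_loss(1, s), round(logistic_loss(1, s), 3))
```

Note how the 0–1 loss jumps from 1 to 0 at \(s=0\) while the logistic loss changes smoothly across the whole range.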
Batch and Mini-Batch Algorithms
In machine learning, the optimization algorithm updates the parameters using an estimate of the expectation over a batch of training samples.
For example, the maximum likelihood estimation:
\[ \theta_{ML}=\arg\max_{\theta}\sum_{i=1}^m \log p_{model}(x^i,y^i;\theta) \tag{8.4} \]
Maximizing the true expectation is approximated by maximizing the expectation over the empirical training data:
\[ J(\theta)=\mathbb{E}_{x,y \sim \hat{p}_{data}}\log p_{model}(x,y;\theta) \tag{8.5} \]
The gradient is:
\[ \nabla_{\theta} J(\theta)=\mathbb{E}_{x,y \sim \hat{p}_{data}}\nabla_{\theta} \log p_{model}(x,y;\theta) \tag{8.6} \]
Computing this expectation over the entire training set is expensive. In practice, we estimate it stochastically by sampling a mini-batch of examples.
Terms
- Batch (deterministic) methods: use the entire training set for each update
- Stochastic (online) methods: use a single example at a time
- Mini-batch (mini-batch stochastic) methods: use more than one but fewer than all training examples
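The three regimes above differ only in how many examples feed each update; a minimal sampling sketch (the dataset and batch size of 32 are arbitrary stand-ins):

```python
import random

# The three regimes: batch uses all m examples, stochastic uses one,
# mini-batch uses a random subset of size 1 < b < m.
data = list(range(100))  # stand-in for m = 100 training examples

def sample_minibatch(data, b):
    return random.sample(data, b)  # sample without replacement, as is typical

full_batch = data                        # batch / deterministic
single     = sample_minibatch(data, 1)   # stochastic / online
minibatch  = sample_minibatch(data, 32)  # mini-batch
print(len(full_batch), len(single), len(minibatch))  # -> 100 1 32
```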
Why Mini-Batch Works
1. Trade-off Between Computation and Variance
- Larger batches provide more accurate gradient estimates but require more computation.
- The improvement is not linear — the standard deviation of the mean decreases as:
\[ \text{Std}(\bar{Z}) = \frac{\sigma}{\sqrt{n}} \]
- For example, if one approach uses 10,000 samples and another uses 100 samples, the first requires 100× more computation but only reduces the standard deviation by 10×.
2. Hardware Efficiency
- Mini-batch processing naturally fits parallel hardware such as GPUs, TPUs, or multi-core CPUs.
- Using batch sizes that are powers of two (e.g., 32, 64, 128) often improves memory alignment and throughput.
3. Regularization Effect
- Mini-batch stochasticity introduces noise into gradient estimates.
- This noise can serve as a form of implicit regularization, helping the model avoid overfitting and generalize better.
Summary: Mini-batches strike a balance between computational efficiency and statistical efficiency — they make optimization faster while preserving enough stochasticity to improve generalization.
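The \(\sigma/\sqrt{n}\) scaling in point 1 can be checked empirically: growing the batch 100× should shrink the standard deviation of the batch-mean estimate only about 10×. A sketch under the assumption of i.i.d. standard-normal "gradients":

```python
import random
import statistics

# Empirical check of Std(mean) = sigma / sqrt(n): a 100x larger batch
# (n = 25 -> n = 2500) costs 100x more computation but reduces the std
# of the mean estimate only ~10x.
random.seed(0)

def std_of_mean(n, trials=2000):
    # std of the mean of n standard-normal samples, across many trials
    means = [statistics.fmean(random.gauss(0.0, 1.0) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

ratio = std_of_mean(25) / std_of_mean(2500)
print(round(ratio, 1))  # close to 10
```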
Problems with the Hessian Matrix
Second-order methods such as Newton's method use the Hessian matrix \(H\): each update requires computing \(H\) and the product \(H^{-1}g\), which demands very large batches to obtain accurate curvature estimates.
Moreover, even small estimation errors in \(H\) are amplified when multiplied by its inverse, making parameter updates highly unstable, especially under noisy mini-batch conditions.
Because of this high computational cost and sensitivity to noise, deep learning rarely uses second-order methods, relying instead on first-order algorithms like SGD or Adam for more stable and efficient optimization.
Stochastic Gradient Descent as an Unbiased Estimator
When both x and y are discrete variables, the expected loss over the data distribution can be conveniently expressed as a double summation over all possible pairs (x, y).
In this case, the generalization objective \(J^*(\theta)\) becomes the weighted average of the loss \(L(f(x; \theta), y)\), where the weights are given by the data distribution \(p_{\text{data}}(x, y)\).
This formulation directly connects the theoretical expectation with a computable form — leading to equations (8.7) and (8.8), which explicitly write out the expected objective and its gradient under discrete data assumptions.
\[ J^*(\theta)=\sum_x\sum_yp_{data}(x,y)L(f(x;\theta),y) \tag{8.7} \]
\[ g=\nabla_{\theta}J^*(\theta)=\sum_x\sum_yp_{data}(x,y)\nabla_{\theta} L(f(x;\theta),y) \tag{8.8} \]
\[ \hat{g}=\frac{1}{m}\nabla_{\theta}\sum_iL(f(x^i;\theta),y^i) \tag{8.9} \]
Because the mini-batch examples are sampled from the data distribution, \(\hat{g}\) is an unbiased estimate of the true gradient: \(\mathbb{E}[\hat{g}]=g\). For continuous \(x\) and \(y\), analogous results follow by replacing the sums with integrals.
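Unbiasedness of the mini-batch estimate in Eq. 8.9 can be illustrated numerically: averaging \(\hat{g}\) over many random batches should approach the exact gradient \(g\) of Eq. 8.8. The linear model and squared loss below are assumptions made for the sketch:

```python
import random

# Eqs. 8.8-8.9 sketch: for an assumed linear model f(x) = theta * x with
# squared loss, the per-example gradient is 2 * (theta*x - y) * x. The
# average of mini-batch gradients over many random batches approaches the
# full-data gradient g, illustrating that g_hat is an unbiased estimator.
random.seed(1)
theta = 0.5
examples = [(float(i), 2.0 * float(i)) for i in range(1, 11)]  # y = 2x

def grad(batch):
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)

g = grad(examples)  # exact gradient g over all 10 examples
g_hat_avg = sum(grad(random.sample(examples, 3)) for _ in range(5000)) / 5000
print(round(g, 1), round(g_hat_avg, 1))  # the two values agree closely
```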
Source: Deep Learning (Ian Goodfellow, Yoshua Bengio, Aaron Courville) - Chapter 8.1