Deep Learning Chapter 8.1: How Learning Differs from Pure Optimization

deep learning
optimization
empirical risk minimization
mini-batch
Why machine learning optimization is fundamentally different from pure optimization and why mini-batch methods work
Author

Chao Ma

Published

November 4, 2025

Pure Optimization vs. Machine Learning

Pure Optimization

The objective of pure optimization:

\[ \min_{\theta}J(\theta) \]

In pure optimization, the objective function is fixed and fully known, and the problem typically has an exact solution. For example, \(\min_x f(x)=(x-3)^2\) has the solution \(x=3\).

Machine Learning

In machine learning, we optimize the empirical risk:

\[ J(\theta)=\mathbb{E}_{(x,y)\sim \hat{p}_{data}}L(f(x;\theta),y) \tag{8.1} \]

where:

  • \(L\) is the loss function
  • \(f(x;\theta)\) is the model output
  • \(\hat{p}_{data}\) is the empirical distribution over the training data

In machine learning, our true goal is to minimize the expected loss on the true distribution, but we can only minimize the empirical loss on the training set. Thus, the optimization problem is only an approximation — and it often has no exact solution.

\(\hat{p}_{data}\) approximates the true data distribution \(p_{data}\) by assigning equal probability to each training sample.

Expected Risk

Generally, our ultimate goal is to minimize the expected loss under the true data distribution \(p_{data}\), rather than the empirical distribution \(\hat{p}_{data}\) estimated from the training data.

\[ J^*(\theta)=\mathbb{E}_{(x,y)\sim {p}_{data}}L(f(x;\theta),y) \tag{8.2} \]


Empirical Risk Minimization

The challenge in machine learning is that we only have access to the empirical distribution \(\hat{p}_{data}(x, y)\) derived from the training set, but the true data distribution \(p_{data}(x, y)\) remains unknown.

The standard workaround is to minimize the empirical risk instead, in the hope that doing so also reduces the expected risk:

\[ \mathbb{E}_{(x,y)\sim \hat{p}_{data}}[L(f(x;\theta),y)]=\frac{1}{m}\sum_{i=1}^m L(f(x^i;\theta),y^i) \tag{8.3} \]
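As an illustration (a minimal sketch, not from the book), Equation 8.3 says the empirical risk is simply the average per-example loss over the training set; here with a linear model and squared-error loss:

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average per-example loss over the m training samples (Eq. 8.3)."""
    preds = X @ theta            # model output f(x; theta) = x . theta
    losses = (preds - y) ** 2    # per-example loss L(f(x; theta), y)
    return losses.mean()         # (1/m) * sum_i L(f(x^i; theta), y^i)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta               # noiseless labels for the toy example

print(empirical_risk(true_theta, X, y))       # 0.0 at the true parameters
print(empirical_risk(np.zeros(3), X, y) > 0)  # True: positive elsewhere
```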

However, some useful loss functions cannot be directly used in deep learning. For example, the 0–1 loss is discrete and non-differentiable, which makes it impossible to optimize using gradient-based methods. In such cases, we replace it with a proxy (or surrogate) loss function.


Proxy Loss Functions and Early Stopping

Motivation

Some loss functions that truly reflect the model’s performance (such as the 0–1 loss for classification accuracy) cannot be optimized efficiently — they are discrete, non-differentiable, and sometimes even NP-hard to minimize.

Therefore, in practice, we replace them with surrogate (proxy) loss functions that are smooth and differentiable, making them compatible with gradient-based optimization.

Surrogate Loss Functions

A surrogate loss serves as a stand-in objective for the true but intractable loss. It not only makes optimization feasible but can also bring useful properties such as smoother gradients or better generalization. A related difference from pure optimization is that training usually does not halt at a local minimum of the surrogate: instead, an early-stopping criterion (typically the true loss, such as the 0–1 loss, measured on a validation set) stops training while the surrogate could still be decreased further.

Generated by ChatGPT:

  • 0–1 loss (classification error) → logistic loss / cross-entropy: differentiable, smooth, convex
  • Margin-based classification → hinge loss (SVM): encourages large margins
  • Likelihood maximization → negative log-likelihood: probabilistic interpretation
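To make the first pairing concrete, here is a small sketch (my own example, not from the book) contrasting the 0–1 loss with its logistic surrogate for a binary classifier with real-valued score \(s\) and label \(y\in\{-1,+1\}\):

```python
import math

def zero_one_loss(score, y):
    """1 if the sign of the score disagrees with the label, else 0.
    Piecewise constant, so it gives no useful gradient."""
    return 0.0 if score * y > 0 else 1.0

def logistic_loss(score, y):
    """Smooth, convex surrogate: log(1 + exp(-y * score))."""
    return math.log1p(math.exp(-y * score))

# The surrogate decreases smoothly as the margin y * score grows,
# supplying a gradient even where the 0-1 loss is flat.
for s in (-2.0, 0.5, 2.0):
    print(s, zero_one_loss(s, +1), round(logistic_loss(s, +1), 3))
```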

Batch and Mini-Batch Algorithms

In machine learning, the optimization algorithm updates the parameters using an estimate of the expectation over a batch of training samples.

For example, the maximum likelihood estimation:

\[ \theta_{ML}=\arg\max_{\theta}\sum_{i=1}^m \log p_{model}(x^i,y^i;\theta) \tag{8.4} \]

Maximizing the true expectation is approximated by maximizing the expectation over the empirical training data:

\[ J(\theta)=\mathbb{E}_{x,y \sim \hat{p}_{data}}\log p_{model}(x,y;\theta) \tag{8.5} \]

The gradient is:

\[ \nabla_{\theta} J(\theta)=\mathbb{E}_{x,y \sim \hat{p}_{data}}\nabla_{\theta} \log p_{model}(x,y;\theta) \tag{8.6} \]

Computing this expectation exactly requires evaluating the model on every training example, which is expensive. In practice, we estimate it stochastically by averaging over a randomly sampled mini-batch.

Terms

  • Batch (deterministic) methods: use the entire training set for every update
  • Stochastic (online) methods: use a single sample per update
  • Mini-batch (minibatch stochastic) methods: use more than one but fewer than all training samples
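The three regimes differ only in how many examples enter each gradient estimate. A hypothetical sketch for a linear least-squares model (names and sizes are my own choices):

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of the mean squared error (1/m) * sum_i (x_i . theta - y_i)^2."""
    return (2.0 / len(X)) * X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)
theta = np.zeros(5)

g_batch = grad(theta, X, y)                     # batch: all 1000 samples
i = rng.integers(len(X))
g_online = grad(theta, X[i:i + 1], y[i:i + 1])  # stochastic: 1 sample
idx = rng.choice(len(X), size=64, replace=False)
g_mini = grad(theta, X[idx], y[idx])            # mini-batch: 64 samples
```

Typically the mini-batch estimate tracks the batch gradient far more closely than the single-sample one, at a fraction of the batch cost.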

Why Mini-Batch Works

1. Trade-off Between Computation and Variance

  • Larger batches provide more accurate gradient estimates but require more computation.
  • The improvement is less than linear: the standard error of the mean of \(n\) samples falls only as

\[ \text{Std}(\bar{Z}) = \frac{\sigma}{\sqrt{n}} \]

  • For example, an estimate from 10,000 samples costs 100× more computation than one from 100 samples, yet reduces the standard error by only a factor of 10.
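The \(\sigma/\sqrt{n}\) scaling is easy to verify empirically; a quick simulation sketch (the numbers are my own, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0      # standard deviation of a single draw
trials = 1000    # number of independent means to measure

def std_of_mean(n):
    """Empirical standard deviation of the mean of n i.i.d. draws."""
    samples = rng.normal(0.0, sigma, size=(trials, n))
    return samples.mean(axis=1).std()

# Theory predicts sigma / sqrt(n): 100x more samples, only 10x less noise.
print(std_of_mean(100))    # ~ 2 / sqrt(100)   = 0.2
print(std_of_mean(10000))  # ~ 2 / sqrt(10000) = 0.02
```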

2. Hardware Efficiency

  • Mini-batch processing naturally fits parallel hardware such as GPUs, TPUs, or multi-core CPUs.
  • Using batch sizes that are powers of two (e.g., 32, 64, 128, 256) often improves memory alignment and throughput on such hardware.

3. Regularization Effect

  • Mini-batch stochasticity introduces noise into gradient estimates.
  • This noise can serve as a form of implicit regularization, helping the model avoid overfitting and generalize better.

Summary: Mini-batches strike a balance between computational efficiency and statistical efficiency — they make optimization faster while preserving enough stochasticity to improve generalization.


Problems of Hessian Matrix

Second-order methods such as Newton’s method use the Hessian matrix \(H\): each update requires computing \(H^{-1}g\), and estimating the curvature in \(H\) accurately demands very large batches.

However, even small estimation errors are amplified when the inverse of \(H\) is applied, making parameter updates highly unstable, especially under noisy mini-batch conditions.

Because of this high computational cost and sensitivity to noise, deep learning rarely uses second-order methods, relying instead on first-order algorithms like SGD or Adam for more stable and efficient optimization.
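The noise amplification can be seen on a toy quadratic (an illustrative sketch with made-up numbers): with an ill-conditioned Hessian, a small error in the gradient estimate produces a huge change in the Newton step \(H^{-1}g\).

```python
import numpy as np

# Quadratic objective with an ill-conditioned Hessian (condition number 1e4).
H = np.diag([1.0, 1e-4])
g = np.array([1.0, 1.0])        # "true" gradient at the current point
noise = np.array([0.0, 0.01])   # small error, e.g. from a noisy mini-batch

step_true = np.linalg.solve(H, g)           # Newton step H^{-1} g
step_noisy = np.linalg.solve(H, g + noise)

# A 1% gradient error along the low-curvature direction is divided by
# the tiny eigenvalue 1e-4, changing that step component by ~100.
print(step_noisy - step_true)
```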


Stochastic Gradient Descent as an Unbiased Estimator

When both x and y are discrete variables, the expected loss over the data distribution can be conveniently expressed as a double summation over all possible pairs (x, y).

In this case, the generalization objective \(J^*(\theta)\) becomes the weighted average of the loss \(L(f(x; \theta), y)\), where the weights are given by the data distribution \(p_{\text{data}}(x, y)\).

This formulation directly connects the theoretical expectation with a computable form — leading to equations (8.7) and (8.8), which explicitly write out the expected objective and its gradient under discrete data assumptions.

\[ J^*(\theta)=\sum_x\sum_yp_{data}(x,y)L(f(x;\theta),y) \tag{8.7} \]

\[ g=\nabla_{\theta}J^*(\theta)=\sum_x\sum_yp_{data}(x,y)\nabla_{\theta} L(f(x;\theta),y) \tag{8.8} \]

Sampling a mini-batch of \(m\) examples and averaging their per-example gradients yields an unbiased estimate \(\hat{g}\) of this gradient:

\[ \hat{g}=\frac{1}{m}\nabla_{\theta}\sum_{i=1}^m L(f(x^i;\theta),y^i) \tag{8.9} \]

For continuous x and y, similar results can be obtained by replacing the sums with integrals.
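A quick numerical check (my own sketch) that the estimator \(\hat{g}\) in Equation 8.9 is unbiased: averaged over many random mini-batches, it recovers the gradient computed on the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=500)
theta = np.zeros(3)

def grad(Xb, yb):
    """Mean-squared-error gradient on a batch, in the form of Eq. 8.9."""
    return (2.0 / len(Xb)) * Xb.T @ (Xb @ theta - yb)

g_full = grad(X, y)   # exact gradient over the whole training set

# Average hat{g} over many independent mini-batches of size 32.
estimates = []
for _ in range(5000):
    idx = rng.choice(len(X), size=32, replace=False)
    estimates.append(grad(X[idx], y[idx]))
g_avg = np.mean(estimates, axis=0)

print(np.abs(g_avg - g_full).max())  # small: no systematic bias
```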


Source: Deep Learning (Ian Goodfellow, Yoshua Bengio, Aaron Courville) - Chapter 8.1