Goodfellow Deep Learning — Deep Learning Book Chapter 7.1.1: L2 Regularization

Deep Learning

Optimization

Regularization

Author

Chao Ma

Published

October 13, 2025

Context

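My lecture notes: https://github.com/ickma2311/foundations/blob/main/deep_learning/chapter7/7.1/7.1.1_L2_regularization.md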

L2 regularization (ridge regression) adds a penalty term to the loss function to prevent overfitting. This post walks through the math behind how L2 regularization affects the optimal weights, using eigenvalue decomposition to show that it shrinks weights differently in different directions based on the Hessian’s curvature.

Three Unproven Theorems

The following three results are used here without proof; proofs will be provided in a separate blog post.

1. A real symmetric matrix $H$ can be decomposed as $H = Q\Lambda Q^T$, where $Q$ is orthogonal and $\Lambda$ is the diagonal matrix of eigenvalues.
2. $[QAQ^T]^{-1} = (Q^T)^{-1}A^{-1}Q^{-1} = QA^{-1}Q^T$
- When Q is an orthogonal matrix, $Q^T = Q^{-1}$
3. Near a minimum w*, moving from w* to w increases the loss by approximately $\frac{1}{2}(w-w^*)^TH(w-w^*)$
- This is similar to the kinematic formula $s = \frac{1}{2}at^2$, with two key differences:
  - Here the variable is not time, but the displacement from w* to w
  - Here the variable is not a scalar like t, but a vector w, so the scalar a is replaced by the matrix H
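The second result can be checked numerically on a small case. The sketch below is only an illustration, not a proof: the 30-degree rotation used for Q and the diagonal entries of A are arbitrary values chosen for the example, and the helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Q: rotation by 30 degrees (an orthogonal matrix); A: invertible diagonal matrix.
t = math.radians(30)
Q = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
A = [[4.0, 0.0], [0.0, 0.5]]
A_inv = [[1 / 4.0, 0.0], [0.0, 1 / 0.5]]

lhs = inv2(matmul(matmul(Q, A), transpose(Q)))   # [Q A Q^T]^{-1}
rhs = matmul(matmul(Q, A_inv), transpose(Q))     # Q A^{-1} Q^T

# Largest entrywise difference; it should sit at floating-point noise level.
max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_diff)
```

Inverting the diagonal matrix only requires taking reciprocals of its entries, which is why this identity makes the later derivation cheap.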

Formula 7.1

\[ \tilde{J}(\theta; x, y) = J(\theta; x, y) + \alpha \Omega(\theta) \]

Total objective including regularization.

Formula 7.2

Here $w^Tw$ is the squared L2 norm of the weights, and $\alpha \ge 0$ is the penalty coefficient, a hyperparameter that is typically small (close to 0).

\[ \tilde{J}(w; X; y) = \frac{\alpha}{2}w^Tw + J(w; X; y) \]

Formula 7.3

\[ \nabla \tilde{J}(w; X; y) = \alpha w + \nabla_w J(w; X; y) \]

To understand this formula, the key is understanding the gradient of $w^Tw$.

In one dimension, $f(x) = x^2$ is the analogue of the squared L2 norm, and its derivative is $f'(x) = 2x$.

Derivation: \[ \begin{align} (x + \Delta x)^2 &= x^2 + 2x\Delta x + \Delta x^2 \\ (x + \Delta x)^2 - x^2 &= 2x\Delta x + \Delta x^2 \\ \frac{f(x + \Delta x) - f(x)}{\Delta x} &= 2x + \Delta x \end{align} \] Taking the limit $\Delta x \to 0$ gives $f'(x) = 2x$.

Extending to higher dimensions, $\nabla(w^Tw) = 2w$, so the gradient of $\frac{\alpha}{2}w^Tw$ is $\alpha w$.
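The claim that the gradient of $\frac{\alpha}{2}w^Tw$ is $\alpha w$ can be sanity-checked with finite differences. This is a minimal sketch; the values of `alpha` and `w` are arbitrary, and `numerical_grad` is a hypothetical helper written for this example.

```python
def penalty(w, alpha):
    """L2 penalty (alpha / 2) * w^T w."""
    return 0.5 * alpha * sum(wi * wi for wi in w)

def numerical_grad(f, w, h=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        down = list(w)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

alpha = 0.1                 # arbitrary penalty coefficient
w = [1.5, -2.0, 0.25]       # arbitrary weight vector

numeric = numerical_grad(lambda v: penalty(v, alpha), w)
analytic = [alpha * wi for wi in w]   # the claimed gradient, alpha * w

# Largest disagreement between the two gradients.
grad_error = max(abs(n - a) for n, a in zip(numeric, analytic))
print(numeric, analytic, grad_error)
```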

Formula 7.4 & 7.5

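Formula 7.4 is a single gradient step on the regularized objective, where $\epsilon$ is the learning rate, or step size:

\[ w \leftarrow w - \epsilon (\alpha w + \nabla_w J(w; X; y)) \]

Grouping the terms in $w$ gives formula 7.5: \[ w \leftarrow (1 - \epsilon\alpha) w - \epsilon \nabla_w J(w; X; y) \]

So each step first shrinks the weights by the constant factor $(1 - \epsilon\alpha)$ and then applies the usual gradient update.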
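The update rule can be tried on a toy problem. In the sketch below the loss $J(w) = \frac{1}{2}\|w - \text{target}\|^2$, the `target` vector, and the values of `eps` and `alpha` are all assumptions made for illustration. It compares the update as written above with the equivalent "shrink, then step" form $w \leftarrow (1 - \epsilon\alpha)w - \epsilon\nabla J$.

```python
TARGET = [1.0, -1.0]  # minimizer of the toy loss (hypothetical)

def grad_J(w):
    """Gradient of the toy loss J(w) = 0.5 * ||w - TARGET||^2."""
    return [wi - ti for wi, ti in zip(w, TARGET)]

def step_plain(w, eps, alpha):
    """w <- w - eps * (alpha * w + grad J(w))."""
    g = grad_J(w)
    return [wi - eps * (alpha * wi + gi) for wi, gi in zip(w, g)]

def step_shrink(w, eps, alpha):
    """w <- (1 - eps * alpha) * w - eps * grad J(w): shrink, then step."""
    g = grad_J(w)
    return [(1 - eps * alpha) * wi - eps * gi for wi, gi in zip(w, g)]

eps, alpha = 0.1, 0.5
w_a = w_b = [3.0, 3.0]
for _ in range(200):
    w_a = step_plain(w_a, eps, alpha)
    w_b = step_shrink(w_b, eps, alpha)

# Both forms follow the same trajectory. For this toy loss the iterates
# settle at TARGET / (1 + alpha), i.e. pulled toward the origin.
print(w_a, w_b)
```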

Formula 7.6

This formula is a quadratic approximation of the unregularized loss $J$ in the neighborhood of w*, the weights that minimize $J$ without the L2 penalty:

\[ \hat{J}(\theta) = J(w^*) + \frac{1}{2}(w - w^*)^T H(w - w^*) \]

This can be seen as a squared distance weighted by the Hessian matrix $H$ of $J$ at w*. Because the curvature differs in each direction, we multiply the squared distance in each direction by the curvature in that direction to get a weighted squared distance.

To understand this, we need the second-order Taylor expansion around the minimum. For one dimension: \[ f(x) \approx f(x^*) + f'(x^*)(x - x^*) + \frac{1}{2}f''(x^*)(x - x^*)^2 \]

For multiple dimensions: \[ J(w) \approx J(w^*) + (w - w^*)^\top \nabla J(w^*) + \frac{1}{2}(w - w^*)^\top H(w - w^*) \]

Since w* is a minimum of $J$, the gradient $\nabla J(w^*)$ is 0, so the linear term $(w - w^*)^\top \nabla J(w^*)$ vanishes.

This directly gives us formula 7.6.
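To see how good the quadratic approximation is, it helps to compare it with a loss that is not itself quadratic. The sketch below assumes a toy loss $J(w) = e^{w_1} - w_1 + 2w_2^2$, chosen only because its minimum ($w^* = 0$) and Hessian at the minimum ($\operatorname{diag}(1, 4)$) are easy to work out by hand.

```python
import math

def J(w):
    """Toy loss (hypothetical): minimized at w* = (0, 0), where J(w*) = 1."""
    return math.exp(w[0]) - w[0] + 2.0 * w[1] ** 2

w_star = [0.0, 0.0]
H = [[1.0, 0.0], [0.0, 4.0]]  # Hessian of J at w*, computed by hand

def J_hat(w):
    """Quadratic approximation J(w*) + 0.5 * (w - w*)^T H (w - w*)."""
    d = [wi - si for wi, si in zip(w, w_star)]
    Hd = [sum(h * dj for h, dj in zip(row, d)) for row in H]
    return J(w_star) + 0.5 * sum(di * hdi for di, hdi in zip(d, Hd))

# Approximation error close to w* and farther away from it.
err_near = abs(J([0.01, 0.01]) - J_hat([0.01, 0.01]))
err_far = abs(J([0.5, 0.5]) - J_hat([0.5, 0.5]))
print(err_near, err_far)
```

The error grows with the distance from w*, which is why the approximation is only claimed to hold in a neighborhood of the minimum.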

Formula 7.7

\[ \nabla \hat{J}(w) = H(w - w^*) \]

Formula 7.7 is the gradient of formula 7.6. Since $J(w^*)$ is a constant with gradient 0, we only need to find the gradient of $\frac{1}{2}(w - w^*)^T H(w - w^*)$. Because $H$ is symmetric, this is $H(w - w^*)$, the multi-dimensional analogue of $\frac{d}{dx}\frac{1}{2}h(x - x^*)^2 = h(x - x^*)$.
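The same gradient can also be worked out component by component. A short derivation, writing $d = w - w^*$ (a symbol introduced here only for brevity) and using the symmetry $H_{ik} = H_{ki}$:

\[ \begin{align} \frac{\partial}{\partial d_k} \, \frac{1}{2}\sum_{i,j} d_i H_{ij} d_j &= \frac{1}{2}\sum_{j} H_{kj} d_j + \frac{1}{2}\sum_{i} d_i H_{ik} \\ &= \sum_{j} H_{kj} d_j = [Hd]_k \end{align} \]

Stacking the components gives $\nabla \hat{J}(w) = Hd = H(w - w^*)$. Without symmetry the result would instead be $\frac{1}{2}(H + H^T)d$.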

Formula 7.8/7.9/7.10

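Formula 7.8 combines 7.3 and 7.7. It states that the gradient of the regularized approximation vanishes at its minimizer $\tilde{w}$: \[ \alpha \tilde{w} + H(\tilde{w} - w^*) = 0 \]

From 7.3, we already have: \[ \nabla \tilde{J}(w; X; y) = \alpha w + \nabla_w J(w; X; y) \]

Replacing $\nabla_w J$ with the approximate gradient $H(w - w^*)$ from 7.7 and setting the result to 0 at $w = \tilde{w}$ gives us 7.8.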

Formula 7.9 is a transformation of 7.8: \[ \begin{align} \alpha \tilde{w} + H\tilde{w} &= Hw^* \\ (H + \alpha I)\tilde{w} &= Hw^* \end{align} \]

Multiplying both sides by the inverse of $(H + \alpha I)$, the left side becomes just $\tilde{w}$, which gives formula 7.10: \[ \tilde{w} = (H + \alpha I)^{-1} Hw^* \]

When $\alpha \to 0$, $\alpha I \to \mathbf{0}$, so (assuming $H$ is invertible) $\tilde{w} \to H^{-1}Hw^* = w^*$.
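The closed form can be checked on a small example. The sketch below assumes an arbitrary symmetric positive definite 2x2 matrix for $H$ and arbitrary values for $w^*$ and $\alpha$; it solves $(H + \alpha I)\tilde{w} = Hw^*$ directly, confirms that $\alpha\tilde{w} + H(\tilde{w} - w^*)$ is zero, and shows that a tiny $\alpha$ recovers $w^*$.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def solve2(M, b):
    """Solve the 2x2 linear system M x = b with Cramer's rule."""
    (m00, m01), (m10, m11) = M
    det = m00 * m11 - m01 * m10
    return [(b[0] * m11 - m01 * b[1]) / det,
            (m00 * b[1] - m10 * b[0]) / det]

def regularized_minimizer(H, w_star, alpha):
    """Solve (H + alpha * I) w = H w* for w."""
    H_reg = [[H[i][j] + (alpha if i == j else 0.0) for j in range(2)]
             for i in range(2)]
    return solve2(H_reg, matvec(H, w_star))

H = [[3.0, 1.0], [1.0, 2.0]]   # hypothetical symmetric positive definite Hessian
w_star = [1.0, -1.0]           # hypothetical unregularized minimizer
alpha = 0.5

w_tilde = regularized_minimizer(H, w_star, alpha)

# Stationarity check: alpha * w_tilde + H (w_tilde - w*) should be zero.
diff = [wt - ws for wt, ws in zip(w_tilde, w_star)]
residual = [alpha * wt + hd for wt, hd in zip(w_tilde, matvec(H, diff))]

# With a tiny alpha the regularized solution is almost exactly w*.
w_small_alpha = regularized_minimizer(H, w_star, 1e-9)
print(w_tilde, residual, w_small_alpha)
```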

Formula 7.11 & 7.12 & 7.13

Because $H$ is real and symmetric, it can be decomposed as $H = Q\Lambda Q^T$, where the columns of $Q$ are orthonormal eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.

We get formula 7.11: \[ \tilde{w} = (Q\Lambda Q^T + \alpha I)^{-1} Q\Lambda Q^T w^* \]

Replace I with $QQ^T$ (valid because $Q$ is orthogonal): \[ Q\Lambda Q^T + \alpha QQ^T \]

Factoring out $Q$ on the left and $Q^T$ on the right leaves $\Lambda + \alpha I$ in the middle, since $\alpha QQ^T = Q(\alpha I)Q^T$.

This gives us formula 7.12: \[ \tilde{w} = [Q(\Lambda + \alpha I)Q^T]^{-1} Q\Lambda Q^T w^* \]

Substituting $[QAQ^T]^{-1} = QA^{-1}Q^T$ and using $Q^TQ = I$ gives formula 7.13: \[ \tilde{w} = Q(\Lambda + \alpha I)^{-1} Q^T Q\Lambda Q^T w^* = Q(\Lambda + \alpha I)^{-1} \Lambda Q^T w^* \]

To write this in terms of the factors $\frac{\lambda_i}{\lambda_i + \alpha}$:

$\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n), \quad \Lambda + \alpha I = \operatorname{diag}(\lambda_1 + \alpha, \dots, \lambda_n + \alpha)$
$(\Lambda + \alpha I)^{-1} = \operatorname{diag}\!\Big(\tfrac{1}{\lambda_1 + \alpha}, \dots, \tfrac{1}{\lambda_n + \alpha}\Big)$
$(\Lambda + \alpha I)^{-1}\Lambda = \operatorname{diag}\!\Big(\tfrac{1}{\lambda_i + \alpha}\Big) \, \operatorname{diag}(\lambda_i) = \operatorname{diag}\!\Big(\tfrac{\lambda_i}{\lambda_i + \alpha}\Big)$
So: \[ \tilde{w} = Q \operatorname{diag}\Big(\frac{\lambda_i}{\lambda_i + \alpha}\Big) Q^T w^* \]

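This tells us that the component of w* along the $i$-th eigenvector of $H$ is rescaled by $\frac{\lambda_i}{\lambda_i + \alpha}$. In directions with large eigenvalues ($\lambda_i \gg \alpha$), the factor is close to 1 and w is mostly preserved; in directions with small eigenvalues ($\lambda_i \ll \alpha$), the factor is close to 0 and w is shrunk toward zero.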
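The per-direction shrinkage can be made concrete with numbers. The sketch below builds a Hessian from assumed ingredients (a 30-degree rotation for $Q$, eigenvalues 10 and 0.1, $\alpha = 1$, and $w^* = (1, 1)$, all chosen for illustration) and checks that the eigenvalue form agrees with solving $(H + \alpha I)\tilde{w} = Hw^*$ directly.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*M)]

def solve2(M, b):
    """Solve the 2x2 linear system M x = b with Cramer's rule."""
    (m00, m01), (m10, m11) = M
    det = m00 * m11 - m01 * m10
    return [(b[0] * m11 - m01 * b[1]) / det,
            (m00 * b[1] - m10 * b[0]) / det]

t = math.radians(30)
Q = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]  # eigenvectors as columns
lams = [10.0, 0.1]      # one high-curvature and one low-curvature direction
alpha = 1.0
w_star = [1.0, 1.0]

# H = Q diag(lams) Q^T, built entry by entry.
H = [[sum(Q[i][k] * lams[k] * Q[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

# Eigenvalue form: rotate into the eigenbasis, rescale each coordinate, rotate back.
factors = [lam / (lam + alpha) for lam in lams]
coords = matvec(transpose(Q), w_star)            # Q^T w*
w_eig = matvec(Q, [f * c for f, c in zip(factors, coords)])

# Direct form: solve (H + alpha I) w = H w*.
H_reg = [[H[i][j] + (alpha if i == j else 0.0) for j in range(2)] for i in range(2)]
w_direct = solve2(H_reg, matvec(H, w_star))

print(factors)           # roughly 0.909 for lambda = 10 and 0.091 for lambda = 0.1
print(w_eig, w_direct)   # the two forms should agree
```

With these values the high-curvature direction keeps about 91% of its component while the low-curvature direction keeps only about 9%.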

Contour Lines

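Figure: L2 Regularization Contour Lines (https://github.com/ickma2311/foundations/blob/main/deep_learning/chapter7/7.1/l2_contour.png?raw=true)

Understanding this figure: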

Solid lines are the contour lines of the unregularized loss function. Each ellipse is a set of points with equal loss. Moving along a contour leaves the loss unchanged; the gradient at each point is perpendicular to the contour's tangent.
Dashed lines are the contour lines of the L2 penalty. Because the penalty depends only on the norm of w, it treats all directions equally, so its contour lines are circles centered at the origin.
At the point where a solid ellipse is tangent to a dashed circle, the two terms are in balance (their gradients are opposite in direction, equal in magnitude).
The ellipse is elongated along the x-axis, showing that the loss changes slowly along x and quickly along y: for the same change in loss, we need to move farther along x than along y.
At the tangent point, regularization has moved w a long way along the x-axis but only a little along the y-axis, showing that the y component is preserved better, while the x component has a relatively large adjustment.

The ellipse shows the sensitivity of the loss function in different directions. The circle shows the uniform penalty of regularization in all directions. The point where they are tangent has balanced gradients and minimizes the regularized objective. It is the optimal solution for ridge regression, which shrinks more in low-curvature (insensitive) directions and less in high-curvature (sensitive) directions.
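The gradient balance at the tangent point can be reproduced with a two-parameter example. The numbers below are assumptions meant to mimic the situation described above (low curvature along x, high curvature along y); they are not read off the figure.

```python
# Hypothetical axis-aligned setup: the Hessian is diagonal, so the
# eigenvector directions coincide with the x and y axes.
H_diag = [0.1, 10.0]   # curvature along x (flat) and along y (steep)
w_star = [1.0, 1.0]    # unregularized minimizer
alpha = 1.0

# Regularized minimizer: each coordinate is rescaled by lambda / (lambda + alpha).
w_tilde = [lam / (lam + alpha) * ws for lam, ws in zip(H_diag, w_star)]

# Gradient of the quadratic loss approximation and of the L2 penalty at w_tilde.
grad_loss = [lam * (wt - ws) for lam, wt, ws in zip(H_diag, w_tilde, w_star)]
grad_penalty = [alpha * wt for wt in w_tilde]

# At the tangent point the two gradients cancel.
balance = [gl + gp for gl, gp in zip(grad_loss, grad_penalty)]
print(w_tilde)                    # x is pulled close to 0, y stays close to 1
print(grad_loss, grad_penalty, balance)
```

Here the x component drops from 1 to about 0.09 while the y component only drops to about 0.91, matching the reading of the figure: the flat direction gives way to the penalty, the steep one resists it.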

Source: Deep Learning Book, Chapter 7.1.1
