Deep Learning Book Chapter 7.2: Constrained Optimization View of Regularization

Tags: Deep Learning, Optimization, Regularization

Author: Chao Ma

Published: October 14, 2025

Context: My lecture notes

Regularization can be viewed as either adding a penalty term or enforcing a constraint. This post shows the equivalence between these two perspectives using Lagrange multipliers and dual optimization.


From Penalty to Constraint

Regularization can be viewed from two equivalent perspectives:

Penalty Form (Unconstrained)

Formula 7.25: \[ \tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) \]

This adds a penalty term \(\alpha \Omega(\theta)\) to the original loss, where \(\alpha\) controls the strength of regularization.
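To make the penalty form concrete, here is a minimal NumPy sketch for linear least squares; the synthetic data, the choice \(\Omega(\theta) = \lVert\theta\rVert_2^2\), and all variable names are illustrative assumptions, not from the book:

```python
import numpy as np

# Penalty form (Eq. 7.25) for linear least squares with an L2 penalty:
# J(theta) = ||X theta - y||^2 / (2n),  Omega(theta) = ||theta||^2.

def J(theta, X, y):
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

def J_tilde(theta, X, y, alpha):
    # Regularized objective: original loss plus alpha * Omega(theta)
    return J(theta, X, y) + alpha * (theta @ theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

theta = rng.normal(size=3)
# The regularized loss exceeds the original loss by exactly the penalty term
print(J_tilde(theta, X, y, alpha=0.1) - J(theta, X, y))  # alpha * ||theta||^2, up to float error
```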

Constraint Form (Lagrangian)

We can equivalently express regularization as a constrained optimization problem requiring \(\Omega(\theta) \leq k\).

Formula 7.26 (Lagrangian): \[ \mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k) \]

where \(\alpha \geq 0\) is the Lagrange multiplier.

Formula 7.27 (Optimization Problem): \[ \theta^* = \arg\min_{\theta} \max_{\alpha \geq 0} \mathcal{L}(\theta, \alpha) \]

This is equivalent to the constrained optimization: \[ \theta^* = \arg\min_{\theta} J(\theta; X, y) \quad \text{subject to} \quad \Omega(\theta) \leq k \]

Interpretation: The solution corresponds to finding parameters \(\theta\) that minimize the loss while satisfying the constraint. The inner maximization drives \(\alpha\) upward whenever \(\Omega(\theta) > k\), and a large \(\alpha\) in turn pushes \(\theta\) back toward the feasible region \(\Omega(\theta) \leq k\).

Lagrange Multiplier Method

The Lagrangian formulation transforms the problem into a min-max optimization: \[ \min_{\theta} \max_{\alpha \geq 0} \mathcal{L}(\theta, \alpha) \]

Dual Direction Training

Instead of plain gradient descent minimizing a single objective, we now have two opposing optimization directions:

\[ \begin{aligned} \theta &\downarrow \quad \text{(minimize w.r.t. } \theta\text{)} \\ \alpha &\uparrow \quad \text{(maximize w.r.t. } \alpha\text{)} \end{aligned} \]

Training Dynamics

The training process balances two forces:

  1. When \(\Omega(\theta) > k\) (constraint violated):
    • \(\alpha\) increases to penalize the violation
    • Larger \(\alpha\) pushes \(\theta\) toward smaller norms
    • This enforces the constraint
  2. When \(\Omega(\theta) < k\) (constraint satisfied):
    • \(\alpha\) decreases toward 0
    • The constraint is not active
    • Optimization focuses on minimizing \(J(\theta)\)
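These dynamics can be sketched as alternating gradient steps: descent on \(\theta\), projected ascent on \(\alpha\) (clipped to stay nonnegative). The least-squares loss, \(\Omega(\theta) = \lVert\theta\rVert_2^2\), the learning rates, and the value of \(k\) below are illustrative assumptions:

```python
import numpy as np

# Min-max dynamics for L(theta, alpha) = J(theta) + alpha * (||theta||^2 - k):
# gradient descent on theta, projected gradient ascent on alpha.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
k = 0.5
theta = np.zeros(5)
alpha = 0.0
lr_theta, lr_alpha = 0.05, 0.05

for _ in range(2000):
    grad_J = X.T @ (X @ theta - y) / len(y)
    # dL/dtheta = grad J + alpha * dOmega/dtheta (here: 2 * alpha * theta)
    theta -= lr_theta * (grad_J + 2 * alpha * theta)
    # dL/dalpha = Omega(theta) - k: ascend, then project onto alpha >= 0
    alpha = max(0.0, alpha + lr_alpha * (theta @ theta - k))

# If the unregularized optimum violates the constraint, Omega(theta) settles near k
print(theta @ theta, alpha)
```

Note how the \(\alpha\) update directly implements the two cases above: it grows while \(\Omega(\theta) > k\) and decays toward 0 (never below) while \(\Omega(\theta) < k\).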

Saddle Point Solution

The training converges to a saddle point satisfying the KKT (Karush-Kuhn-Tucker) conditions:

\[ \begin{aligned} \nabla_{\theta} \mathcal{L}(\theta, \alpha) &= 0 \quad \text{(stationarity)} \\ \Omega(\theta) &\leq k \quad \text{(primal feasibility)} \\ \alpha(\Omega(\theta) - k) &= 0 \quad \text{(complementary slackness)} \\ \alpha &\geq 0 \quad \text{(dual feasibility)} \end{aligned} \]

The complementary slackness condition means:

  • Either \(\alpha = 0\) (constraint inactive, \(\Omega(\theta) < k\))
  • Or \(\Omega(\theta) = k\) (constraint active, \(\alpha > 0\))
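The KKT conditions can be checked numerically. For any fixed \(\alpha > 0\), the ridge solution has a closed form, and choosing \(k = \lVert\theta^*\rVert_2^2\) makes the constraint exactly active; the data and names in this sketch are illustrative assumptions:

```python
import numpy as np

# Verify the KKT conditions at a ridge solution.
# theta* minimizes ||X theta - y||^2 / (2n) + alpha * ||theta||^2, which gives
# the linear system (X^T X / n + 2 alpha I) theta = X^T y / n.

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
alpha = 0.3
n = len(y)

theta = np.linalg.solve(X.T @ X / n + 2 * alpha * np.eye(4), X.T @ y / n)
k = theta @ theta  # choose k so the constraint is exactly active

grad_L = X.T @ (X @ theta - y) / n + 2 * alpha * theta
print(np.linalg.norm(grad_L))       # stationarity: ~0 (float error only)
print(alpha * (theta @ theta - k))  # complementary slackness: exactly 0
```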

Penalty vs Projection Methods

Weight Norm Penalties (Soft Constraint)

When using weight norm penalties such as L¹ or L² regularization:

  • The penalty term \(\alpha \Omega(\theta)\) provides a “soft” constraint
  • The optimal solution may be locally optimal for the regularized objective \(\tilde{J}(\theta)\)
  • Even if increasing weights could reduce the original loss \(J(\theta)\), the penalty prevents this
  • The regularization strength \(\alpha\) determines how strictly the constraint is enforced

Explicit Constraints (Hard Constraint)

In contrast, explicit constraint or projection methods enforce \(\Omega(\theta) \leq k\) directly:

  • Provide a hard boundary on the weight norm
  • After each gradient step, project \(\theta\) back onto the constraint set if needed
  • Often lead to more stable optimization with a clearer geometric interpretation
  • The constraint is always exactly satisfied (not approximately)
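The projection step has a simple closed form for the L² ball \(\{\theta : \lVert\theta\rVert_2^2 \leq k\}\): rescale \(\theta\) whenever its norm exceeds the bound. A minimal sketch, with illustrative data and step size:

```python
import numpy as np

# Projected gradient descent: take an ordinary gradient step, then project
# theta back onto the feasible set {theta : ||theta||^2 <= k} by rescaling.

def project(theta, k):
    norm_sq = theta @ theta
    if norm_sq > k:
        theta = theta * np.sqrt(k / norm_sq)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
k = 0.5
theta = np.zeros(5)

for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)
    theta = project(theta - 0.05 * grad, k)

print(theta @ theta)  # never exceeds k: the iterate is feasible after every step
```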

Trade-offs

| Method          | Constraint Type | Stability | Flexibility            |
|-----------------|-----------------|-----------|------------------------|
| Penalty (L¹/L²) | Soft            | Good      | High (tune \(\alpha\)) |
| Projection      | Hard            | Very good | Lower (fixed \(k\))    |

Key insight: Both approaches are equivalent in theory (there exists a correspondence between \(\alpha\) and \(k\)), but in practice they may have different optimization properties and convergence behavior.
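The correspondence between \(\alpha\) and \(k\) can be demonstrated for ridge regression: each \(\alpha\) induces a radius \(k = \lVert\theta_\alpha\rVert_2^2\), and since the norm decreases monotonically in \(\alpha\), bisection recovers the \(\alpha\) matching a target \(k\). The data and the choice \(\alpha = 0.4\) below are illustrative assumptions:

```python
import numpy as np

# Demonstrate the alpha <-> k correspondence for ridge regression.

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = rng.normal(size=80)
n = len(y)

def ridge(alpha):
    # Closed-form minimizer of ||X theta - y||^2 / (2n) + alpha * ||theta||^2
    return np.linalg.solve(X.T @ X / n + 2 * alpha * np.eye(4), X.T @ y / n)

alpha_true = 0.4
theta = ridge(alpha_true)
k = theta @ theta  # constraint radius induced by alpha_true

# ||theta_alpha||^2 is decreasing in alpha, so bisect to invert the mapping
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if ridge(mid) @ ridge(mid) > k:
        lo = mid  # norm too large -> need stronger regularization
    else:
        hi = mid

alpha_hat = 0.5 * (lo + hi)
print(alpha_hat)  # recovers alpha_true (up to bisection tolerance)
```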


Source: Deep Learning Book, Chapter 7.2