MIT 18.065: Lecture 22 - Gradient Descent

Linear Algebra
Optimization
Gradient Descent
Hessian
Convexity
Condition Number
Author

Chao Ma

Published

February 27, 2026

Gradient descent is a fundamental algorithm for minimizing a function, but its behavior is controlled by curvature, step size, and condition number.

A Quadratic Matrix

\[ f(v)=\frac{1}{2}v^\top S v=\frac{1}{2}(x^2+b y^2), \qquad S=\begin{bmatrix}1&0\\0&b\end{bmatrix}, \qquad v=\begin{bmatrix}x\\y\end{bmatrix}. \]

This is the standard bowl model for analyzing gradient methods.

Gradient

For a linear function \[ f(x,y)=2x+5y, \] its gradient is \[ \nabla f=\begin{bmatrix}2\\5\end{bmatrix}, \] and the Hessian is zero: \[ H=\nabla^2 f=\begin{bmatrix}0&0\\0&0\end{bmatrix}. \]

So linear functions have constant slope and no curvature.

Hessian and Convexity

Definitions

  • Convex: Hessian is positive semidefinite (PSD).
  • Strictly convex: Hessian is positive definite (PD).

In eigenvalue language:

  • PSD means \(\lambda_i\ge 0\) for all \(i\) (flat directions are allowed).
  • PD means \(\lambda_i>0\) for all \(i\) (curvature in every direction).

Consequences:

  • PSD convex objective: global minima exist, but may not be unique.
  • PD objective: global minimum is unique.

Hessian view: convex vs non-convex geometry

Example

\[ f(x)=\frac{1}{2}x^\top Sx-a^\top x-b. \]

Then

  • \(\nabla f = Sx-a\)
  • \(H=\nabla^2 f=S\)

The minimizer solves \[ Sx-a=0 \quad\Rightarrow\quad x^*=S^{-1}a, \] assuming \(S\) is invertible.

A Remarkable Convex Function

\[ f(X)=-\log\det(X),\qquad X\in\mathbb{S}_{++}^n. \]

A key identity from matrix calculus is \[ \frac{\partial \log\det X}{\partial X_{ij}}=(X^{-1})_{ji}, \] so \[ \nabla_X\big(-\log\det X\big)=-X^{-\top}. \]

This function appears throughout optimization (barrier methods, covariance estimation, SDP-type models).

Gradient Descent

\[ x_{k+1}=x_k-s_k\nabla f(x_k), \] where \(s_k\) is the step size (learning rate).

Gradient descent update and learning-rate effect

Line Search Strategies

Condition Number and Reduction

For quadratic objectives, convergence speed is governed by the condition number \[ \kappa=\frac{M}{m}, \] where \(M\) and \(m\) are largest/smallest Hessian eigenvalues.

  • Well-conditioned (\(\kappa\approx 1\)): level sets are close to circles, gradients point toward the minimizer, and convergence is fast.
  • Ill-conditioned (\(\kappa\gg 1\)): level sets are elongated, iterates zigzag across valleys, and convergence is slow.

A classical reduction factor with exact line search behaves like \[ \rho\approx\frac{\kappa-1}{\kappa+1}, \] so larger \(\kappa\) means slower progress per step.


Takeaway. Gradient descent performance is not just about choosing a learning rate. The geometry of the objective (through Hessian eigenvalues and condition number) is the central factor determining whether optimization is smooth or painfully slow.