MIT 18.065: Linear Algebra Applications

Author

Chao Ma

Published

November 29, 2025

All Lectures

My notes from MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning. This course explores how linear algebra powers modern applications in data analysis, signal processing, and machine learning.


Lecture 1: The Column Space of A Contains All Vectors Ax How column combinations define \(C(A)\), why rank counts independent columns, and how the CR factorization exposes a basis for the column space and the row space. A column-by-row view of \(AB\) shows every column of \(AB\) lies in \(C(A)\).
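
A quick NumPy sketch of \(A = CR\) on a small example of my own (rank 2, with the third column a combination of the first two); here \(R\) is recovered by solving \(CR = A\) rather than by row reduction:

```python
import numpy as np

# Rank-2 example: column 3 equals 2*(column 1) + 2*(column 2).
A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])
r = np.linalg.matrix_rank(A)                  # counts independent columns: r = 2

C = A[:, :r]                                  # the first r columns are independent here
R, *_ = np.linalg.lstsq(C, A, rcond=None)     # solve C R = A for the 2 x 3 matrix R

print(r)                                      # 2
print(np.allclose(C @ R, A))                  # True: every column of A lies in C(A) = C(C)
```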

Lecture 2: Multiplying and Factoring Matrices From elimination and \(LU\) to rank-1 updates, \(QR\), spectral decomposition, diagonalization, and SVD. Ends with the four fundamental subspaces and a null space example.
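
A short sketch checking several of these factorizations numerically on a small symmetric matrix I made up (SciPy's `lu` is used only for the elimination step):

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4., 2., 1.],
              [2., 3., 0.],
              [1., 0., 2.]])

P, L, U = lu(A)                    # elimination recorded as A = P L U
Q, R = np.linalg.qr(A)             # orthonormal columns times upper triangular
lam, X = np.linalg.eig(A)          # diagonalization A = X Lambda X^{-1}
Us, s, Vt = np.linalg.svd(A)       # A = U Sigma V^T

print(np.allclose(P @ L @ U, A), np.allclose(Q @ R, A))
print(np.allclose(X @ np.diag(lam) @ np.linalg.inv(X), A))
print(np.allclose(Us @ np.diag(s) @ Vt, A))
```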

Lecture 3: Orthonormal Columns Orthogonal matrices with \(Q^\top Q = I\): rotation matrices that preserve length, reflection matrices (including Householder reflections \(H = I - 2uu^\top\)), Hadamard matrices built recursively for signal processing, and wavelets that capture multiscale structure. Eigenvectors of symmetric matrices are orthogonal; the lecture culminates in the discrete Fourier transform.
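
A minimal check of the two constructions, with a unit vector and the \(2\times 2\) Hadamard block chosen for illustration:

```python
import numpy as np

# Householder reflection H = I - 2 u u^T for a unit vector u.
u = np.array([1., 2., 2.]) / 3.0                 # ||u|| = 1
H = np.eye(3) - 2.0 * np.outer(u, u)
print(np.allclose(H.T @ H, np.eye(3)))           # Q^T Q = I: H is orthogonal
x = np.array([1., -1., 4.])
print(np.isclose(np.linalg.norm(H @ x), np.linalg.norm(x)))  # reflections preserve length

# Hadamard matrices built recursively: H_{2n} = [[H_n, H_n], [H_n, -H_n]].
H2 = np.array([[1., 1.],
               [1., -1.]])
H4 = np.block([[H2, H2], [H2, -H2]])
print(np.allclose(H4.T @ H4, 4 * np.eye(4)))     # columns orthogonal, each with norm sqrt(4)
```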

Lecture 4: Eigenvalues and Eigenvectors Eigenvectors \(Ax = \lambda x\) preserve direction under transformation. Key properties: \(A^k x = \lambda^k x\), solving difference equations \(v_{k+1} = Av_k\) and differential equations. Similar matrices \(B = M^{-1}AM\) share eigenvalues. Symmetric matrices have real eigenvalues and orthogonal eigenvectors, enabling spectral decomposition \(S = Q\Lambda Q^\top\).
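
A small numerical illustration (the symmetric matrix \(S\) and the similarity \(M\) are my own choices):

```python
import numpy as np

S = np.array([[2., 1.],
              [1., 2.]])                          # symmetric: eigenvalues 1 and 3
lam, Q = np.linalg.eigh(S)                        # real eigenvalues, orthonormal eigenvectors
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))     # spectral decomposition S = Q Lambda Q^T

# Powers are easy once S is diagonalized: S^k = Q Lambda^k Q^T.
k = 5
print(np.allclose(np.linalg.matrix_power(S, k), Q @ np.diag(lam**k) @ Q.T))

# Similar matrices B = M^{-1} S M share the eigenvalues of S.
M = np.array([[1., 1.],
              [0., 1.]])
B = np.linalg.inv(M) @ S @ M
print(np.allclose(np.sort(np.linalg.eigvals(B)), lam))   # eigh already returned them sorted
```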

Lecture 5: Positive Definite and Semidefinite Matrices Positive definite matrices have all \(\lambda_i > 0\) and energy \(x^\top Sx > 0\) for all \(x \ne 0\). Factorize as \(S = AA^\top\) with full-rank \(A\). The quadratic form creates an energy bowl—convex and minimizable via gradient descent. This connects linear algebra to optimization: deep learning minimizes loss landscapes just as we minimize \(x^\top Sx\). Positive semidefinite relaxes to \(\lambda_i \ge 0\).
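
A tiny sketch with a hand-picked positive definite \(S\) (Cholesky supplies one full-rank \(A\) with \(S = AA^\top\)); the gradient-descent loop and step size are illustrative choices:

```python
import numpy as np

S = np.array([[2., 1., 0.],
              [1., 2., 1.],
              [0., 1., 2.]])
lam = np.linalg.eigvalsh(S)
print(np.all(lam > 0))                        # positive definite: all eigenvalues > 0

A = np.linalg.cholesky(S)                     # one full-rank A with S = A A^T
print(np.allclose(A @ A.T, S))

# The energy x^T S x is a convex bowl; gradient descent (gradient = 2 S x) slides to x = 0.
x = np.array([1.0, -2.0, 0.5])
step = 0.9 / lam.max()                        # safely below 2 / lambda_max
for _ in range(200):
    x = x - step * (2 * S @ x)
print(x @ S @ x)                              # energy is now essentially zero
```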

Lecture 6: Singular Value Decomposition (SVD) SVD factorizes any \(m \times n\) matrix as \(A = U\Sigma V^\top\): rotate-scale-rotate. Compute \(U\) from eigenvectors of \(AA^\top\), \(V\) from eigenvectors of \(A^\top A\), and \(\Sigma\) from square roots of the shared eigenvalues. Singular values relate to eigenvalues: for square matrices \(|\det A| = \prod \sigma_i\) while \(\det A = \prod \lambda_i\). Polar decomposition \(A = (U\Sigma U^\top)(UV^\top)\) splits any matrix into symmetric positive semidefinite × orthogonal. Exercises cover computing the SVD and understanding quadratic forms through eigendecomposition.
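
A numerical check of these identities on a random \(3\times 3\) matrix (seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))

U, s, Vt = np.linalg.svd(A)                   # A = U Sigma V^T
print(np.allclose(U @ np.diag(s) @ Vt, A))

# sigma_i^2 are the eigenvalues of A^T A (and of A A^T).
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A))))

# |det A| equals the product of singular values; det A is the product of eigenvalues.
print(np.isclose(abs(np.linalg.det(A)), s.prod()))

# Polar decomposition: symmetric positive semidefinite times orthogonal.
S_part = U @ np.diag(s) @ U.T
Q_part = U @ Vt
print(np.allclose(S_part @ Q_part, A), np.allclose(Q_part.T @ Q_part, np.eye(3)))
```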

Lecture 7: Eckart-Young Theorem - The Closest Rank k Matrix The Eckart-Young theorem proves that truncated SVD \(A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top\) gives the best rank-\(k\) approximation: \(\|A - B\| \ge \|A - A_k\|\) for any rank-\(k\) matrix \(B\). Matrix norms (spectral, Frobenius, nuclear) measure approximation quality. Real-world applications: Netflix collaborative filtering uses low-rank matrices to predict movie ratings, and MRI reconstruction exploits natural image structure. PCA finds the best low-dimensional subspace by computing the dominant singular vectors of centered data.
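
A small sketch of Eckart-Young on a random matrix (sizes and seed arbitrary); the error norms come straight off the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated SVD: best rank-k approximation

# Error in the spectral norm is sigma_{k+1}; in Frobenius it is sqrt(sum of the rest squared).
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt((s[k:]**2).sum())))

# Sanity check (not a proof): a random rank-k matrix B does no better than A_k.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
print(np.linalg.norm(A - B, 2) >= np.linalg.norm(A - A_k, 2))
```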

Lecture 8: Norms of Vectors and Matrices Understanding vector p-norms (\(\|v\|_p\)) and when they satisfy the triangle inequality (only for \(p \geq 1\)). The “\(\frac{1}{2}\)-norm” would promote sparsity even more strongly, but it is not a true norm: its unit ball is non-convex. S-norms \(\sqrt{v^\top S v}\) (for positive definite \(S\)) stretch the unit circle into an ellipse. Matrix norms: spectral \(\|A\|_2 = \sigma_1\), Frobenius \(\|A\|_F = \sqrt{\sum \sigma_i^2}\), nuclear \(\|A\|_N = \sum \sigma_i\)—all expressed via singular values.
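
A quick check that the three matrix norms are functions of the singular values (vector and matrix chosen for illustration):

```python
import numpy as np

v = np.array([3., -4.])
print([np.linalg.norm(v, p) for p in (1, 2, np.inf)])    # [7.0, 5.0, 4.0]

A = np.array([[3., 0.],
              [4., 5.]])
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # spectral = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((s**2).sum())))   # Frobenius
print(np.isclose(np.linalg.norm(A, 'nuc'), s.sum()))                 # nuclear
```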

Lecture 9: Four Ways to Solve Least Squares Problems Four equivalent methods for solving \(Ax = b\) in the least-squares sense when no exact solution exists (typically \(A\) tall and \(b\) outside \(C(A)\)): pseudo-inverse \(\hat{x} = A^+ b\) using the SVD, normal equations \(A^T A \hat{x} = A^T b\), algebraic minimization of \(\|Ax - b\|_2^2\), and geometric projection of \(b\) onto \(C(A)\). All four give the same solution when \(A^T A\) is invertible: \(\hat{x} = (A^T A)^{-1}A^T b\).
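
The four routes on a tiny fitting problem of my own (a line through three points); all agree:

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])            # tall: b is generally outside C(A)
b = np.array([6., 0., 0.])

x_pinv = np.linalg.pinv(A) @ b                       # 1) pseudo-inverse via SVD
x_norm = np.linalg.solve(A.T @ A, A.T @ b)           # 2) normal equations
x_lsq, *_ = np.linalg.lstsq(A, b, rcond=None)        # 3) minimize ||Ax - b||^2
print(np.allclose(x_pinv, x_norm), np.allclose(x_norm, x_lsq))

# 4) Geometry: A x_hat is the projection of b onto C(A), so the residual is orthogonal to C(A).
residual = b - A @ x_lsq
print(np.allclose(A.T @ residual, 0))
```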

Lecture 10: Survey of Difficulties of Ax=b Comprehensive survey of challenges in solving \(Ax=b\): overdetermined systems require least squares, underdetermined systems select minimum-norm solutions via \(A^+b\), ill-conditioned problems add a ridge penalty \(\min \|Ax-b\|^2+\delta^2\|x\|^2\), large systems use Krylov methods, and massive-scale problems employ randomized linear algebra. The Moore-Penrose pseudo-inverse unifies all regimes: exact inverse when square and invertible, least-squares solution when overdetermined, minimum-norm solution when underdetermined, and the zero-regularization limit of ridge regression. Deep learning operates in the underdetermined regime, where gradient descent implicitly biases toward low-norm solutions.
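
A sketch of two of these regimes, using an underdetermined \(1\times 3\) system of my own; the `ridge` helper is a hypothetical name for the penalized solve:

```python
import numpy as np

# Underdetermined: infinitely many solutions, A^+ b picks the minimum-norm one.
A = np.array([[1., 2., 3.]])
b = np.array([6.])
x_min = np.linalg.pinv(A) @ b
print(np.allclose(A @ x_min, b), np.linalg.norm(x_min))

# Ridge: minimize ||Ax - b||^2 + delta^2 ||x||^2 via (A^T A + delta^2 I) x = A^T b.
def ridge(A, b, delta):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T @ b)

# As delta -> 0 the ridge solution approaches the minimum-norm (pseudo-inverse) solution.
for delta in (1.0, 0.1, 0.001):
    print(delta, np.linalg.norm(ridge(A, b, delta) - x_min))
```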

Lecture 11: Minimize ||x|| subject to Ax=b Three norms produce different solution geometries: L1 promotes sparsity (diamond touches constraint at corner), L2 yields smooth isotropic solutions (circle tangency), and L∞ balances components (square equalization). QR factorization via Gram–Schmidt computes minimum-norm solutions; column pivoting improves numerical stability for nearly dependent columns. Krylov subspace methods solve large sparse systems by restricting the search to low-dimensional spaces built from matrix–vector products, with Arnoldi/Lanczos orthogonalization ensuring stability.
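
A minimal sketch of the L2 case: the minimum-norm solution of an underdetermined system computed through a QR factorization of \(A^\top\) (the matrix, right-hand side, and null-space vector are my own):

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 1.]])
b = np.array([3., 2.])

# A^T = Q R with orthonormal Q (Gram-Schmidt in exact arithmetic);
# the minimum L2-norm solution is x = A^T (A A^T)^{-1} b = Q R^{-T} b.
Q, R = np.linalg.qr(A.T)
x = Q @ np.linalg.solve(R.T, b)
print(np.allclose(A @ x, b), np.linalg.norm(x))

# Adding a null-space vector keeps Ax = b but can only increase the L2 norm.
x_other = x + np.array([2., -1., 1.])       # A @ [2, -1, 1] = 0
print(np.allclose(A @ x_other, b), np.linalg.norm(x_other))
```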

Lecture 12: Computing Eigenvalues and Eigenvectors Eigenvalues are defined by \(\det(A-\lambda I)=0\) but computed by stable similarity transformations. QR iteration transforms matrices toward triangular form where eigenvalues appear on the diagonal; shifted QR with Wilkinson shifts accelerates convergence. Reduction to Hessenberg (or tridiagonal for symmetric matrices) enables efficient iteration. For large sparse problems, Krylov subspace methods (Arnoldi/Lanczos) approximate eigenvalues using only matrix-vector products—the Abel–Ruffini theorem explains why closed-form solutions fail for \(n \ge 5\).
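
An unshifted QR iteration on a small symmetric (tridiagonal) matrix of my own; each step is a similarity transform, and the diagonal settles onto the eigenvalues (shifts would only speed this up):

```python
import numpy as np

S = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])                 # symmetric tridiagonal, eigenvalues 3 and 3 +/- sqrt(3)
A_k = S.copy()
for _ in range(100):
    Q, R = np.linalg.qr(A_k)
    A_k = R @ Q                              # similar to A_k: R Q = Q^T A_k Q

print(np.round(np.sort(np.diag(A_k)), 6))
print(np.round(np.linalg.eigvalsh(S), 6))    # matches the diagonal of the (near-)triangular limit
```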

Lecture 13: Randomized Matrix Multiplication Randomized matrix multiplication approximates \(AB\) by sampling rank-1 outer products (column \(j\) of \(A\) times row \(j\) of \(B\)). The estimator is unbiased, and its variance is minimized by sampling index \(j\) with probability proportional to \(\|a_j\|\,\|b_j\|\). This yields a low-rank approximation whose expected error decays like \(1/\sqrt{s}\) in the number of samples \(s\).
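
A norm-weighted sampling sketch (matrix sizes, seed, and the sample count `s` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 200))
B = rng.standard_normal((200, 40))

# Sample column j of A with the matching row j of B, with probability ~ ||a_j|| ||b_j||.
norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = norms / norms.sum()
s = 100
idx = rng.choice(A.shape[1], size=s, p=p)

# Unbiased estimator: average of the scaled rank-1 outer products a_j b_j^T / p_j.
approx = sum(np.outer(A[:, j], B[j, :]) / (s * p[j]) for j in idx)

rel_err = np.linalg.norm(approx - A @ B, 'fro') / np.linalg.norm(A @ B, 'fro')
print(rel_err)                               # shrinks roughly like 1/sqrt(s)
```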