MIT 18.065: Linear Algebra Applications

Author

Chao Ma

Published

November 29, 2025

All Lectures

My notes from MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning. This course explores how linear algebra powers modern applications in data analysis, signal processing, and machine learning.


Lecture 1: The Column Space of A Contains All Vectors Ax How column combinations define \(C(A)\), why rank counts independent columns, and how the CR factorization exposes a basis for the column space and the row space. A column-by-row view of \(AB\) shows every column of \(AB\) lies in \(C(A)\).
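
A quick NumPy sketch of \(A = CR\) on a small example of my own (rank 2, with the third column a combination of the first two); here \(R\) is recovered by solving \(CR = A\) rather than by row reduction:

```python
import numpy as np

# Rank-2 example: column 3 equals 2*(column 1) + 2*(column 2).
A = np.array([[1., 3., 8.],
              [1., 2., 6.],
              [0., 1., 2.]])
r = np.linalg.matrix_rank(A)                  # counts independent columns: r = 2

C = A[:, :r]                                  # the first r columns are independent here
R, *_ = np.linalg.lstsq(C, A, rcond=None)     # solve C R = A for the 2 x 3 matrix R

print(r)                                      # 2
print(np.allclose(C @ R, A))                  # True: every column of A lies in C(A) = C(C)
```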

Lecture 2: Multiplying and Factoring Matrices From elimination and \(LU\) to rank-1 updates, \(QR\), spectral decomposition, diagonalization, and SVD. Ends with the four fundamental subspaces and a null space example.
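
A short sketch checking several of these factorizations numerically on a small symmetric matrix I made up (SciPy's `lu` is used only for the elimination step):

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4., 2., 1.],
              [2., 3., 0.],
              [1., 0., 2.]])

P, L, U = lu(A)                    # elimination recorded as A = P L U
Q, R = np.linalg.qr(A)             # orthonormal columns times upper triangular
lam, X = np.linalg.eig(A)          # diagonalization A = X Lambda X^{-1}
Us, s, Vt = np.linalg.svd(A)       # A = U Sigma V^T

print(np.allclose(P @ L @ U, A), np.allclose(Q @ R, A))
print(np.allclose(X @ np.diag(lam) @ np.linalg.inv(X), A))
print(np.allclose(Us @ np.diag(s) @ Vt, A))
```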

Lecture 3: Orthonormal Columns Orthogonal matrices with \(Q^\top Q = I\): rotation matrices that preserve length, reflection matrices (including Householder reflections \(H = I - 2uu^\top\)), Hadamard matrices built recursively for signal processing, and wavelets that capture multiscale structure. Eigenvectors of symmetric matrices are orthogonal; the lecture culminates in the discrete Fourier transform.
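
A minimal check of the two constructions, with a unit vector and the \(2\times 2\) Hadamard block chosen for illustration:

```python
import numpy as np

# Householder reflection H = I - 2 u u^T for a unit vector u.
u = np.array([1., 2., 2.]) / 3.0                 # ||u|| = 1
H = np.eye(3) - 2.0 * np.outer(u, u)
print(np.allclose(H.T @ H, np.eye(3)))           # Q^T Q = I: H is orthogonal
x = np.array([1., -1., 4.])
print(np.isclose(np.linalg.norm(H @ x), np.linalg.norm(x)))  # reflections preserve length

# Hadamard matrices built recursively: H_{2n} = [[H_n, H_n], [H_n, -H_n]].
H2 = np.array([[1., 1.],
               [1., -1.]])
H4 = np.block([[H2, H2], [H2, -H2]])
print(np.allclose(H4.T @ H4, 4 * np.eye(4)))     # columns orthogonal, each with norm sqrt(4)
```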

Lecture 4: Eigenvalues and Eigenvectors Eigenvectors \(Ax = \lambda x\) preserve direction under transformation. Key properties: \(A^k x = \lambda^k x\), solving difference equations \(v_{k+1} = Av_k\) and differential equations. Similar matrices \(B = M^{-1}AM\) share eigenvalues. Symmetric matrices have real eigenvalues and orthogonal eigenvectors, enabling spectral decomposition \(S = Q\Lambda Q^\top\).
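
A small numerical illustration (the symmetric matrix \(S\) and the similarity \(M\) are my own choices):

```python
import numpy as np

S = np.array([[2., 1.],
              [1., 2.]])                          # symmetric: eigenvalues 1 and 3
lam, Q = np.linalg.eigh(S)                        # real eigenvalues, orthonormal eigenvectors
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))     # spectral decomposition S = Q Lambda Q^T

# Powers are easy once S is diagonalized: S^k = Q Lambda^k Q^T.
k = 5
print(np.allclose(np.linalg.matrix_power(S, k), Q @ np.diag(lam**k) @ Q.T))

# Similar matrices B = M^{-1} S M share the eigenvalues of S.
M = np.array([[1., 1.],
              [0., 1.]])
B = np.linalg.inv(M) @ S @ M
print(np.allclose(np.sort(np.linalg.eigvals(B)), lam))   # eigh already returned them sorted
```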

Lecture 5: Positive Definite and Semidefinite Matrices Positive definite matrices have all \(\lambda_i > 0\) and energy \(x^\top Sx > 0\) for all \(x \ne 0\). Factorize as \(S = AA^\top\) with full-rank \(A\). The quadratic form creates an energy bowl—convex and minimizable via gradient descent. This connects linear algebra to optimization: deep learning minimizes loss landscapes just as we minimize \(x^\top Sx\). Positive semidefinite relaxes to \(\lambda_i \ge 0\).
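
A tiny sketch with a hand-picked positive definite \(S\) (Cholesky supplies one full-rank \(A\) with \(S = AA^\top\)); the gradient-descent loop and step size are illustrative choices:

```python
import numpy as np

S = np.array([[2., 1., 0.],
              [1., 2., 1.],
              [0., 1., 2.]])
lam = np.linalg.eigvalsh(S)
print(np.all(lam > 0))                        # positive definite: all eigenvalues > 0

A = np.linalg.cholesky(S)                     # one full-rank A with S = A A^T
print(np.allclose(A @ A.T, S))

# The energy x^T S x is a convex bowl; gradient descent (gradient = 2 S x) slides to x = 0.
x = np.array([1.0, -2.0, 0.5])
step = 0.9 / lam.max()                        # safely below 2 / lambda_max
for _ in range(200):
    x = x - step * (2 * S @ x)
print(x @ S @ x)                              # energy is now essentially zero
```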

Lecture 6: Singular Value Decomposition (SVD) SVD factorizes any \(m \times n\) matrix as \(A = U\Sigma V^\top\): rotate-scale-rotate. Compute \(U\) from eigenvectors of \(AA^\top\), \(V\) from eigenvectors of \(A^\top A\), and \(\Sigma\) from square roots of the shared eigenvalues. Singular values relate to eigenvalues: for square matrices \(|\det A| = \prod \sigma_i\) while \(\det A = \prod \lambda_i\). Polar decomposition \(A = (U\Sigma U^\top)(UV^\top)\) splits any matrix into symmetric positive semidefinite × orthogonal. Exercises cover computing the SVD and understanding quadratic forms through eigendecomposition.
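
A numerical check of these identities on a random \(3\times 3\) matrix (seed chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))

U, s, Vt = np.linalg.svd(A)                   # A = U Sigma V^T
print(np.allclose(U @ np.diag(s) @ Vt, A))

# sigma_i^2 are the eigenvalues of A^T A (and of A A^T).
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A))))

# |det A| equals the product of singular values; det A is the product of eigenvalues.
print(np.isclose(abs(np.linalg.det(A)), s.prod()))

# Polar decomposition: symmetric positive semidefinite times orthogonal.
S_part = U @ np.diag(s) @ U.T
Q_part = U @ Vt
print(np.allclose(S_part @ Q_part, A), np.allclose(Q_part.T @ Q_part, np.eye(3)))
```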

Lecture 7: Eckart-Young Theorem - The Closest Rank k Matrix The Eckart-Young theorem proves that truncated SVD \(A_k = \sum_{i=1}^k \sigma_i u_i v_i^\top\) gives the best rank-\(k\) approximation: \(\|A - B\| \ge \|A - A_k\|\) for any rank-\(k\) matrix \(B\). Matrix norms (spectral, Frobenius, nuclear) measure approximation quality. Real-world applications: Netflix collaborative filtering uses low-rank matrices to predict movie ratings, and MRI reconstruction exploits natural image structure. PCA finds the best low-dimensional subspace by computing the dominant singular vectors of centered data.
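
A small sketch of Eckart-Young on a random matrix (sizes and seed arbitrary); the error norms come straight off the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated SVD: best rank-k approximation

# Error in the spectral norm is sigma_{k+1}; in Frobenius it is sqrt(sum of the rest squared).
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt((s[k:]**2).sum())))

# Sanity check (not a proof): a random rank-k matrix B does no better than A_k.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
print(np.linalg.norm(A - B, 2) >= np.linalg.norm(A - A_k, 2))
```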

Lecture 8: Norms of Vectors and Matrices Understanding vector p-norms (\(\|v\|_p\)) and when they satisfy the triangle inequality (only for \(p \geq 1\)). The “\(\frac{1}{2}\)-norm” would promote sparsity even more strongly, but it is not a true norm: its unit ball is non-convex. S-norms \(\sqrt{v^\top S v}\) (for positive definite \(S\)) stretch the unit circle into an ellipse. Matrix norms: spectral \(\|A\|_2 = \sigma_1\), Frobenius \(\|A\|_F = \sqrt{\sum \sigma_i^2}\), nuclear \(\|A\|_N = \sum \sigma_i\)—all expressed via singular values.
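
A quick check that the three matrix norms are functions of the singular values (vector and matrix chosen for illustration):

```python
import numpy as np

v = np.array([3., -4.])
print([np.linalg.norm(v, p) for p in (1, 2, np.inf)])    # [7.0, 5.0, 4.0]

A = np.array([[3., 0.],
              [4., 5.]])
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # spectral = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((s**2).sum())))   # Frobenius
print(np.isclose(np.linalg.norm(A, 'nuc'), s.sum()))                 # nuclear
```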

Lecture 9: Four Ways to Solve Least Squares Problems Four equivalent methods for solving \(Ax = b\) in the least-squares sense when no exact solution exists (typically \(A\) tall and \(b\) outside \(C(A)\)): pseudo-inverse \(\hat{x} = A^+ b\) using the SVD, normal equations \(A^T A \hat{x} = A^T b\), algebraic minimization of \(\|Ax - b\|_2^2\), and geometric projection of \(b\) onto \(C(A)\). All four give the same solution when \(A^T A\) is invertible: \(\hat{x} = (A^T A)^{-1}A^T b\).
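
The four routes on a tiny fitting problem of my own (a line through three points); all agree:

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])            # tall: b is generally outside C(A)
b = np.array([6., 0., 0.])

x_pinv = np.linalg.pinv(A) @ b                       # 1) pseudo-inverse via SVD
x_norm = np.linalg.solve(A.T @ A, A.T @ b)           # 2) normal equations
x_lsq, *_ = np.linalg.lstsq(A, b, rcond=None)        # 3) minimize ||Ax - b||^2
print(np.allclose(x_pinv, x_norm), np.allclose(x_norm, x_lsq))

# 4) Geometry: A x_hat is the projection of b onto C(A), so the residual is orthogonal to C(A).
residual = b - A @ x_lsq
print(np.allclose(A.T @ residual, 0))
```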

Lecture 10: Survey of Difficulties of Ax=b Comprehensive survey of challenges in solving \(Ax=b\): overdetermined systems require least squares, underdetermined systems select minimum-norm solutions via \(A^+b\), ill-conditioned problems add a ridge penalty \(\min \|Ax-b\|^2+\delta^2\|x\|^2\), large systems use Krylov methods, and massive-scale problems employ randomized linear algebra. The Moore-Penrose pseudo-inverse unifies all regimes: exact inverse when square and invertible, least-squares solution when overdetermined, minimum-norm solution when underdetermined, and the zero-regularization limit of ridge regression. Deep learning operates in the underdetermined regime, where gradient descent implicitly biases toward low-norm solutions.
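
A sketch of two of these regimes, using an underdetermined \(1\times 3\) system of my own; the `ridge` helper is a hypothetical name for the penalized solve:

```python
import numpy as np

# Underdetermined: infinitely many solutions, A^+ b picks the minimum-norm one.
A = np.array([[1., 2., 3.]])
b = np.array([6.])
x_min = np.linalg.pinv(A) @ b
print(np.allclose(A @ x_min, b), np.linalg.norm(x_min))

# Ridge: minimize ||Ax - b||^2 + delta^2 ||x||^2 via (A^T A + delta^2 I) x = A^T b.
def ridge(A, b, delta):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta**2 * np.eye(n), A.T @ b)

# As delta -> 0 the ridge solution approaches the minimum-norm (pseudo-inverse) solution.
for delta in (1.0, 0.1, 0.001):
    print(delta, np.linalg.norm(ridge(A, b, delta) - x_min))
```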

Lecture 11: Minimize ||x|| subject to Ax=b Three norms produce different solution geometries: L1 promotes sparsity (diamond touches constraint at corner), L2 yields smooth isotropic solutions (circle tangency), and L∞ balances components (square equalization). QR factorization via Gram–Schmidt computes minimum-norm solutions; column pivoting improves numerical stability for nearly dependent columns. Krylov subspace methods solve large sparse systems by restricting the search to low-dimensional spaces built from matrix–vector products, with Arnoldi/Lanczos orthogonalization ensuring stability.
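
A minimal sketch of the L2 case: the minimum-norm solution of an underdetermined system computed through a QR factorization of \(A^\top\) (the matrix, right-hand side, and null-space vector are my own):

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 1.]])
b = np.array([3., 2.])

# A^T = Q R with orthonormal Q (Gram-Schmidt in exact arithmetic);
# the minimum L2-norm solution is x = A^T (A A^T)^{-1} b = Q R^{-T} b.
Q, R = np.linalg.qr(A.T)
x = Q @ np.linalg.solve(R.T, b)
print(np.allclose(A @ x, b), np.linalg.norm(x))

# Adding a null-space vector keeps Ax = b but can only increase the L2 norm.
x_other = x + np.array([2., -1., 1.])       # A @ [2, -1, 1] = 0
print(np.allclose(A @ x_other, b), np.linalg.norm(x_other))
```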

Lecture 12: Computing Eigenvalues and Eigenvectors Eigenvalues are defined by \(\det(A-\lambda I)=0\) but computed by stable similarity transformations. QR iteration transforms matrices toward triangular form where eigenvalues appear on the diagonal; shifted QR with Wilkinson shifts accelerates convergence. Reduction to Hessenberg (or tridiagonal for symmetric matrices) enables efficient iteration. For large sparse problems, Krylov subspace methods (Arnoldi/Lanczos) approximate eigenvalues using only matrix-vector products—the Abel–Ruffini theorem explains why closed-form solutions fail for \(n \ge 5\).
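
An unshifted QR iteration on a small symmetric (tridiagonal) matrix of my own; each step is a similarity transform, and the diagonal settles onto the eigenvalues (shifts would only speed this up):

```python
import numpy as np

S = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])                 # symmetric tridiagonal, eigenvalues 3 and 3 +/- sqrt(3)
A_k = S.copy()
for _ in range(100):
    Q, R = np.linalg.qr(A_k)
    A_k = R @ Q                              # similar to A_k: R Q = Q^T A_k Q

print(np.round(np.sort(np.diag(A_k)), 6))
print(np.round(np.linalg.eigvalsh(S), 6))    # matches the diagonal of the (near-)triangular limit
```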

Lecture 13: Randomized Matrix Multiplication Randomized matrix multiplication approximates \(AB\) by sampling rank-1 outer products (column \(j\) of \(A\) times row \(j\) of \(B\)). The estimator is unbiased, and its variance is minimized by sampling index \(j\) with probability proportional to \(\|a_j\|\,\|b_j\|\). This yields a low-rank approximation whose expected error decays like \(1/\sqrt{s}\) in the number of samples \(s\).
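
A norm-weighted sampling sketch (matrix sizes, seed, and the sample count `s` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 200))
B = rng.standard_normal((200, 40))

# Sample column j of A with the matching row j of B, with probability ~ ||a_j|| ||b_j||.
norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
p = norms / norms.sum()
s = 100
idx = rng.choice(A.shape[1], size=s, p=p)

# Unbiased estimator: average of the scaled rank-1 outer products a_j b_j^T / p_j.
approx = sum(np.outer(A[:, j], B[j, :]) / (s * p[j]) for j in idx)

rel_err = np.linalg.norm(approx - A @ B, 'fro') / np.linalg.norm(A @ B, 'fro')
print(rel_err)                               # shrinks roughly like 1/sqrt(s)
```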