
ickma.dev — Notes on Deep Learning and Math

A growing collection of structured study notes and visual explanations — written for clarity, reproducibility, and long-term memory.

Latest Updates

∇ Deep Learning Book (61 chapters)

My notes on the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Chapter 17: Monte Carlo Methods Monte Carlo estimation replaces intractable expectations with sample averages. Importance sampling reweights proposal draws to reduce variance, while Markov chain methods like Gibbs sampling generate dependent samples when direct sampling is infeasible. Tempering improves mixing across multimodal landscapes.
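
A minimal NumPy sketch of importance sampling, estimating \(E_p[f(x)]\) from draws of a wider proposal \(q\); the Gaussian target, proposal, and test function here are arbitrary illustration choices, not anything specific from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: estimate E_p[f(x)] for p = N(0, 1) and f(x) = x**2,
# drawing from a wider proposal q = N(0, 2**2) and reweighting by p/q.
def log_p(x):                     # log-density of the target N(0, 1)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):                     # log-density of the proposal N(0, 4)
    return -0.5 * (x / 2)**2 - np.log(2) - 0.5 * np.log(2 * np.pi)

def f(x):
    return x**2

n = 100_000
x = rng.normal(0.0, 2.0, size=n)          # samples from the proposal q
w = np.exp(log_p(x) - log_q(x))           # importance weights p(x)/q(x)

estimate = np.mean(w * f(x))              # unbiased estimate of E_p[f] = 1
print(estimate)                           # close to 1.0
```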

Chapter 16: Structured Probabilistic Models for Deep Learning Structured probabilistic models use graphs to factorize high-dimensional distributions into tractable components by encoding conditional independence. Directed models represent causal relationships and support efficient ancestral sampling. Undirected models capture symmetric dependencies through clique potentials normalized by a partition function. Energy-based models express the potentials via the Boltzmann distribution. Deep learning embraces approximate inference with large latent variable models, prioritizing scalability through distributed representations over exact computation.
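
A toy sketch of ancestral sampling from a small directed model \(p(a)\,p(b\mid a)\,p(c\mid b)\) over binary variables; the probability tables below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy directed model over binary variables: p(a, b, c) = p(a) p(b | a) p(c | b).
# The probability tables are arbitrary, chosen only to illustrate the idea.
p_a = 0.3                                  # P(a = 1)
p_b_given_a = {0: 0.2, 1: 0.8}             # P(b = 1 | a)
p_c_given_b = {0: 0.1, 1: 0.7}             # P(c = 1 | b)

def ancestral_sample():
    # Sample each node after its parents, in topological order: a -> b -> c.
    a = rng.random() < p_a
    b = rng.random() < p_b_given_a[int(a)]
    c = rng.random() < p_c_given_b[int(b)]
    return int(a), int(b), int(c)

samples = [ancestral_sample() for _ in range(5)]
print(samples)
```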

Chapter 15: Representation Learning Greedy layer-wise pretraining learns meaningful representations through unsupervised learning of hierarchical features, providing better initialization than random weights. Transfer learning enables knowledge sharing across tasks by reusing learned representations—generic early-layer features transfer well while late layers adapt to task-specific patterns. Semi-supervised learning leverages both labeled and unlabeled data to discover disentangled causal factors that generate observations. Distributed representations enable exponential gains in capacity through shared features rather than symbolic codes.

Chapter 14: Autoencoders Autoencoders learn compressed representations by training encoder \(h=f(x)\) and decoder \(r=g(h)\) to reconstruct inputs through a bottleneck. Undercomplete autoencoders constrain \(\dim(h)<\dim(x)\) to learn meaningful features. Regularized variants include sparse autoencoders (L1 penalty as Laplace prior), denoising autoencoders (learn manifold structure from corrupted inputs), and contractive autoencoders (penalize Jacobian for local invariance). Applications: dimensionality reduction, semantic hashing, and manifold learning.
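
A bare-bones sketch of an undercomplete autoencoder: a linear encoder/decoder pair trained by full-batch gradient descent on random data. The sizes, learning rate, and data are arbitrary illustration choices; real autoencoders add nonlinearities, regularizers, and minibatch training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal linear undercomplete autoencoder trained with gradient descent.
d, k, n = 20, 5, 500                  # input dim, code dim (k < d), samples
X = rng.normal(size=(d, n))

W1 = 0.1 * rng.normal(size=(k, d))    # encoder: h = W1 x
W2 = 0.1 * rng.normal(size=(d, k))    # decoder: r = W2 h
lr = 0.01

for step in range(2000):
    H = W1 @ X                        # codes
    R = W2 @ H                        # reconstructions
    E = R - X                         # reconstruction error
    grad_W2 = 2 / n * E @ H.T         # gradient of mean squared error w.r.t. W2
    grad_W1 = 2 / n * W2.T @ E @ X.T  # gradient w.r.t. W1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print("final reconstruction MSE:", np.mean((W2 @ W1 @ X - X) ** 2))
```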

📐 MIT 18.06SC Linear Algebra (36 lectures)

My journey through MIT’s Linear Algebra course, focusing on building intuition and making connections between fundamental concepts.

Lecture 27: Positive Definite Matrices and Minima Connecting positive definite matrices to multivariable calculus and optimization: the Hessian matrix, second derivative tests, and the geometric interpretation of quadratic forms as ellipsoids.
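
A small NumPy check of the second-derivative test, using the quadratic \(f(x,y)=x^2+xy+y^2\) as a stand-in example: positive definiteness of the Hessian, verified by its eigenvalues or by a successful Cholesky factorization, certifies the minimum at the origin.

```python
import numpy as np

# Second-derivative test for f(x, y) = x**2 + x*y + y**2 at its critical
# point (0, 0): the Hessian is constant, and positive definiteness of the
# Hessian certifies a strict minimum.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # Hessian of f

# Test 1: all eigenvalues positive.
print(np.linalg.eigvalsh(H))        # [1. 3.] -> positive definite

# Test 2: Cholesky factorization succeeds only for positive definite matrices.
try:
    np.linalg.cholesky(H)
    print("Cholesky succeeded: H is positive definite, so (0, 0) is a minimum")
except np.linalg.LinAlgError:
    print("Cholesky failed: H is not positive definite")
```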

Lecture 26: Complex Matrices and Fast Fourier Transform Extending linear algebra to complex vectors: Hermitian matrices, unitary matrices, and the Fast Fourier Transform algorithm that reduces DFT complexity from O(N²) to O(N log N).
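
A compact radix-2 Cooley–Tukey FFT sketch (input length assumed to be a power of two), checked against NumPy's built-in FFT; each level halves the problem, which is where the O(N log N) count comes from.

```python
import numpy as np

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])                       # FFT of even-indexed samples
    odd = fft(x[1::2])                        # FFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.default_rng(0).normal(size=8)
print(np.allclose(fft(x), np.fft.fft(x)))     # True
```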

Lecture 28: Similar Matrices and Jordan Form When matrices share eigenvalues but differ in structure: similar matrices represent the same transformation in different bases, and Jordan form reveals the canonical structure when diagonalization fails.
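
A quick numerical check that similar matrices share eigenvalues, using random matrices (the random \(M\) is assumed invertible, which holds generically):

```python
import numpy as np

rng = np.random.default_rng(0)

# Similar matrices B = M^{-1} A M represent the same linear map in a
# different basis, so they share eigenvalues.
A = rng.normal(size=(4, 4))
M = rng.normal(size=(4, 4))           # assumed invertible (generic random matrix)
B = np.linalg.inv(M) @ A @ M

eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_B = np.sort_complex(np.linalg.eigvals(B))
print(np.allclose(eig_A, eig_B))      # True (up to floating-point error)
```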

Lecture 25: Symmetric Matrices and Positive Definiteness The beautiful structure of symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral decomposition, and the important concept of positive definiteness.
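
A short NumPy sketch of the spectral decomposition \(S = Q\Lambda Q^\top\) for a random symmetric matrix, checking the reconstruction, the orthonormality of the eigenvectors, and the eigenvalue-sign test for positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spectral theorem for a symmetric matrix: S = Q diag(lam) Q^T with real
# eigenvalues lam and orthonormal eigenvectors in the columns of Q.
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2                          # symmetrize a random matrix

lam, Q = np.linalg.eigh(S)                 # eigh: symmetric/Hermitian solver
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # spectral decomposition: True
print(np.allclose(Q.T @ Q, np.eye(5)))          # eigenvectors orthonormal: True
print(np.all(lam > 0))                          # positive definite iff all eigenvalues > 0
```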

📐 MIT 18.065: Linear Algebra Applications (13 lectures)

My notes from MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning—exploring how linear algebra powers modern applications.

Lecture 13: Randomized Matrix Multiplication Randomized matrix multiplication approximates \(AB\) by sampling rank-1 outer products. The estimator remains unbiased, and its variance is minimized by sampling column/row indices with probability proportional to \(\|a_j\|\,\|b_j\|\). This yields a low-rank approximation whose error decays as \(1/\sqrt{s}\) in the number of samples \(s\).
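
A minimal sketch of the sampling scheme: draw \(s\) column/row indices with probability proportional to \(\|a_j\|\,\|b_j\|\), average the rescaled rank-1 outer products, and compare against the exact product (the matrix sizes and \(s\) are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate AB by sampling s rank-1 outer products a_j b_j^T with
# probabilities proportional to ||a_j|| * ||b_j||, which keeps the
# estimator unbiased while minimizing its variance.
m, n, p, s = 200, 300, 150, 50
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p))

col_norms = np.linalg.norm(A, axis=0)      # ||a_j|| for each column of A
row_norms = np.linalg.norm(B, axis=1)      # ||b_j|| for each row of B
probs = col_norms * row_norms
probs /= probs.sum()

idx = rng.choice(n, size=s, p=probs)       # sampled column/row indices
approx = sum(np.outer(A[:, j], B[j, :]) / (s * probs[j]) for j in idx)

rel_err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with s={s} samples: {rel_err:.3f}")
```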

Lecture 12: Computing Eigenvalues and Eigenvectors Eigenvalues are defined by \(\det(A-\lambda I)=0\) but computed by stable similarity transformations. QR iteration transforms matrices toward triangular form where eigenvalues appear on the diagonal; shifted QR with Wilkinson shifts accelerates convergence. Reduction to Hessenberg (or tridiagonal for symmetric matrices) enables efficient iteration. For large sparse problems, Krylov subspace methods (Arnoldi/Lanczos) approximate eigenvalues using only matrix-vector products.
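
A stripped-down sketch of unshifted QR iteration on a random symmetric matrix: each step is a similarity transform, so eigenvalues are preserved and accumulate on the diagonal. Production solvers add the Hessenberg/tridiagonal reduction and Wilkinson shifts mentioned above; this sketch omits both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unshifted QR iteration: factor Ak = Q R, then set Ak = R Q = Q^T Ak Q.
# For a symmetric matrix the iterates approach a diagonal matrix whose
# entries are the eigenvalues.
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2                      # symmetric for clean convergence

Ak = A.copy()
for _ in range(200):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q                         # similarity transform preserves eigenvalues

print(np.sort(np.diag(Ak)))            # approximate eigenvalues
print(np.sort(np.linalg.eigvalsh(A)))  # reference values from LAPACK
```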

Lecture 11: Minimize ||x|| subject to Ax=b Three norms produce different solution geometries: L1 promotes sparsity (diamond touches constraint at corner), L2 yields smooth isotropic solutions (circle tangency), and L∞ balances components (square equalization). QR factorization via Gram–Schmidt computes minimum-norm solutions; column pivoting improves numerical stability. Krylov subspace methods solve large sparse systems via matrix–vector products.
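
A sketch of the L2 case on a random underdetermined system: the minimum-norm solution comes from the pseudo-inverse, or equivalently from a QR factorization of \(A^\top\) (computed here with NumPy's QR rather than hand-rolled Gram–Schmidt).

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimum L2-norm solution of an underdetermined system Ax = b: among all
# solutions, x* = A^T (A A^T)^{-1} b = pinv(A) b has the smallest ||x||_2.
m, n = 3, 8                                # fewer equations than unknowns
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

x_pinv = np.linalg.pinv(A) @ b             # minimum-norm solution via pseudo-inverse

# Equivalent route via QR of A^T: with A^T = QR, x* = Q R^{-T} b.
Q, R = np.linalg.qr(A.T)
x_qr = Q @ np.linalg.solve(R.T, b)

print(np.allclose(A @ x_pinv, b), np.allclose(x_pinv, x_qr))
print("||x*|| =", np.linalg.norm(x_pinv))
```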

Lecture 10: Survey of Difficulties of Ax=b Comprehensive survey of challenges in solving \(Ax=b\): overdetermined systems require least squares, underdetermined systems select minimum-norm solutions via \(A^+b\), ill-conditioned problems add a ridge penalty \(\min \|Ax-b\|^2+\delta^2\|x\|^2\), large systems use Krylov methods, and massive-scale problems employ randomized linear algebra. The Moore–Penrose pseudo-inverse unifies all regimes: the exact inverse when \(A\) is square and invertible, the least-squares solution when overdetermined, the minimum-norm solution when underdetermined, and the zero-regularization limit of ridge regression. Deep learning operates in the underdetermined regime, where gradient descent implicitly biases toward low-norm solutions.
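
A small numerical sketch of the unification: the ridge solution \((A^\top A+\delta^2 I)^{-1}A^\top b\) approaches \(A^{+}b\) as \(\delta\to 0\), and for this overdetermined random example \(A^{+}b\) coincides with the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ridge solutions converge to the pseudo-inverse solution as delta -> 0,
# which here (overdetermined A) is also the least-squares solution.
A = rng.normal(size=(10, 4))               # overdetermined example
b = rng.normal(size=10)

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

for delta in [1.0, 1e-2, 1e-4, 1e-6]:
    x_ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(4), A.T @ b)
    print(f"delta={delta:.0e}  ||x_ridge - x_pinv|| = "
          f"{np.linalg.norm(x_ridge - x_pinv):.2e}")

print(np.allclose(x_pinv, x_lstsq))        # True: both give the least-squares solution
```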

📐 Stanford EE 364A: Convex Optimization (11 lectures)

My notes from Stanford EE 364A: Convex Optimization—theory and applications of optimization problems.

Chapter 4.4: Quadratic Optimization Problems Quadratic programs minimize convex quadratic objectives with affine constraints; QCQPs allow quadratic inequalities. SOCPs encode norm constraints and unify LP/QP/QCQP formulations. Robust LP handles uncertainty using ellipsoidal sets or Gaussian chance constraints, producing SOCP reformulations.
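
A minimal sketch of the simplest case, an equality-constrained QP, solved directly from its linear KKT system; the data are small arbitrary numbers, and inequality-constrained QPs or QCQPs need a proper solver.

```python
import numpy as np

# Equality-constrained QP: minimize (1/2) x^T P x + q^T x  subject to  Ax = b.
# With P positive definite, the KKT conditions are a linear system:
#   [P  A^T] [x ]   [-q]
#   [A   0 ] [nu] = [ b]
P = np.array([[4.0, 1.0],
              [1.0, 2.0]])                 # positive definite
q = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])                 # single equality constraint x1 + x2 = 1
b = np.array([1.0])

n, m = P.shape[0], A.shape[0]
KKT = np.block([[P, A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

print("x* =", x, " dual nu* =", nu)
print("constraint Ax:", A @ x)             # equals b = [1.]
```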

Chapter 4.3: Linear Optimization Linear programs (LP) minimize linear objectives subject to affine constraints. Geometric interpretation: level sets orthogonal to the objective vector \(c\) touch the polyhedral feasible set. Applications: diet problem (minimize cost while meeting nutritional requirements), piecewise-linear minimization (epigraph formulation \(\min t\) s.t. \(a_i^\top x+b_i\le t\)), Chebyshev center (largest inscribed ball), linear-fractional programs (convert to LP via \(y=xz\), \(z=1/(e^\top x+f)\)). Von Neumann growth model: maximize the minimum sectoral growth rate using quasiconvex bisection.
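
A sketch of the Chebyshev-center LP on a small arbitrary polyhedron, solved with scipy.optimize.linprog: maximize \(r\) subject to \(a_i^\top x + r\|a_i\|_2 \le b_i\).

```python
import numpy as np
from scipy.optimize import linprog

# Chebyshev center of the polyhedron {x : a_i^T x <= b_i}: the largest
# inscribed ball solves  maximize r  s.t.  a_i^T x + r ||a_i||_2 <= b_i.
# The polyhedron below is the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1.
A = np.array([[-1.0, 0.0],     # -x1        <= 0
              [0.0, -1.0],     #       -x2  <= 0
              [1.0, 1.0]])     #  x1 + x2   <= 1
b = np.array([0.0, 0.0, 1.0])

norms = np.linalg.norm(A, axis=1)
# Decision variables z = (x1, x2, r); linprog minimizes, so use -r as objective.
c = np.array([0.0, 0.0, -1.0])
A_ub = np.hstack([A, norms[:, None]])
res = linprog(c, A_ub=A_ub, b_ub=b,
              bounds=[(None, None), (None, None), (0, None)])

print("center:", res.x[:2], " radius:", res.x[2])
```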

Chapter 4.2: Convex Optimization Convex problems in standard form (convex objective/inequalities, affine equalities), local vs global optimality, first-order conditions, equivalent reformulations (change of variables, slack variables), and quasiconvex problems via bisection.
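
A sketch of quasiconvex minimization by bisection, illustrated on a linear-fractional objective over a small arbitrary polyhedron: each level-\(t\) sublevel set is a halfspace, so every bisection step reduces to an LP feasibility problem (solved here with scipy.optimize.linprog).

```python
import numpy as np
from scipy.optimize import linprog

# Minimize (c^T x + d) / (e^T x + f) over {Ax <= b}, assuming e^T x + f > 0
# on the feasible set. For a fixed level t, the sublevel set is the
# halfspace (c - t e)^T x <= t f - d. All numbers are arbitrary test data.
c, d = np.array([1.0, 2.0]), 1.0
e, f = np.array([1.0, 1.0]), 2.0
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])   # x >= 0, x1 + x2 <= 1
b = np.array([0.0, 0.0, 1.0])

def feasible(t):
    # LP feasibility check: is there x with Ax <= b and (c - t e)^T x <= t f - d?
    A_ub = np.vstack([A, (c - t * e)[None, :]])
    b_ub = np.concatenate([b, [t * f - d]])
    res = linprog(np.zeros(2), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 2)
    return res.status == 0                              # 0 = feasible point found

lo, hi = 0.0, 10.0                                      # bracket on the optimal value
for _ in range(50):
    mid = (lo + hi) / 2
    if feasible(mid):
        hi = mid                                        # optimal value <= mid
    else:
        lo = mid                                        # optimal value > mid

print("optimal value is approximately", hi)
```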

Chapter 4.1: Optimization Problems Basic terminology (decision variables, objective, constraints, domain, feasibility, optimal values, local optimality), standard form conversion, equivalent problems (change of variables, slack variables, constraint elimination, epigraph form), and parameter vs oracle problem descriptions.