
ickma.dev — Notes on Deep Learning and Math

A growing collection of structured study notes and visual explanations — written for clarity, reproducibility, and long-term memory.

Latest Updates

∇ Goodfellow Deep Learning Book 64 chapters

My notes on the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Chapter 20: Deep Generative Models Deep generative modeling spans energy-based models, directed latent-variable models, and implicit generators. This chapter surveys RBMs, DBNs/DBMs, VAEs, GANs, autoregressive models, and evaluation pitfalls.

Chapter 19: Approximate Inference Exact posterior inference is intractable in deep latent models, so we optimize the ELBO instead. EM, MAP inference, and mean-field variational updates provide scalable approximations.
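
The bound in question is the standard ELBO decomposition, stated here in simplified notation for reference:

\[
\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\!\big[\log p_\theta(x,z) - \log q(z)\big]}_{\mathcal{L}(q,\theta)\ \text{(ELBO)}} \;+\; \operatorname{KL}\!\big(q(z)\,\|\,p_\theta(z\mid x)\big),
\]

so maximizing \(\mathcal{L}\) over \(q\) shrinks the gap to the true posterior, while maximizing over \(\theta\) pushes up the likelihood itself.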

Chapter 18: Confronting the Partition Function Energy-based models require a partition function for normalization. This chapter follows how \(\nabla_\theta \log Z(\theta)\) enters the log-likelihood gradient and surveys training strategies like contrastive divergence, pseudolikelihood, score matching, NCE, and AIS that avoid or estimate \(Z\).
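
Concretely, for an energy-based model \(p(x;\theta)=\tilde p(x;\theta)/Z(\theta)\) the gradient splits into a positive and a negative phase:

\[
\nabla_\theta \log p(x;\theta) \;=\; \nabla_\theta \log \tilde p(x;\theta) \;-\; \nabla_\theta \log Z(\theta),
\qquad
\nabla_\theta \log Z(\theta) \;=\; \mathbb{E}_{x\sim p(x;\theta)}\big[\nabla_\theta \log \tilde p(x;\theta)\big],
\]

which is why the methods above either approximate the model expectation (contrastive divergence) or sidestep \(Z\) entirely (pseudolikelihood, score matching, NCE).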

Chapter 17: Monte Carlo Methods Monte Carlo estimation replaces intractable expectations with sample averages. Importance sampling reweights proposal draws to reduce variance, while Markov chain methods like Gibbs sampling generate dependent samples when direct sampling is infeasible. Tempering improves mixing across multimodal landscapes.
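
A minimal importance-sampling sketch with toy numbers (target and proposal chosen only for illustration): estimate \(\mathbb{E}_{p}[x^2]=1\) for \(p=\mathcal{N}(0,1)\) using draws from a wider proposal \(q=\mathcal{N}(0,2^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: E_p[x^2] = 1 for p = N(0, 1), estimated with draws from
# a wider proposal q = N(0, 2^2), reweighted by w(x) = p(x) / q(x).
def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=100_000)                 # samples from q
w = np.exp(log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, 2.0))

print(np.mean(w * x**2))                               # ≈ 1.0
```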

📄 Papers in Deep Learning 1 note

Paper reading notes that focus on key ideas, math intuition, and practical takeaways.

LoRA: Low-Rank Adaptation of Large Language Models Freeze the base model and learn a low‑rank update \(\Delta W=BA\) for selected layers, enabling efficient fine‑tuning with minimal storage.
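
A minimal numpy sketch of the idea (shapes, rank, and initialization are illustrative; real LoRA implementations also scale the update by \(\alpha/r\) and train \(B, A\) with the usual optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4            # illustrative sizes, r << d

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
B = np.zeros((d_out, r))              # B starts at zero, so Delta W = B @ A = 0 initially
A = 0.01 * rng.normal(size=(r, d_in)) # only A and B would receive gradients

def lora_forward(x):
    # y = (W0 + B A) x, computed without materializing Delta W
    return W0 @ x + B @ (A @ x)

print(lora_forward(rng.normal(size=d_in)).shape)  # (64,)
```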

🧪 Theory-to-Repro 1 note

Low-level ML understanding and paper reproduction through derivations and code.

Linear Regression via Three Solvers Solve the same least-squares objective with pseudo-inverse, convex optimization, and SGD, then compare assumptions and scalability.
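
As a quick sketch of two of the three routes on synthetic data (hyperparameters are illustrative; the convex-optimization solver is covered in the note itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Route 1: pseudo-inverse (SVD-based solution of the normal equations)
w_pinv = np.linalg.pinv(X) @ y

# Route 2: mini-batch SGD on the same least-squares objective
w_sgd = np.zeros(d)
lr, batch = 0.01, 32
for _ in range(200):
    idx = rng.permutation(n)
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        w_sgd -= lr * 2 * X[b].T @ (X[b] @ w_sgd - y[b]) / len(b)

print(w_pinv, w_sgd)   # both ≈ [2.0, -1.0, 0.5]
```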

∫ Calculus 1 note

Foundational notes on calculus, centered on rates of change, accumulation, and geometric intuition.

Gilbert Strang’s Calculus: Highlights A concise tour of derivatives, slopes, second derivatives, exponential growth, extrema, and the integral as accumulation—guided by graphs and intuition.

📐 MIT 18.06SC Linear Algebra 36 lectures

My journey through MIT’s Linear Algebra course, focusing on building intuition and making connections between fundamental concepts.

Lecture 27: Positive Definite Matrices and Minima Connecting positive definite matrices to multivariable calculus and optimization: the Hessian matrix, second derivative tests, and the geometric interpretation of quadratic forms as ellipsoids.
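
A small numerical illustration of the second-derivative test (toy function chosen for the example):

```python
import numpy as np

# f(x, y) = x**2 + x*y + 2*y**2 has a critical point at the origin.
# Its (constant) Hessian:
H = np.array([[2.0, 1.0],
              [1.0, 4.0]])

eigvals = np.linalg.eigvalsh(H)   # symmetric matrix -> real eigenvalues
print(eigvals)                    # ≈ [1.59, 4.41], both positive
print(np.all(eigvals > 0))        # True -> H is positive definite, the origin is a minimum
```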

Lecture 26: Complex Matrices and Fast Fourier Transform Extending linear algebra to complex vectors: Hermitian matrices, unitary matrices, and the Fast Fourier Transform algorithm that reduces DFT complexity from O(N²) to O(N log N).
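
A quick sanity check (toy size) that the \(O(N^2)\) matrix DFT and the FFT agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.normal(size=N) + 1j * rng.normal(size=N)

n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # O(N^2): explicit DFT matrix
X_direct = F @ x

X_fft = np.fft.fft(x)                          # O(N log N)
print(np.allclose(X_direct, X_fft))            # True
```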

Lecture 28: Similar Matrices and Jordan Form When matrices share eigenvalues but differ in structure: similar matrices represent the same transformation in different bases, and Jordan form reveals the canonical structure when diagonalization fails.

Lecture 25: Symmetric Matrices and Positive Definiteness The beautiful structure of symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral decomposition, and the important concept of positive definiteness.
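
The spectral theorem in a few lines of numpy (random symmetric matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
S = (M + M.T) / 2                               # symmetric

lam, Q = np.linalg.eigh(S)                      # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # spectral decomposition S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(5)))          # Q is orthogonal
```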

📐 MIT 18.065: Linear Algebra Applications 17 lectures

My notes from MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning—exploring how linear algebra powers modern applications.

Lecture 17: Rapidly Decreasing Singular Values Rapid singular-value decay makes matrices effectively low rank, enabling compression. Numerical rank captures this decay, and Sylvester equations explain why spectral separation drives it.
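
A small construction showing how fast singular-value decay turns into low numerical rank (the decay rate and tolerance are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = 10.0 ** -np.arange(n, dtype=float)      # sigma_k = 10^(-k): rapid decay
A = U @ np.diag(s) @ V.T

tol = 1e-8                                  # relative tolerance defining numerical rank
k = int(np.sum(s > tol * s[0]))
A_k = U[:, :k] @ np.diag(s[:k]) @ V.T[:k]   # best rank-k approximation (Eckart-Young)

print(k)                                    # numerical rank: 8
print(np.linalg.norm(A - A_k, 2))           # spectral-norm error = next singular value, 1e-8
```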

Lecture 16: Derivative of Inverse and Singular Values Order-sensitive derivatives give \(\frac{d}{dt}A^2 = \dot A A + A \dot A\). Singular value motion follows \(\dot\sigma=u^\top \dot A v\), Weyl’s inequality bounds eigenvalue shifts, and nuclear norm minimization enables matrix completion.
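
A finite-difference check of the singular-value formula (random matrices; assumes the top singular value is simple, which holds generically):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
Adot = rng.normal(size=(6, 4))                    # arbitrary direction of change

U, s, Vt = np.linalg.svd(A)
u, v = U[:, 0], Vt[0]                             # singular vectors of the top sigma

predicted = u @ Adot @ v                          # sigma_dot = u^T Adot v
eps = 1e-6
sig = lambda M: np.linalg.svd(M, compute_uv=False)[0]
numerical = (sig(A + eps * Adot) - sig(A - eps * Adot)) / (2 * eps)
print(predicted, numerical)                       # agree to ~6 digits
```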

Lecture 15: Matrix Derivatives and Eigenvalue Changes Differentiate the inverse with \(\dot A^{-1}=-A^{-1}\dot A A^{-1}\) and track eigenvalue motion via \(\dot\lambda=y^\top \dot A x\). Low-rank PSD updates interlace eigenvalues and bound spectral shifts.
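
The inverse-derivative formula checked the same way, by central differences on a well-conditioned random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)) + 10 * np.eye(5)  # shifted to keep A comfortably invertible
Adot = rng.normal(size=(5, 5))

Ainv = np.linalg.inv(A)
predicted = -Ainv @ Adot @ Ainv               # d/dt A^{-1} = -A^{-1} Adot A^{-1}

eps = 1e-6
numerical = (np.linalg.inv(A + eps * Adot) - np.linalg.inv(A - eps * Adot)) / (2 * eps)
print(np.max(np.abs(predicted - numerical)))  # small finite-difference error, ≈1e-9 or less
```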

Lecture 14: Low Rank Changes and Their Inverse Low-rank updates reuse an existing inverse via the Sherman–Morrison–Woodbury identity. Rank-1 updates power recursive least squares; the full SMW formula updates \((A-UV^\top)^{-1}\) in terms of \(A^{-1}\), avoiding a full recomputation.
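
A direct numerical check of the identity (sizes and the diagonal shift are chosen so both \(A\) and \(A-UV^\top\) stay well conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
A = rng.normal(size=(n, n)) + 10 * n * np.eye(n)   # strong diagonal shift for conditioning
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))

Ainv = np.linalg.inv(A)
# SMW: (A - U V^T)^{-1} = A^{-1} + A^{-1} U (I - V^T A^{-1} U)^{-1} V^T A^{-1}
core = np.linalg.inv(np.eye(k) - V.T @ Ainv @ U)
smw = Ainv + Ainv @ U @ core @ V.T @ Ainv

print(np.allclose(smw, np.linalg.inv(A - U @ V.T)))  # True
```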

📐 Stanford EE 364A: Convex Optimization 15 lectures

My notes from Stanford EE 364A: Convex Optimization—theory and applications of optimization problems.

Chapter 5.1: The Lagrange Dual Function The dual function takes the infimum of the Lagrangian over \(x\), producing a concave lower bound on the primal optimum and revealing a direct connection to conjugate functions.
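
One standard example makes this concrete: for \(\min_x x^\top x\) subject to \(Ax=b\),

\[
L(x,\nu) = x^\top x + \nu^\top(Ax - b), \qquad
g(\nu) = \inf_x L(x,\nu) = -\tfrac{1}{4}\,\nu^\top A A^\top \nu - b^\top \nu,
\]

since the infimum is attained at \(x = -\tfrac{1}{2}A^\top\nu\); \(g\) is concave in \(\nu\) and lower-bounds the primal optimum for every \(\nu\).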

Chapter 4.7: Vector Optimization Vector objectives are only partially ordered, so a global optimum may not exist. Pareto optimality captures efficient trade-offs, and scalarization recovers Pareto points via weighted sums.

Chapter 4.6: Generalized Inequality Constraints Generalized inequalities replace componentwise order with cone-induced order. Cone programs unify LP and SOCP, while SDPs use positive semidefinite cones to express linear matrix inequalities.

Chapter 4.5: Geometric Programming Geometric programming uses monomials and posynomials over positive variables; a log change of variables turns constraints into convex log-sum-exp and affine forms, enabling global optimization with applications like cantilever beam design.
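
Concretely, with \(y_i=\log x_i\) and \(b_k=\log c_k\), a monomial constraint becomes affine and a posynomial becomes log-sum-exp:

\[
c\,x_1^{a_1}\cdots x_n^{a_n} \le 1 \;\Longleftrightarrow\; a^\top y + \log c \le 0,
\qquad
\log\Big(\sum_k c_k\, x_1^{a_{1k}}\cdots x_n^{a_{nk}}\Big) \;=\; \log\sum_k \exp\big(a_k^\top y + b_k\big),
\]

which is convex in \(y\), so the transformed problem is a convex program.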