
ickma.dev — Notes on Deep Learning and Math

A growing collection of structured study notes and visual explanations — written for clarity, reproducibility, and long-term memory.

Latest Updates

∇ Goodfellow Deep Learning Book 64 chapters

My notes on the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Chapter 20: Deep Generative Models Deep generative modeling spans energy-based models, directed latent-variable models, and implicit generators. This chapter surveys RBMs, DBNs/DBMs, VAEs, GANs, autoregressive models, and evaluation pitfalls.

Chapter 19: Approximate Inference Exact posterior inference is intractable in deep latent models, so we optimize the ELBO instead. EM, MAP inference, and mean-field variational updates provide scalable approximations.

Chapter 18: Confronting the Partition Function Energy-based models require a partition function for normalization. This chapter follows how \(\nabla_\theta \log Z(\theta)\) enters the log-likelihood gradient and surveys training strategies like contrastive divergence, pseudolikelihood, score matching, NCE, and AIS that avoid or estimate \(Z\).
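The key identity behind that gradient can be checked numerically. A minimal sketch on a toy model I made up for illustration (four discrete states, energy \(E_\theta(x) = \theta x\)): the derivative \(\frac{d}{d\theta}\log Z(\theta)\) equals \(-\mathbb{E}_{p_\theta}[x]\), the model expectation that contrastive divergence tries to approximate with samples.

```python
import numpy as np

# Tiny discrete energy-based model: p(x) ∝ exp(-theta * x) over x in {0, 1, 2, 3}.
# Checks d/dtheta log Z(theta) = -E_p[x] against a finite-difference derivative.
xs = np.arange(4.0)

def log_Z(theta):
    return np.log(np.sum(np.exp(-theta * xs)))

theta = 0.7
p = np.exp(-theta * xs)
p /= p.sum()                                  # normalized model distribution

model_expectation = -np.sum(p * xs)           # analytic gradient: -E_p[x]
eps = 1e-6
fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)  # numeric gradient
```

In a deep model the sum over states is intractable, which is exactly why the chapter's sampling-based estimators exist.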

Chapter 17: Monte Carlo Methods Monte Carlo estimation replaces intractable expectations with sample averages. Importance sampling reweights proposal draws to reduce variance, while Markov chain methods like Gibbs sampling generate dependent samples when direct sampling is infeasible. Tempering improves mixing across multimodal landscapes.
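The importance-sampling idea fits in a few lines. A minimal sketch with distributions I chose for illustration (target N(2, 1), proposal N(0, 3)): draw from the proposal, reweight by the density ratio, and a self-normalized average recovers the target mean.

```python
import numpy as np

# Importance sampling sketch: estimate E_p[x] for target p = N(2, 1)
# using draws from a wider proposal q = N(0, 3), reweighted by p(x)/q(x).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 3.0, size=n)             # samples from the proposal q

def log_p(x):                                # target log-density N(2, 1)
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x):                                # proposal log-density N(0, 3)
    return -0.5 * (x / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2 * np.pi)

w = np.exp(log_p(x) - log_q(x))              # importance weights
estimate = np.sum(w * x) / np.sum(w)         # self-normalized estimator
```

The variance of this estimator blows up when q places too little mass where p is large, which motivates the variance-reduction discussion in the chapter.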

📄 Papers in Deep Learning 6 notes

Paper reading notes that focus on key ideas, math intuition, and practical takeaways.

Generative Adversarial Nets A minimax game between generator and discriminator, with the optimal discriminator derivation and the global optimum condition where generated and real distributions match.
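The optimal-discriminator formula \(D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))\) is easy to visualize numerically. A minimal sketch on 1D Gaussians of my choosing: with a mismatched generator \(D^*\) varies with \(x\); when the generator matches the data distribution, \(D^*\) collapses to 1/2 everywhere, the global-optimum condition.

```python
import numpy as np

# Optimal GAN discriminator: D*(x) = p_data(x) / (p_data(x) + p_g(x)).
def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 201)
p_data = gaussian(x, 0.0, 1.0)

p_g = gaussian(x, 1.0, 1.0)                  # mismatched generator
d_star = p_data / (p_data + p_g)             # varies across x

p_g_matched = gaussian(x, 0.0, 1.0)          # generator matches the data
d_matched = p_data / (p_data + p_g_matched)  # constant 1/2
```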

Why TPU Is Fast for Dot Product (First-Gen TPU) An early-TPU perspective (CNN/RNN era): focus on the systolic-array MMU to understand why ASIC specialization can greatly outperform general-purpose processors on matrix multiply-accumulate (MAC) inference workloads.

Transformer: Attention Is All You Need Attention-only sequence modeling with positional encoding, scaled dot-product attention, and multi-head projections for parallel long-range dependency learning.
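The core operation is compact enough to sketch directly. A minimal single-head version (shapes and values here are illustrative): \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — single-head, no masking or batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs several such maps in parallel on learned projections and concatenates the results.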

Attention: The Origin of Transformer Learnable alignment scores and a dynamic context vector replace the fixed encoder bottleneck in seq2seq models.

RL 3 notes

Course notes on reinforcement learning, starting with David Silver’s foundational lecture series.

David Silver RL Course - Lecture 3: Planning by Dynamic Programming Dynamic programming in known MDPs: optimal substructure, iterative policy evaluation, policy iteration, value iteration, and the classical gridworld examples.
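Value iteration itself fits in a few lines. A minimal sketch on a 1x4 deterministic gridworld I made up for illustration (terminal state 3, reward -1 per step, \(\gamma = 1\)): iterate \(V_{k+1}(s) = \max_a [r + \gamma V_k(s')]\) until the values stop changing.

```python
import numpy as np

# Value iteration on a tiny deterministic 1x4 gridworld.
n_states, gamma = 4, 1.0

def step(s, a):                  # a = -1 (left) or +1 (right), walls clamp
    return min(max(s + a, 0), n_states - 1)

V = np.zeros(n_states)
for _ in range(100):
    V_new = V.copy()
    for s in range(n_states - 1):            # state 3 is terminal, stays 0
        V_new[s] = max(-1 + gamma * V[step(s, a)] for a in (-1, 1))
    if np.max(np.abs(V_new - V)) < 1e-10:    # converged to the fixed point
        break
    V = V_new
```

The converged values are the negated distances to the terminal state, matching the gridworld examples in the lecture.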

David Silver RL Course - Lecture 2: Markov Decision Process Markov property, transition matrices, Markov reward processes, return and discounting, Bellman equations, and the move from prediction to control in MDPs.

David Silver RL Course - Lecture 1: Introduction to Reinforcement Learning What makes RL different from supervised learning, the agent-environment loop, Markov state, policy/value/model, and the three core RL tradeoffs.

ML HW-SW Codesign 5 notes

Notes on efficient AI systems where compression, deployment, and specialized hardware have to be designed together.

Efficient AI Lecture 6: Quantization (Part II) Post-training quantization granularity, clipping and calibration, AdaRound, QAT with STE, and binary/ternary quantization methods for pushing bit-widths lower while keeping accuracy under control.

Efficient AI Lecture 5: Quantization (Part I) Why low-bit arithmetic saves energy, how numeric formats trade off range and precision, and how K-means and linear quantization connect compression to hardware-friendly integer compute.
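Linear (affine) quantization is a small amount of code. A minimal sketch with parameters of my choosing (min/max calibration, int8 range): map floats to integers with a scale and zero point, then dequantize and measure the reconstruction error.

```python
import numpy as np

# Linear quantization sketch: float -> int8 via scale and zero point.
def linear_quantize(w, n_bits=8):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)        # float per integer step
    zero_point = np.round(qmin - w.min() / scale)      # integer mapped to 0.0
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)
q, s, z = linear_quantize(w)
max_err = np.max(np.abs(w - dequantize(q, s, z)))      # bounded by the scale
```

The hardware payoff is that the expensive inner loops run on the integer tensor q, with the scale folded in once at the end.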

Efficient AI Lecture 4: Pruning and Sparsity (Part II) Layer-wise pruning ratios, automatic pruning with AMC and NetAdapt, fine-tuning after pruning, and the hardware systems that turn sparsity into real speed and energy gains.

Efficient AI Lecture 3: Pruning and Sparsity (Part I) Why memory dominates energy, how pruning is formulated with an L0 constraint, the hardware tradeoff between unstructured and structured sparsity, and the main pruning criteria from magnitude to second-order and regression-based methods.
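Magnitude pruning, the simplest of those criteria, can be sketched directly (the function and sparsity target here are illustrative): zero out the fraction of weights with the smallest absolute value, a greedy stand-in for the L0-constrained objective.

```python
import numpy as np

# Magnitude pruning sketch: remove the smallest-|w| fraction of weights.
def magnitude_prune(w, sparsity):
    k = int(round(sparsity * w.size))        # number of weights to remove
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold             # keep only the large weights
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_pruned = magnitude_prune(w, sparsity=0.9)
achieved = 1.0 - np.count_nonzero(w_pruned) / w.size
```

This produces unstructured sparsity; the lecture's hardware tradeoff is that structured variants (pruning whole channels or blocks) are easier to accelerate.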

🧪 Theory-to-Repro 1 note

Low-level ML understanding and paper reproduction through derivations and code.

Linear Regression via Three Solvers Solve the same least-squares objective with pseudo-inverse, convex optimization, and SGD, then compare assumptions and scalability.
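Two of those solvers fit in a short sketch on synthetic data of my choosing (the convex-optimization route needs an external solver, so it is omitted here): the closed-form pseudo-inverse solution and plain SGD should agree on the same objective.

```python
import numpy as np

# Same least-squares problem, two solvers: pseudo-inverse vs. SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w_pinv = np.linalg.pinv(X) @ y               # closed form: w = X^+ y

w_sgd = np.zeros(3)
lr = 0.01
for epoch in range(200):
    for i in rng.permutation(len(y)):
        grad = (X[i] @ w_sgd - y[i]) * X[i]  # gradient of 0.5 * (x·w - y)^2
        w_sgd -= lr * grad
```

The pseudo-inverse is exact but costs a full matrix factorization; SGD trades precision for per-sample updates that scale to data that does not fit in memory.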

📐 Math 98 items

Probability, calculus, linear algebra, and optimization notes organized as one mathematical foundation.

MIT 6.041 Probability: Probability Models and Axioms Sample spaces, discrete and continuous models, probability axioms, uniform laws, and counting examples from the opening lecture of MIT 6.041.

Lecture 32: The Convolution Rule Polynomial convolution, cyclic convolution, the convolution theorem, and the 2D Laplacian connection behind filtering, Fourier diagonalization, and FFT acceleration.
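The convolution theorem is directly checkable in code. A minimal sketch on random vectors of my choosing: cyclic convolution computed term by term matches pointwise multiplication in the Fourier domain, which is the identity the FFT acceleration exploits.

```python
import numpy as np

# Convolution theorem: cyclic convolution in time = pointwise product in frequency.
rng = np.random.default_rng(0)
n = 8
a, b = rng.normal(size=n), rng.normal(size=n)

# Direct cyclic convolution: c[k] = sum_j a[j] * b[(k - j) mod n]
direct = np.array([sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)])

# FFT route: transform, multiply pointwise, invert
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
```

The direct sum costs O(n^2); the FFT route costs O(n log n), which is the whole point of diagonalizing convolution in the Fourier basis.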

Gilbert Strang’s Calculus: Six Functions, Six Rules, and Six Theorems A compact map of early calculus through six core function families, six derivative rules, and six foundational theorems linking computation to structure.

Lecture 31: Eigenvectors of Circulant Matrices, Fourier Matrix Roots-of-unity eigenvalues of the shift matrix, Fourier eigenvectors, why circulant matrices are diagonalized by the Fourier basis, and the link to cyclic convolution and FFT.