
ickma.dev — Notes on Deep Learning and Math

A growing collection of structured study notes and visual explanations — written for clarity, reproducibility, and long-term memory.

Latest Updates

∇ Goodfellow Deep Learning Book 64 chapters

My notes on the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Chapter 20: Deep Generative Models Deep generative modeling spans energy-based models, directed latent-variable models, and implicit generators. This chapter surveys RBMs, DBNs/DBMs, VAEs, GANs, autoregressive models, and evaluation pitfalls.

Chapter 19: Approximate Inference Exact posterior inference is intractable in deep latent models, so we optimize the ELBO instead. EM, MAP inference, and mean-field variational updates provide scalable approximations.

Chapter 18: Confronting the Partition Function Energy-based models require a partition function for normalization. This chapter follows how \(\nabla_\theta \log Z(\theta)\) enters the log-likelihood gradient and surveys training strategies like contrastive divergence, pseudolikelihood, score matching, NCE, and AIS that avoid or estimate \(Z\).
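The key identity behind that gradient can be checked numerically. A minimal sketch on a toy model I made up for illustration (four discrete states, energy \(E_\theta(x) = \theta x\)): the derivative \(\frac{d}{d\theta}\log Z(\theta)\) equals \(-\mathbb{E}_{p_\theta}[x]\), the model expectation that contrastive divergence tries to approximate with samples.

```python
import numpy as np

# Tiny discrete energy-based model: p(x) ∝ exp(-theta * x) over x in {0, 1, 2, 3}.
# Checks d/dtheta log Z(theta) = -E_p[x] against a finite-difference derivative.
xs = np.arange(4.0)

def log_Z(theta):
    return np.log(np.sum(np.exp(-theta * xs)))

theta = 0.7
p = np.exp(-theta * xs)
p /= p.sum()                                  # normalized model distribution

model_expectation = -np.sum(p * xs)           # analytic gradient: -E_p[x]
eps = 1e-6
fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)  # numeric gradient
```

In a deep model the sum over states is intractable, which is exactly why the chapter's sampling-based estimators exist.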

Chapter 17: Monte Carlo Methods Monte Carlo estimation replaces intractable expectations with sample averages. Importance sampling reweights proposal draws to reduce variance, while Markov chain methods like Gibbs sampling generate dependent samples when direct sampling is infeasible. Tempering improves mixing across multimodal landscapes.
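The importance-sampling idea fits in a few lines. A minimal sketch with distributions I chose for illustration (target N(2, 1), proposal N(0, 3)): draw from the proposal, reweight by the density ratio, and a self-normalized average recovers the target mean.

```python
import numpy as np

# Importance sampling sketch: estimate E_p[x] for target p = N(2, 1)
# using draws from a wider proposal q = N(0, 3), reweighted by p(x)/q(x).
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(0.0, 3.0, size=n)             # samples from the proposal q

def log_p(x):                                # target log-density N(2, 1)
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x):                                # proposal log-density N(0, 3)
    return -0.5 * (x / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2 * np.pi)

w = np.exp(log_p(x) - log_q(x))              # importance weights
estimate = np.sum(w * x) / np.sum(w)         # self-normalized estimator
```

The variance of this estimator blows up when q places too little mass where p is large, which motivates the variance-reduction discussion in the chapter.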

📄 Papers in Deep Learning 6 notes

Paper reading notes that focus on key ideas, math intuition, and practical takeaways.

Generative Adversarial Nets A minimax game between generator and discriminator, with the optimal discriminator derivation and the global optimum condition where generated and real distributions match.
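The optimal-discriminator formula \(D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))\) is easy to visualize numerically. A minimal sketch on 1D Gaussians of my choosing: with a mismatched generator \(D^*\) varies with \(x\); when the generator matches the data distribution, \(D^*\) collapses to 1/2 everywhere, the global-optimum condition.

```python
import numpy as np

# Optimal GAN discriminator: D*(x) = p_data(x) / (p_data(x) + p_g(x)).
def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 201)
p_data = gaussian(x, 0.0, 1.0)

p_g = gaussian(x, 1.0, 1.0)                  # mismatched generator
d_star = p_data / (p_data + p_g)             # varies across x

p_g_matched = gaussian(x, 0.0, 1.0)          # generator matches the data
d_matched = p_data / (p_data + p_g_matched)  # constant 1/2
```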

Why TPU Is Fast for Dot Product (First-Gen TPU) An early-TPU perspective (CNN/RNN era): focus on the systolic-array MMU to understand why ASIC specialization can greatly outperform general-purpose processors on matrix multiply-accumulate (MAC) inference workloads.

Transformer: Attention Is All You Need Attention-only sequence modeling with positional encoding, scaled dot-product attention, and multi-head projections for parallel long-range dependency learning.
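The core operation is compact enough to sketch directly. A minimal single-head version (shapes and values here are illustrative): \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — single-head, no masking or batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs several such maps in parallel on learned projections and concatenates the results.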

Attention: The Origin of Transformer Learnable alignment scores and a dynamic context vector replace the fixed encoder bottleneck in seq2seq models.

RL 3 notes

Course notes on reinforcement learning, starting with David Silver’s foundational lecture series.

David Silver RL Course - Lecture 3: Planning by Dynamic Programming Dynamic programming in known MDPs: optimal substructure, iterative policy evaluation, policy iteration, value iteration, and the classical gridworld examples.
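Value iteration itself fits in a few lines. A minimal sketch on a 1x4 deterministic gridworld I made up for illustration (terminal state 3, reward -1 per step, \(\gamma = 1\)): iterate \(V_{k+1}(s) = \max_a [r + \gamma V_k(s')]\) until the values stop changing.

```python
import numpy as np

# Value iteration on a tiny deterministic 1x4 gridworld.
n_states, gamma = 4, 1.0

def step(s, a):                  # a = -1 (left) or +1 (right), walls clamp
    return min(max(s + a, 0), n_states - 1)

V = np.zeros(n_states)
for _ in range(100):
    V_new = V.copy()
    for s in range(n_states - 1):            # state 3 is terminal, stays 0
        V_new[s] = max(-1 + gamma * V[step(s, a)] for a in (-1, 1))
    if np.max(np.abs(V_new - V)) < 1e-10:    # converged to the fixed point
        break
    V = V_new
```

The converged values are the negated distances to the terminal state, matching the gridworld examples in the lecture.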

David Silver RL Course - Lecture 2: Markov Decision Process Markov property, transition matrices, Markov reward processes, return and discounting, Bellman equations, and the move from prediction to control in MDPs.

David Silver RL Course - Lecture 1: Introduction to Reinforcement Learning What makes RL different from supervised learning, the agent-environment loop, Markov state, policy/value/model, and the three core RL tradeoffs.

ML HW-SW Codesign 5 notes

Notes on efficient AI systems where compression, deployment, and specialized hardware have to be designed together.

Efficient AI Lecture 6: Quantization (Part II) Post-training quantization granularity, clipping and calibration, AdaRound, QAT with STE, and binary/ternary quantization methods for pushing bit-widths lower while keeping accuracy under control.

Efficient AI Lecture 5: Quantization (Part I) Why low-bit arithmetic saves energy, how numeric formats trade off range and precision, and how K-means and linear quantization connect compression to hardware-friendly integer compute.
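Linear (affine) quantization is a small amount of code. A minimal sketch with parameters of my choosing (min/max calibration, int8 range): map floats to integers with a scale and zero point, then dequantize and measure the reconstruction error.

```python
import numpy as np

# Linear quantization sketch: float -> int8 via scale and zero point.
def linear_quantize(w, n_bits=8):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)        # float per integer step
    zero_point = np.round(qmin - w.min() / scale)      # integer mapped to 0.0
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=1000).astype(np.float32)
q, s, z = linear_quantize(w)
max_err = np.max(np.abs(w - dequantize(q, s, z)))      # bounded by the scale
```

The hardware payoff is that the expensive inner loops run on the integer tensor q, with the scale folded in once at the end.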

Efficient AI Lecture 4: Pruning and Sparsity (Part II) Layer-wise pruning ratios, automatic pruning with AMC and NetAdapt, fine-tuning after pruning, and the hardware systems that turn sparsity into real speed and energy gains.

Efficient AI Lecture 3: Pruning and Sparsity (Part I) Why memory dominates energy, how pruning is formulated with an L0 constraint, the hardware tradeoff between unstructured and structured sparsity, and the main pruning criteria from magnitude to second-order and regression-based methods.
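Magnitude pruning, the simplest of those criteria, can be sketched directly (the function and sparsity target here are illustrative): zero out the fraction of weights with the smallest absolute value, a greedy stand-in for the L0-constrained objective.

```python
import numpy as np

# Magnitude pruning sketch: remove the smallest-|w| fraction of weights.
def magnitude_prune(w, sparsity):
    k = int(round(sparsity * w.size))        # number of weights to remove
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold             # keep only the large weights
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_pruned = magnitude_prune(w, sparsity=0.9)
achieved = 1.0 - np.count_nonzero(w_pruned) / w.size
```

This produces unstructured sparsity; the lecture's hardware tradeoff is that structured variants (pruning whole channels or blocks) are easier to accelerate.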

🧪 Theory-to-Repro 1 note

Low-level ML understanding and paper reproduction through derivations and code.

Linear Regression via Three Solvers Solve the same least-squares objective with pseudo-inverse, convex optimization, and SGD, then compare assumptions and scalability.
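Two of those solvers fit in a short sketch on synthetic data of my choosing (the convex-optimization route needs an external solver, so it is omitted here): the closed-form pseudo-inverse solution and plain SGD should agree on the same objective.

```python
import numpy as np

# Same least-squares problem, two solvers: pseudo-inverse vs. SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w_pinv = np.linalg.pinv(X) @ y               # closed form: w = X^+ y

w_sgd = np.zeros(3)
lr = 0.01
for epoch in range(200):
    for i in rng.permutation(len(y)):
        grad = (X[i] @ w_sgd - y[i]) * X[i]  # gradient of 0.5 * (x·w - y)^2
        w_sgd -= lr * grad
```

The pseudo-inverse is exact but costs a full matrix factorization; SGD trades precision for per-sample updates that scale to data that does not fit in memory.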

📐 Math 98 items

Probability, calculus, linear algebra, and optimization notes organized as one mathematical foundation.

MIT 6.041 Probability: Probability Models and Axioms Sample spaces, discrete and continuous models, probability axioms, uniform laws, and counting examples from the opening lecture of MIT 6.041.

Lecture 32: The Convolution Rule Polynomial convolution, cyclic convolution, the convolution theorem, and the 2D Laplacian connection behind filtering, Fourier diagonalization, and FFT acceleration.
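The convolution theorem is directly checkable in code. A minimal sketch on random vectors of my choosing: cyclic convolution computed term by term matches pointwise multiplication in the Fourier domain, which is the identity the FFT acceleration exploits.

```python
import numpy as np

# Convolution theorem: cyclic convolution in time = pointwise product in frequency.
rng = np.random.default_rng(0)
n = 8
a, b = rng.normal(size=n), rng.normal(size=n)

# Direct cyclic convolution: c[k] = sum_j a[j] * b[(k - j) mod n]
direct = np.array([sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)])

# FFT route: transform, multiply pointwise, invert
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
```

The direct sum costs O(n^2); the FFT route costs O(n log n), which is the whole point of diagonalizing convolution in the Fourier basis.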

Gilbert Strang’s Calculus: Six Functions, Six Rules, and Six Theorems A compact map of early calculus through six core function families, six derivative rules, and six foundational theorems linking computation to structure.

Lecture 31: Eigenvectors of Circulant Matrices, Fourier Matrix Roots-of-unity eigenvalues of the shift matrix, Fourier eigenvectors, why circulant matrices are diagonalized by the Fourier basis, and the link to cyclic convolution and FFT.