
ickma.dev — Notes on Deep Learning and Math

A growing collection of structured study notes and visual explanations — written for clarity, reproducibility, and long-term memory.

Latest Updates

∇ Deep Learning Book (61 chapters)

My notes on the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Chapter 17: Monte Carlo Methods Monte Carlo estimation replaces intractable expectations with sample averages. Importance sampling reweights proposal draws to reduce variance, while Markov chain methods like Gibbs sampling generate dependent samples when direct sampling is infeasible. Tempering improves mixing across multimodal landscapes.
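
A minimal NumPy sketch of importance sampling, estimating \(E_p[f(x)]\) from draws of a wider proposal \(q\); the Gaussian target, proposal, and test function here are arbitrary illustration choices, not anything specific from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: estimate E_p[f(x)] for p = N(0, 1) and f(x) = x**2,
# drawing from a wider proposal q = N(0, 2**2) and reweighting by p/q.
def log_p(x):                     # log-density of the target N(0, 1)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):                     # log-density of the proposal N(0, 4)
    return -0.5 * (x / 2)**2 - np.log(2) - 0.5 * np.log(2 * np.pi)

def f(x):
    return x**2

n = 100_000
x = rng.normal(0.0, 2.0, size=n)          # samples from the proposal q
w = np.exp(log_p(x) - log_q(x))           # importance weights p(x)/q(x)

estimate = np.mean(w * f(x))              # unbiased estimate of E_p[f] = 1
print(estimate)                           # close to 1.0
```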

Chapter 16: Structured Probabilistic Models for Deep Learning Structured probabilistic models use graphs to factorize high-dimensional distributions into tractable components by encoding conditional independence. Directed models represent causal relationships and support efficient ancestral sampling. Undirected models capture symmetric dependencies through clique potentials normalized by a partition function. Energy-based models express the potentials via the Boltzmann distribution. Deep learning embraces approximate inference with large latent variable models, prioritizing scalability through distributed representations over exact computation.
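
A toy sketch of ancestral sampling from a small directed model \(p(a)\,p(b\mid a)\,p(c\mid b)\) over binary variables; the probability tables below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy directed model over binary variables: p(a, b, c) = p(a) p(b | a) p(c | b).
# The probability tables are arbitrary, chosen only to illustrate the idea.
p_a = 0.3                                  # P(a = 1)
p_b_given_a = {0: 0.2, 1: 0.8}             # P(b = 1 | a)
p_c_given_b = {0: 0.1, 1: 0.7}             # P(c = 1 | b)

def ancestral_sample():
    # Sample each node after its parents, in topological order: a -> b -> c.
    a = rng.random() < p_a
    b = rng.random() < p_b_given_a[int(a)]
    c = rng.random() < p_c_given_b[int(b)]
    return int(a), int(b), int(c)

samples = [ancestral_sample() for _ in range(5)]
print(samples)
```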

Chapter 15: Representation Learning Greedy layer-wise pretraining learns meaningful representations through unsupervised learning of hierarchical features, providing better initialization than random weights. Transfer learning enables knowledge sharing across tasks by reusing learned representations—generic early-layer features transfer well while late layers adapt to task-specific patterns. Semi-supervised learning leverages both labeled and unlabeled data to discover disentangled causal factors that generate observations. Distributed representations enable exponential gains in capacity through shared features rather than symbolic codes.

Chapter 14: Autoencoders Autoencoders learn compressed representations by training encoder \(h=f(x)\) and decoder \(r=g(h)\) to reconstruct inputs through a bottleneck. Undercomplete autoencoders constrain \(\dim(h)<\dim(x)\) to learn meaningful features. Regularized variants include sparse autoencoders (L1 penalty as Laplace prior), denoising autoencoders (learn manifold structure from corrupted inputs), and contractive autoencoders (penalize Jacobian for local invariance). Applications: dimensionality reduction, semantic hashing, and manifold learning.
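
A bare-bones sketch of an undercomplete autoencoder: a linear encoder/decoder pair trained by full-batch gradient descent on random data. The sizes, learning rate, and data are arbitrary illustration choices; real autoencoders add nonlinearities, regularizers, and minibatch training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal linear undercomplete autoencoder trained with gradient descent.
d, k, n = 20, 5, 500                  # input dim, code dim (k < d), samples
X = rng.normal(size=(d, n))

W1 = 0.1 * rng.normal(size=(k, d))    # encoder: h = W1 x
W2 = 0.1 * rng.normal(size=(d, k))    # decoder: r = W2 h
lr = 0.01

for step in range(2000):
    H = W1 @ X                        # codes
    R = W2 @ H                        # reconstructions
    E = R - X                         # reconstruction error
    grad_W2 = 2 / n * E @ H.T         # gradient of mean squared error w.r.t. W2
    grad_W1 = 2 / n * W2.T @ E @ X.T  # gradient w.r.t. W1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print("final reconstruction MSE:", np.mean((W2 @ W1 @ X - X) ** 2))
```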

📐 MIT 18.06SC Linear Algebra (36 lectures)

My journey through MIT’s Linear Algebra course, focusing on building intuition and making connections between fundamental concepts.

Lecture 27: Positive Definite Matrices and Minima Connecting positive definite matrices to multivariable calculus and optimization: the Hessian matrix, second derivative tests, and the geometric interpretation of quadratic forms as ellipsoids.
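
A small NumPy check of the second-derivative test, using the quadratic \(f(x,y)=x^2+xy+y^2\) as a stand-in example: positive definiteness of the Hessian, verified by its eigenvalues or by a successful Cholesky factorization, certifies the minimum at the origin.

```python
import numpy as np

# Second-derivative test for f(x, y) = x**2 + x*y + y**2 at its critical
# point (0, 0): the Hessian is constant, and positive definiteness of the
# Hessian certifies a strict minimum.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # Hessian of f

# Test 1: all eigenvalues positive.
print(np.linalg.eigvalsh(H))        # [1. 3.] -> positive definite

# Test 2: Cholesky factorization succeeds only for positive definite matrices.
try:
    np.linalg.cholesky(H)
    print("Cholesky succeeded: H is positive definite, so (0, 0) is a minimum")
except np.linalg.LinAlgError:
    print("Cholesky failed: H is not positive definite")
```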

Lecture 26: Complex Matrices and Fast Fourier Transform Extending linear algebra to complex vectors: Hermitian matrices, unitary matrices, and the Fast Fourier Transform algorithm that reduces DFT complexity from O(N²) to O(N log N).
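
A compact radix-2 Cooley–Tukey FFT sketch (input length assumed to be a power of two), checked against NumPy's built-in FFT; each level halves the problem, which is where the O(N log N) count comes from.

```python
import numpy as np

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])                       # FFT of even-indexed samples
    odd = fft(x[1::2])                        # FFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.default_rng(0).normal(size=8)
print(np.allclose(fft(x), np.fft.fft(x)))     # True
```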

Lecture 28: Similar Matrices and Jordan Form When matrices share eigenvalues but differ in structure: similar matrices represent the same transformation in different bases, and Jordan form reveals the canonical structure when diagonalization fails.
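
A quick numerical check that similar matrices share eigenvalues, using random matrices (the random \(M\) is assumed invertible, which holds generically):

```python
import numpy as np

rng = np.random.default_rng(0)

# Similar matrices B = M^{-1} A M represent the same linear map in a
# different basis, so they share eigenvalues.
A = rng.normal(size=(4, 4))
M = rng.normal(size=(4, 4))           # assumed invertible (generic random matrix)
B = np.linalg.inv(M) @ A @ M

eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_B = np.sort_complex(np.linalg.eigvals(B))
print(np.allclose(eig_A, eig_B))      # True (up to floating-point error)
```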

Lecture 25: Symmetric Matrices and Positive Definiteness The beautiful structure of symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral decomposition, and the important concept of positive definiteness.
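
A short NumPy sketch of the spectral decomposition \(S = Q\Lambda Q^\top\) for a random symmetric matrix, checking the reconstruction, the orthonormality of the eigenvectors, and the eigenvalue-sign test for positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spectral theorem for a symmetric matrix: S = Q diag(lam) Q^T with real
# eigenvalues lam and orthonormal eigenvectors in the columns of Q.
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2                          # symmetrize a random matrix

lam, Q = np.linalg.eigh(S)                 # eigh: symmetric/Hermitian solver
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))   # spectral decomposition: True
print(np.allclose(Q.T @ Q, np.eye(5)))          # eigenvectors orthonormal: True
print(np.all(lam > 0))                          # positive definite iff all eigenvalues > 0
```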

📐 MIT 18.065: Linear Algebra Applications (13 lectures)

My notes from MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning—exploring how linear algebra powers modern applications.

Lecture 13: Randomized Matrix Multiplication Randomized matrix multiplication approximates \(AB\) by sampling rank-1 outer products. The estimator remains unbiased, and its variance is minimized by sampling column/row indices with probability proportional to \(\|a_j\|\,\|b_j\|\). This yields a low-rank approximation whose error decays as \(1/\sqrt{s}\) in the number of samples \(s\).
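
A minimal sketch of the sampling scheme: draw \(s\) column/row indices with probability proportional to \(\|a_j\|\,\|b_j\|\), average the rescaled rank-1 outer products, and compare against the exact product (the matrix sizes and \(s\) are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate AB by sampling s rank-1 outer products a_j b_j^T with
# probabilities proportional to ||a_j|| * ||b_j||, which keeps the
# estimator unbiased while minimizing its variance.
m, n, p, s = 200, 300, 150, 50
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p))

col_norms = np.linalg.norm(A, axis=0)      # ||a_j|| for each column of A
row_norms = np.linalg.norm(B, axis=1)      # ||b_j|| for each row of B
probs = col_norms * row_norms
probs /= probs.sum()

idx = rng.choice(n, size=s, p=probs)       # sampled column/row indices
approx = sum(np.outer(A[:, j], B[j, :]) / (s * probs[j]) for j in idx)

rel_err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with s={s} samples: {rel_err:.3f}")
```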

Lecture 12: Computing Eigenvalues and Eigenvectors Eigenvalues are defined by \(\det(A-\lambda I)=0\) but computed by stable similarity transformations. QR iteration transforms matrices toward triangular form where eigenvalues appear on the diagonal; shifted QR with Wilkinson shifts accelerates convergence. Reduction to Hessenberg (or tridiagonal for symmetric matrices) enables efficient iteration. For large sparse problems, Krylov subspace methods (Arnoldi/Lanczos) approximate eigenvalues using only matrix-vector products.
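
A stripped-down sketch of unshifted QR iteration on a random symmetric matrix: each step is a similarity transform, so eigenvalues are preserved and accumulate on the diagonal. Production solvers add the Hessenberg/tridiagonal reduction and Wilkinson shifts mentioned above; this sketch omits both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unshifted QR iteration: factor Ak = Q R, then set Ak = R Q = Q^T Ak Q.
# For a symmetric matrix the iterates approach a diagonal matrix whose
# entries are the eigenvalues.
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2                      # symmetric for clean convergence

Ak = A.copy()
for _ in range(200):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q                         # similarity transform preserves eigenvalues

print(np.sort(np.diag(Ak)))            # approximate eigenvalues
print(np.sort(np.linalg.eigvalsh(A)))  # reference values from LAPACK
```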

Lecture 11: Minimize ||x|| subject to Ax=b Three norms produce different solution geometries: L1 promotes sparsity (diamond touches constraint at corner), L2 yields smooth isotropic solutions (circle tangency), and L∞ balances components (square equalization). QR factorization via Gram–Schmidt computes minimum-norm solutions; column pivoting improves numerical stability. Krylov subspace methods solve large sparse systems via matrix–vector products.
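
A sketch of the L2 case on a random underdetermined system: the minimum-norm solution comes from the pseudo-inverse, or equivalently from a QR factorization of \(A^\top\) (computed here with NumPy's QR rather than hand-rolled Gram–Schmidt).

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimum L2-norm solution of an underdetermined system Ax = b: among all
# solutions, x* = A^T (A A^T)^{-1} b = pinv(A) b has the smallest ||x||_2.
m, n = 3, 8                                # fewer equations than unknowns
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

x_pinv = np.linalg.pinv(A) @ b             # minimum-norm solution via pseudo-inverse

# Equivalent route via QR of A^T: with A^T = QR, x* = Q R^{-T} b.
Q, R = np.linalg.qr(A.T)
x_qr = Q @ np.linalg.solve(R.T, b)

print(np.allclose(A @ x_pinv, b), np.allclose(x_pinv, x_qr))
print("||x*|| =", np.linalg.norm(x_pinv))
```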

Lecture 10: Survey of Difficulties of Ax=b Comprehensive survey of challenges in solving \(Ax=b\): overdetermined systems require least squares, underdetermined systems select minimum-norm solutions via \(A^+b\), ill-conditioned problems add a ridge penalty \(\min \|Ax-b\|^2+\delta^2\|x\|^2\), large systems use Krylov methods, and massive-scale problems employ randomized linear algebra. The Moore–Penrose pseudo-inverse unifies all regimes: the exact inverse when \(A\) is square and invertible, the least-squares solution when overdetermined, the minimum-norm solution when underdetermined, and the zero-regularization limit of ridge regression. Deep learning operates in the underdetermined regime, where gradient descent implicitly biases toward low-norm solutions.
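
A small numerical sketch of the unification: the ridge solution \((A^\top A+\delta^2 I)^{-1}A^\top b\) approaches \(A^{+}b\) as \(\delta\to 0\), and for this overdetermined random example \(A^{+}b\) coincides with the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ridge solutions converge to the pseudo-inverse solution as delta -> 0,
# which here (overdetermined A) is also the least-squares solution.
A = rng.normal(size=(10, 4))               # overdetermined example
b = rng.normal(size=10)

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

for delta in [1.0, 1e-2, 1e-4, 1e-6]:
    x_ridge = np.linalg.solve(A.T @ A + delta**2 * np.eye(4), A.T @ b)
    print(f"delta={delta:.0e}  ||x_ridge - x_pinv|| = "
          f"{np.linalg.norm(x_ridge - x_pinv):.2e}")

print(np.allclose(x_pinv, x_lstsq))        # True: both give the least-squares solution
```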

📐 Stanford EE 364A: Convex Optimization (11 lectures)

My notes from Stanford EE 364A: Convex Optimization—theory and applications of optimization problems.

Chapter 4.4: Quadratic Optimization Problems Quadratic programs minimize convex quadratic objectives with affine constraints; QCQPs allow quadratic inequalities. SOCPs encode norm constraints and unify LP/QP/QCQP formulations. Robust LP handles uncertainty using ellipsoidal sets or Gaussian chance constraints, producing SOCP reformulations.
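
A minimal sketch of the simplest case, an equality-constrained QP, solved directly from its linear KKT system; the data are small arbitrary numbers, and inequality-constrained QPs or QCQPs need a proper solver.

```python
import numpy as np

# Equality-constrained QP: minimize (1/2) x^T P x + q^T x  subject to  Ax = b.
# With P positive definite, the KKT conditions are a linear system:
#   [P  A^T] [x ]   [-q]
#   [A   0 ] [nu] = [ b]
P = np.array([[4.0, 1.0],
              [1.0, 2.0]])                 # positive definite
q = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])                 # single equality constraint x1 + x2 = 1
b = np.array([1.0])

n, m = P.shape[0], A.shape[0]
KKT = np.block([[P, A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

print("x* =", x, " dual nu* =", nu)
print("constraint Ax:", A @ x)             # equals b = [1.]
```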

Chapter 4.3: Linear Optimization Linear programs (LP) minimize linear objectives subject to affine constraints. Geometric interpretation: level sets orthogonal to the objective vector \(c\) touch the polyhedral feasible set. Applications: diet problem (minimize cost while meeting nutritional requirements), piecewise-linear minimization (epigraph formulation \(\min t\) s.t. \(a_i^\top x+b_i\le t\)), Chebyshev center (largest inscribed ball), linear-fractional programs (convert to LP via \(y=xz\), \(z=1/(e^\top x+f)\)). Von Neumann growth model: maximize the minimum sectoral growth rate using quasiconvex bisection.
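
A sketch of the Chebyshev-center LP on a small arbitrary polyhedron, solved with scipy.optimize.linprog: maximize \(r\) subject to \(a_i^\top x + r\|a_i\|_2 \le b_i\).

```python
import numpy as np
from scipy.optimize import linprog

# Chebyshev center of the polyhedron {x : a_i^T x <= b_i}: the largest
# inscribed ball solves  maximize r  s.t.  a_i^T x + r ||a_i||_2 <= b_i.
# The polyhedron below is the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1.
A = np.array([[-1.0, 0.0],     # -x1        <= 0
              [0.0, -1.0],     #       -x2  <= 0
              [1.0, 1.0]])     #  x1 + x2   <= 1
b = np.array([0.0, 0.0, 1.0])

norms = np.linalg.norm(A, axis=1)
# Decision variables z = (x1, x2, r); linprog minimizes, so use -r as objective.
c = np.array([0.0, 0.0, -1.0])
A_ub = np.hstack([A, norms[:, None]])
res = linprog(c, A_ub=A_ub, b_ub=b,
              bounds=[(None, None), (None, None), (0, None)])

print("center:", res.x[:2], " radius:", res.x[2])
```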

Chapter 4.2: Convex Optimization Convex problems in standard form (convex objective/inequalities, affine equalities), local vs global optimality, first-order conditions, equivalent reformulations (change of variables, slack variables), and quasiconvex problems via bisection.
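
A sketch of quasiconvex minimization by bisection, illustrated on a linear-fractional objective over a small arbitrary polyhedron: each level-\(t\) sublevel set is a halfspace, so every bisection step reduces to an LP feasibility problem (solved here with scipy.optimize.linprog).

```python
import numpy as np
from scipy.optimize import linprog

# Minimize (c^T x + d) / (e^T x + f) over {Ax <= b}, assuming e^T x + f > 0
# on the feasible set. For a fixed level t, the sublevel set is the
# halfspace (c - t e)^T x <= t f - d. All numbers are arbitrary test data.
c, d = np.array([1.0, 2.0]), 1.0
e, f = np.array([1.0, 1.0]), 2.0
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])   # x >= 0, x1 + x2 <= 1
b = np.array([0.0, 0.0, 1.0])

def feasible(t):
    # LP feasibility check: is there x with Ax <= b and (c - t e)^T x <= t f - d?
    A_ub = np.vstack([A, (c - t * e)[None, :]])
    b_ub = np.concatenate([b, [t * f - d]])
    res = linprog(np.zeros(2), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 2)
    return res.status == 0                              # 0 = feasible point found

lo, hi = 0.0, 10.0                                      # bracket on the optimal value
for _ in range(50):
    mid = (lo + hi) / 2
    if feasible(mid):
        hi = mid                                        # optimal value <= mid
    else:
        lo = mid                                        # optimal value > mid

print("optimal value is approximately", hi)
```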

Chapter 4.1: Optimization Problems Basic terminology (decision variables, objective, constraints, domain, feasibility, optimal values, local optimality), standard form conversion, equivalent problems (change of variables, slack variables, constraint elimination, epigraph form), and parameter vs oracle problem descriptions.