Deep Learning Book
All Chapters
My notes and implementations while studying the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Chapter 6: Deep Feedforward Networks
Chapter 6.1: XOR Problem & ReLU Networks How ReLU solves problems that linear models cannot handle. (A hand-built XOR forward pass is sketched after this chapter's entries.)
Chapter 6.2: Likelihood-Based Loss Functions The mathematical connection between probabilistic models and loss functions.
Chapter 6.3: Hidden Units and Activation Functions Exploring activation functions and their impact on neural network learning.
Chapter 6.4: Architecture Design - Depth vs Width How depth enables hierarchical feature reuse and exponential expressiveness.
Chapter 6.5: Back-Propagation and Other Differentiation Algorithms The algorithm that makes training deep networks computationally feasible.
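To make 6.1 concrete, here is a minimal NumPy sketch of a hand-built ReLU network that computes XOR, using the well-known textbook construction W = [[1,1],[1,1]], c = [0,-1], w = [1,-2], b = 0; nothing is trained, and the NumPy layout is my own illustrative choice.

```python
import numpy as np

# Hand-built XOR network from 6.1: f(x) = w^T max(0, x^T W + c) + b.
# The weights are fixed by hand; nothing is learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all four XOR inputs
W = np.array([[1, 1], [1, 1]])                  # hidden-layer weights
c = np.array([0, -1])                           # hidden-layer biases
w = np.array([1, -2])                           # output weights
b = 0                                           # output bias

h = np.maximum(0, X @ W + c)  # ReLU hidden layer bends the space so XOR becomes separable
y = h @ w + b
print(y)                      # -> [0 1 1 0], the XOR truth table
```

The hidden layer maps (0,1) and (1,0) to the same point (1,0), after which a single linear output unit can separate the positive and negative cases.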
Chapter 7: Regularization for Deep Learning
Chapter 7 Prerequisites: Hessian Matrix, Definiteness, and Curvature Essential second-order calculus concepts needed before Chapter 7.
Chapter 7.1.1: L2 Regularization How L2 regularization shrinks weights based on Hessian eigenvalues.
Chapter 7.1.2: L1 Regularization L1 regularization uses soft thresholding to create sparse solutions.
Chapter 7.2: Constrained Optimization View of Regularization Regularization as constrained optimization with KKT conditions.
Chapter 7.3: Regularization and Under-Constrained Problems Why regularization is mathematically necessary for under-determined problems and how it makes otherwise singular matrices (e.g., \(X^\top X\)) invertible.
Chapter 7.4: Dataset Augmentation How transforming existing data improves generalization.
Chapter 7.5: Noise Robustness How adding Gaussian noise to weights is equivalent to penalizing large gradients.
Chapter 7.6: Semi-Supervised Learning Leveraging unlabeled data to improve model performance when labeled data is scarce.
Chapter 7.7: Multi-Task Learning Training a single model on multiple related tasks to improve generalization.
Chapter 7.8: Early Stopping Early stopping as implicit L2 regularization: fewer training steps correspond to stronger regularization.
Chapter 7.9: Parameter Tying and Parameter Sharing Two strategies for reducing parameters: encouraging similarity vs. enforcing identity.
Chapter 7.10: Sparse Representations Regularizing learned representations rather than model parameters: the difference between parameter sparsity and representation sparsity.
Chapter 7.11: Bagging and Other Ensemble Methods How training multiple models on bootstrap samples and averaging their predictions reduces variance.
Chapter 7.12: Dropout Dropout as a computationally efficient alternative to bagging: training an ensemble of subnetworks by randomly dropping units. (An inverted-dropout sketch follows this chapter's entries.)
Chapter 7.13: Adversarial Training How training on adversarial examples improves model robustness by reducing sensitivity to imperceptible perturbations.
Chapter 7.14: Tangent Distance, Tangent Prop and Manifold Tangent Classifier Enforcing invariance along manifold tangent directions to regularize models against meaningful transformations.
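As a companion to 7.12, a minimal NumPy sketch of inverted dropout; the keep probability of 0.8, the toy activations, and the seed are arbitrary choices, not anything prescribed by the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.8, train=True):
    """Inverted dropout: zero units with probability 1 - keep_prob and rescale
    survivors by 1/keep_prob so the expected activation is unchanged."""
    if not train:
        return h                      # at test time the full network is used as-is
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob       # each forward pass samples a different subnetwork

h = np.ones((2, 5))                   # toy hidden activations
print(dropout(h))                     # roughly 20% of entries zeroed, survivors scaled to 1.25
print(dropout(h, train=False))        # unchanged at inference
```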
Chapter 8: Optimization for Training Deep Models
Chapter 8.1: How Learning Differs from Pure Optimization Why machine learning optimization is fundamentally different from pure optimization and why mini-batch methods work.
Chapter 8.2: Challenges in Deep Learning Optimization Understanding why deep learning optimization is hard: ill-conditioning, local minima, saddle points, exploding gradients, and the theoretical limits of optimization.
Chapter 8.3: Basic Algorithms SGD, momentum, and Nesterov momentum: the foundational algorithms for training deep neural networks. (A momentum-update sketch follows this chapter's entries.)
Chapter 8.4: Parameter Initialization Strategies Why initialization matters: breaking symmetry, avoiding null spaces, and finding the right balance for convergence and generalization.
Chapter 8.5: Algorithms with Adaptive Learning Rates From AdaGrad to Adam: how adaptive learning rates automatically tune optimization for each parameter.
Chapter 8.6: Second-Order Optimization Methods Newton’s method, Conjugate Gradient, and BFGS: elegant methods using curvature information, but rarely used in deep learning due to computational cost.
Chapter 8.7: Optimization Strategies and Meta-Algorithms Advanced optimization strategies: batch normalization, coordinate descent, Polyak averaging, supervised pretraining, and continuation methods that enhance training efficiency and stability.
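To accompany 8.3, a sketch of the classical momentum update on a deliberately ill-conditioned toy quadratic; the matrix, learning rate, and momentum coefficient are hand-picked for illustration, not values from the book.

```python
import numpy as np

# Classical momentum (8.3): v <- alpha * v - lr * grad;  theta <- theta + v.
# Toy quadratic objective f(theta) = 0.5 * theta^T A theta with an ill-conditioned A.
A = np.diag([1.0, 100.0])          # condition number 100 makes plain SGD oscillate
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
lr, alpha = 0.009, 0.9             # step size and momentum coefficient (hand-picked)

for _ in range(200):
    grad = A @ theta               # gradient of the quadratic
    v = alpha * v - lr * grad      # velocity: exponentially decaying average of gradients
    theta = theta + v

print(theta)                       # close to the minimum at the origin
```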
Chapter 9: Convolutional Networks
Chapter 9.1: Convolution Computation The mathematical foundation of CNNs: from continuous convolution to discrete 2D operations. Understand why deep learning uses cross-correlation (not true convolution), and how parameter sharing and translation equivariance make CNNs powerful for spatial data. (A minimal cross-correlation sketch follows this chapter's entries.)
Chapter 9.2: Motivation for Convolutional Networks Why CNNs dominate computer vision: sparse interactions reduce parameters from O(m·n) to O(k·n), parameter sharing drastically cuts storage, and translation equivariance ensures patterns are detected anywhere—achieving 30,000× speedup over dense layers.
Chapter 9.3: Pooling Downsampling through local aggregation: max pooling provides translation invariance by selecting strongest activations, reducing 28×28 feature maps to 14×14 (4× fewer activations). Comparing three architectures—strided convolutions, max pooling networks, and global average pooling that eliminates fully connected layers.
Chapter 9.4: Convolution and Pooling as an Infinitely Strong Prior Why CNNs work on images but not everywhere: architectural constraints (local connectivity + weight sharing) act as infinitely strong Bayesian priors, assigning probability 1 to translation-equivariant functions and 0 to all others. The bias-variance trade-off—strong priors reduce sample complexity but only when assumptions match the data structure.
Chapter 9.5: Convolutional Functions Mathematical details of convolution operations: deep learning uses cross-correlation (not true convolution), multi-channel formula \(Z_{l,x,y} = \sum_{i,j,k} V_{i,x+j-1,y+k-1} K_{l,i,j,k}\), stride for downsampling, three padding strategies (valid/same/full), and gradient computation—kernel gradients via correlation with input, input gradients via convolution with flipped kernel.
Chapter 9.6: Structured Outputs CNNs can generate high-dimensional structured objects through pixel-level predictions. Preserving spatial dimensions (no pooling, no stride > 1, SAME padding) enables full-resolution outputs. Recurrent convolution refines predictions iteratively: \(H^{(t)} = U * X + W * H^{(t-1)}\), producing dense predictions for segmentation, depth estimation, and flow prediction.
Chapter 9.7: Data Types CNNs can operate on different data types: 1D (audio, time series), 2D (images), and 3D (videos, CT scans) with varying channel counts. Unlike fully connected networks, convolutional kernels handle variable-sized inputs by sliding across spatial dimensions, producing outputs that scale accordingly—a unique flexibility for diverse domains.
Chapter 9.8: Efficient Convolution Algorithms Separable convolution reduces computational cost from \(O(HWk^2)\) to \(O(HWk)\) by decomposing a 2D kernel into two 1D filters (vertical and horizontal). Parameter storage shrinks from \(k^2\) to \(2k\). This factorization enables faster, more memory-efficient models without sacrificing accuracy—foundational for architectures like MobileNet.
Chapter 9.9: Unsupervised or Semi-Supervised Feature Learning Before CNNs, computer vision relied on hand-crafted kernels (Sobel, Laplacian, Gaussian) and unsupervised methods (sparse coding, autoencoders, k-means). While these captured simple patterns, they couldn’t match CNNs’ hierarchical, end-to-end feature learning. Modern systems use CNNs to learn features from edges to semantic concepts—making hand-crafted filters largely obsolete.
Chapter 9.10: Neuroscientific Basis for Convolutional Networks V1 simple cells detect oriented edges (modeled by Gabor filters), complex cells pool over simple cells for translation invariance (like CNN pooling). But CNNs lack key biological features: saccadic attention, multisensory integration, top-down feedback, and dynamic receptive fields. While CNNs excel at feed-forward recognition, biological vision is holistic, context-aware, and adaptive.
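For 9.1 and 9.5, a naive single-channel "valid" cross-correlation in NumPy, i.e., what deep learning frameworks call convolution; the loop-based layout and the toy input and kernel are my own, and real frameworks use far more efficient implementations.

```python
import numpy as np

def cross_correlate2d(V, K):
    """'Valid' 2D cross-correlation:
    Z[x, y] = sum_{j,k} V[x + j, y + k] * K[j, k]  (the kernel is NOT flipped)."""
    H, W = V.shape
    kh, kw = K.shape
    Z = np.zeros((H - kh + 1, W - kw + 1))
    for x in range(Z.shape[0]):
        for y in range(Z.shape[1]):
            Z[x, y] = np.sum(V[x:x + kh, y:y + kw] * K)
    return Z

V = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 single-channel input
K = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
print(cross_correlate2d(V, K))                 # 3x3 output ("valid" padding)
```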
Chapter 10: Sequence Modeling
Chapter 10.1: Unfold Computation Graph Unfolding computation graphs in RNNs enables parameter sharing across time steps. The same function with fixed parameters processes sequences of any length, compressing input history into fixed-size hidden states that retain only task-relevant information for predictions.
Chapter 10.2: Recurrent Neural Networks RNN architecture with hidden-to-hidden connections, teacher forcing for parallel training, back-propagation through time (BPTT), RNN as directed graphical models with O(τ) parameter efficiency, and context-based sequence-to-sequence models.
Chapter 10.3: Bidirectional RNN Bidirectional RNNs process sequences in both forward and backward directions, allowing predictions to use information from the entire input sequence. Essential for tasks like speech recognition and handwriting recognition where future context matters.
Chapter 10.4: Encoder-Decoder Sequence-to-Sequence Architecture The seq2seq architecture handles variable-length input and output sequences by compressing the input into a fixed context vector C, then decoding it step-by-step. This enables machine translation, summarization, and dialogue generation where input and output lengths differ.
Chapter 10.5: Deep Recurrent Networks Three architectural patterns for adding depth to RNNs: hierarchical hidden states (vertical stacking), deep transition RNNs (an MLP replaces the hidden-to-hidden transition), and deep transitions with skip connections (residual paths for gradient flow).
Chapter 10.6: Recursive Neural Network Recursive neural networks compute over tree structures rather than linear chains, applying shared composition functions at internal nodes to build hierarchical representations bottom-up. This reduces computation depth from O(τ) to O(log τ), but requires the tree structure to be specified externally.
Chapter 10.7: The Challenge of Long-Term Dependencies The fundamental challenge of long-term dependencies in RNNs is training difficulty: gradients propagated across many time steps either vanish exponentially (common) or explode (rare but severe). Eigenvalue analysis shows how powers of the transition matrix govern this instability.
Chapter 10.8: Echo State Networks ESNs fix recurrent weights and train only output weights, viewing the network as a dynamical reservoir. Setting the spectral radius near one enables long-term memory retention. Learning reduces to linear regression on hidden states, avoiding backpropagation through time—showing that carefully designed dynamics can capture temporal structure.
Chapter 10.9: Leaky Units and Multiple Time Scales Leaky units separate instantaneous state from long-term integration using \(u^t = \alpha u^{t-1}+(1-\alpha)v^t\). Multiple time scale strategies include temporal skip connections (direct pathways across time steps) and removing short connections (forcing coarser time scales) to address long-term dependencies.
Chapter 10.10: LSTM and GRU LSTM uses learned gates (forget, input, output) to control information flow through explicit cell state paths, enabling adaptive long-term memory retention. GRU simplifies this design by merging forget and input into a single update gate, reducing parameters while maintaining the ability to capture long-range dependencies through gating mechanisms.
Chapter 10.11: Optimizing Long-Term Dependencies Gradient clipping prevents training instability by rescaling gradients when their norm exceeds a threshold, protecting against sudden jumps across steep gradient cliffs. Regularizing information flow aims to maintain \(\|\partial h^t/\partial h^{t-1}\| \approx 1\), though this is rarely used in practice due to computational cost. (A gradient-clipping sketch follows this chapter's entries.)
Chapter 10.12: Explicit Memory Explicit memory separates storage from computation by introducing addressable memory outside network parameters. Classic architectures (Memory Networks, Neural Turing Machines) are computationally expensive but inspired modern mechanisms—attention, Transformers, and retrieval-augmented models are successful, scalable realizations of this idea.
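For 10.11, a sketch of gradient clipping by global norm; the threshold and the toy gradient tensors are arbitrary examples.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Gradient clipping (10.11): if the global norm of all gradients exceeds
    the threshold, rescale them to have norm exactly `threshold`,
    preserving the update direction."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

# Toy gradients for two parameter tensors; the values are arbitrary.
grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, threshold=5.0)
print(clipped)   # same direction, global norm rescaled to 5
```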
Chapter 11: Practical Methodology
Chapter 11: Practical Methodology Define performance metrics aligned with application goals. Build baseline systems quickly using architecture patterns matched to data structure. Diagnose by comparing training and test error. Tune learning rate first, then other hyperparameters via random search. Debug systematically by isolating components, checking gradients, and monitoring numerical behavior.
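A sketch of the random-search step described above, sampling the learning rate log-uniformly because its scale matters more than its exact value; the particular ranges and hyperparameter names are illustrative assumptions, not recommendations from the book.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_config():
    """One random-search draw: learning rate log-uniform, other hyperparameters
    from plausible (hand-chosen) ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # between 1e-5 and 1e-1
        "momentum": rng.uniform(0.5, 0.99),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }

# Draw a handful of configurations; each would be trained and scored on validation data.
for cfg in (sample_config() for _ in range(5)):
    print(cfg)
```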
Chapter 12: Applications
Chapter 12.1: Large-Scale Deep Learning Scaling deep learning requires specialized hardware (GPUs, TPUs, ASICs), distributed training strategies (data/model parallelism, asynchronous SGD), and efficiency optimizations (model compression, quantization, pruning). Dynamic computation enables conditional execution for computational efficiency. Specialized accelerators exploit reduced precision and massive parallelism for deployment on resource-constrained devices.
Chapter 12.2: Image Preprocessing and Normalization Preprocessing images through normalization (scaling to [0,1] or [-1,1]) and data augmentation improves training stability and generalization. Global Contrast Normalization (GCN) removes global lighting variations by centering and L2-normalizing images. Local Contrast Normalization (LCN) enhances local structures by normalizing within spatial neighborhoods. Modern networks rely on batch normalization, but explicit contrast normalization remains valuable for challenging datasets. (A GCN sketch follows this chapter's entries.)
Chapter 12.3: Automatic Speech Recognition ASR evolution from GMM-HMM (classical statistical approach) through DNN-HMM (~30% error reduction with deep feedforward networks) to end-to-end systems using RNNs/LSTMs with CTC. CNNs treat spectrograms as 2D structures for frequency-invariant modeling. Modern systems learn direct acoustic-to-text mappings without forced alignment, integrating joint acoustic-phonetic modeling and hierarchical representations.
Chapter 12.4: NLP Applications N-gram models compute conditional probabilities over fixed contexts but suffer from sparsity and exponential growth. Neural language models use word embeddings to map discrete tokens into continuous space, enabling generalization across semantically similar words. High-dimensional vocabulary outputs require optimization: short lists partition frequent/rare words, hierarchical softmax reduces complexity to O(log|V|), and importance sampling approximates gradients. Attention mechanisms dynamically focus on relevant input positions, forming weighted context vectors that alleviate fixed-size representation bottlenecks in seq2seq tasks.
Chapter 12.5: Other Applications Collaborative filtering uses matrix factorization to learn latent user and item embeddings, decomposing ratings into user bias, item bias, and personalized interaction. Cold-start problems require side information. Recommendation systems face exploration-exploitation tradeoffs modeled as contextual bandits. Knowledge graphs represent facts as (subject, relation, object) triples; deep learning maps entities and relations to continuous embeddings for link prediction and reasoning. Evaluation challenges arise from open-world assumptions where unseen facts may be missing rather than false.
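For 12.2, a sketch of Global Contrast Normalization along the lines of the book's formula: per-image mean subtraction, then division by a regularized contrast. The constants s, lam, and eps, the channel layout, and the random toy batch are placeholder choices.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=10.0, eps=1e-8):
    """GCN (12.2): subtract each image's mean, then rescale so its contrast
    (root mean squared deviation) is roughly s. lam regularizes near-constant
    images; eps avoids division by zero. X has shape (batch, H, W, channels)."""
    X = X.astype(float)
    mean = X.mean(axis=(1, 2, 3), keepdims=True)             # per-image mean
    X = X - mean
    contrast = np.sqrt(lam + (X ** 2).mean(axis=(1, 2, 3), keepdims=True))
    return s * X / np.maximum(contrast, eps)

batch = np.random.default_rng(0).integers(0, 256, size=(2, 32, 32, 3))
print(global_contrast_normalize(batch).std(axis=(1, 2, 3)))  # per-image contrasts now comparable
```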
Chapter 13: Unsupervised Learning
Chapter 13: Linear Factor Models Linear factor models decompose observed data into latent factors: \(x = Wh + b + \text{noise}\). PCA uses Gaussian priors for dimensionality reduction. ICA recovers statistically independent non-Gaussian sources for signal separation. SFA learns slowly-varying features via temporal coherence. Sparse coding enforces L1 sparsity for interpretable representations. These models form the foundation for modern unsupervised learning and generative models.
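A quick NumPy check of the linear-factor-model picture: data generated as x = Wh + b + noise with two latent factors is handed to PCA (via SVD of the centered data), whose leading components recover the factor subspace. The dimensions, noise level, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate data from a linear factor model x = W h + b + noise (Chapter 13),
# with 2 latent factors embedded in 5 observed dimensions.
n, d, k = 1000, 5, 2
W = rng.normal(size=(d, k))            # true factor loadings (normally unknown)
b = rng.normal(size=d)
H = rng.normal(size=(n, k))            # latent factors h ~ N(0, I)
X = H @ W.T + b + 0.05 * rng.normal(size=(n, d))   # small isotropic noise

# PCA via SVD of the centered data: the top right singular vectors span
# (approximately) the same subspace as the columns of W.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(S ** 2 / n)                      # two large eigenvalues (factors), three near the noise floor
```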
Chapter 14: Autoencoders Autoencoders learn compressed representations by training encoder \(h=f(x)\) and decoder \(r=g(h)\) to reconstruct inputs through a bottleneck. Undercomplete autoencoders constrain \(\dim(h)<\dim(x)\) to learn meaningful features—linear versions recover PCA subspace. Regularized variants include sparse autoencoders (L1 penalty interpreted as Laplace prior on latent codes), denoising autoencoders (learn manifold structure by reconstructing from corrupted inputs \(\tilde{x}\)), and contractive autoencoders (penalize Jacobian \(\|\partial f/\partial x\|_F^2\) to encourage local invariance). Two competing forces—reconstruction accuracy vs regularization—drive autoencoders to become sensitive along data manifolds while contracting orthogonally. Applications include dimensionality reduction, semantic hashing for fast retrieval, and manifold learning via parametric coordinate systems.
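A minimal undercomplete linear autoencoder trained with hand-derived gradients in NumPy; the dimensions, learning rate, step count, and toy low-rank data are arbitrary, and a real implementation would use an autodiff framework and a nonlinear encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Undercomplete linear autoencoder (Chapter 14): encoder h = x W1, decoder r = h W2,
# with a bottleneck of k = 3 units. Trained by gradient descent on mean squared
# reconstruction error; the data is constructed to lie in a 3-D subspace.
n, d, k = 500, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
W1 = 0.1 * rng.normal(size=(d, k))
W2 = 0.1 * rng.normal(size=(k, d))
lr = 0.02

def loss():
    return np.mean((X - X @ W1 @ W2) ** 2)

print("before:", loss())
for _ in range(3000):
    H = X @ W1                        # encode
    R = H @ W2                        # decode
    dR = -2.0 * (X - R) / X.size      # gradient of mean squared error w.r.t. R
    dW2 = H.T @ dR                    # backprop through the decoder
    dW1 = X.T @ (dR @ W2.T)           # backprop through the encoder
    W1 -= lr * dW1
    W2 -= lr * dW2
print("after:", loss())               # reconstruction error drops by orders of magnitude
```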
Chapter 15: Representation Learning Greedy layer-wise pretraining learns meaningful representations through unsupervised learning of hierarchical features, providing better initialization than random weights. Transfer learning enables knowledge sharing across tasks by reusing learned representations—generic early-layer features transfer well while late layers adapt to task-specific patterns. Semi-supervised learning leverages both labeled and unlabeled data to discover disentangled causal factors that generate observations. Distributed representations enable exponential gains in capacity through shared features rather than symbolic codes. Depth provides exponential advantages via compositional hierarchies that match natural data structure. Multiple inductive biases (smoothness, sparsity, temporal coherence, manifolds) guide networks toward discovering meaningful underlying causes.
Chapter 16: Structured Probabilistic Models for Deep Learning Structured probabilistic models use graphs to factorize high-dimensional distributions into tractable components by encoding conditional independence. Directed models (Bayesian networks) represent causal relationships via \(p(x)=\prod_i p(x_i|Pa(x_i))\) and support efficient ancestral sampling. Undirected models (Markov random fields) capture symmetric dependencies through clique potentials \(\tilde{p}(x)=\prod_C \phi_C(x_C)\), normalized by partition function Z. Energy-based models express potentials as \(\exp(-E(x))\) via Boltzmann distribution. Converting between directed and undirected models requires moralization or triangulation, typically adding edges and losing independence. Factor graphs explicitly represent factorization structure for message-passing inference. Deep learning embraces approximate inference with large latent variable models, prioritizing scalability over exact computations through distributed representations and parameter sharing.
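To illustrate ancestral sampling in a directed model, a toy chain p(a, b, c) = p(a) p(b|a) p(c|b) with made-up conditional probability tables; sampling follows the topological order a, then b, then c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from a toy directed model p(a, b, c) = p(a) p(b|a) p(c|b):
# each variable is sampled after its parents, in topological order.
def sample_joint():
    a = rng.random() < 0.5                      # p(a=1) = 0.5
    b = rng.random() < (0.9 if a else 0.2)      # p(b=1|a): made-up conditional table
    c = rng.random() < (0.7 if b else 0.1)      # p(c=1|b): made-up conditional table
    return a, b, c

samples = np.array([sample_joint() for _ in range(100_000)])
# Monte Carlo check of a marginal: p(c=1) = sum_{a,b} p(a) p(b|a) p(c|b) = 0.43 exactly.
print(samples[:, 2].mean())                     # close to 0.43
```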
Chapter 17: Monte Carlo Methods
Chapter 17: Monte Carlo Methods Monte Carlo estimation approximates expectations with samples; importance sampling reweights proposals to reduce variance, and MCMC methods like Gibbs sampling generate dependent samples when direct sampling is impossible. Tempering improves mixing across multimodal landscapes.
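A small importance-sampling example: estimating E_p[x^2] under a standard normal p using samples from a wider proposal q = N(0, 2^2), reweighted by p(x)/q(x); the choice of p, q, and the integrand is mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Importance sampling (Chapter 17): E_p[f(x)] = E_q[(p(x)/q(x)) f(x)].
# Here p = N(0, 1), q = N(0, 2^2), f(x) = x^2, so the exact answer is Var_p(x) = 1.
n = 100_000
x = rng.normal(0.0, 2.0, size=n)                # samples from the proposal q

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

w = np.exp(log_normal_pdf(x, 0, 1) - log_normal_pdf(x, 0, 2))   # importance weights p/q
print(np.mean(w * x ** 2))                      # close to 1, the exact E_p[x^2]
```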