Deep Learning Book
All Chapters
My notes and implementations while studying the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Chapter 6: Deep Feedforward Networks
Chapter 6.1: XOR Problem & ReLU Networks. How ReLU solves problems that linear models cannot handle.
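The book's closed-form XOR solution is easy to verify numerically. A minimal NumPy sketch, using the hidden weights, biases, and output weights from the chapter's worked example:

```python
import numpy as np

# XOR network from the book's worked example: one hidden ReLU layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all four XOR inputs
W = np.array([[1, 1], [1, 1]])                   # hidden-layer weights
c = np.array([0, -1])                            # hidden-layer biases
w = np.array([1, -2])                            # output weights

h = np.maximum(0, X @ W + c)   # ReLU hidden layer
y = h @ w                      # linear output layer
print(y)                       # [0 1 1 0]: XOR, which no linear model can fit
```

The ReLU bends the input space so that the four points become linearly separable for the output layer.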
Chapter 6.2: Likelihood-Based Loss Functions. The mathematical connection between probabilistic models and loss functions.
Chapter 6.3: Hidden Units and Activation Functions. Exploring activation functions and their impact on neural network learning.
Chapter 6.4: Architecture Design - Depth vs Width. How depth enables hierarchical feature reuse and exponential expressiveness.
Chapter 6.5: Back-Propagation and Other Differentiation Algorithms. The algorithm that makes training deep networks computationally feasible.
Chapter 7: Regularization for Deep Learning
Chapter 7 Prerequisites: Hessian Matrix, Definiteness, and Curvature. Essential second-order calculus concepts needed before Chapter 7.
Chapter 7.1.1: L2 Regularization. How L2 regularization shrinks weights based on Hessian eigenvalues.
Chapter 7.1.2: L1 Regularization. How L1 regularization uses soft thresholding to create sparse solutions.
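The soft-thresholding rule can be sketched in a few lines, assuming the chapter's diagonal-Hessian approximation (here `alpha` is the L1 strength and `h_diag` the Hessian diagonal; both names are illustrative):

```python
import numpy as np

def soft_threshold(w_star, alpha, h_diag):
    """L1-regularized solution under a diagonal-Hessian approximation:
    w_i = sign(w_i*) * max(|w_i*| - alpha / H_ii, 0)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

w_star = np.array([0.3, -1.5, 0.05])
print(soft_threshold(w_star, alpha=0.2, h_diag=np.ones(3)))
# the 0.05 weight falls below the threshold and is clipped exactly to zero
```

This is the mechanism behind L1's sparsity: weights whose unregularized magnitude is below the threshold are set exactly to zero, not merely shrunk.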
Chapter 7.2: Constrained Optimization View of Regularization. Regularization as constrained optimization with KKT conditions.
Chapter 7.3: Regularization and Under-Constrained Problems. Why regularization is mathematically necessary and how it ensures invertibility.
Chapter 7.4: Dataset Augmentation. How transforming existing data improves generalization.
Chapter 7.5: Noise Robustness. How adding Gaussian noise to weights is equivalent to penalizing large gradients.
Chapter 7.6: Semi-Supervised Learning. Leveraging unlabeled data to improve model performance when labeled data is scarce.
Chapter 7.7: Multi-Task Learning. Training a single model on multiple related tasks to improve generalization.
Chapter 7.8: Early Stopping. Early stopping as implicit L2 regularization: fewer training steps means stronger regularization.
Chapter 7.9: Parameter Tying and Parameter Sharing. Two strategies for reducing parameters: encouraging similarity vs. enforcing identity.
Chapter 7.10: Sparse Representations. Regularizing learned representations rather than model parameters: the difference between parameter sparsity and representation sparsity.
Chapter 7.11: Bagging and Other Ensemble Methods. How training multiple models on bootstrap samples and averaging their predictions reduces variance.
Chapter 7.12: Dropout. Dropout as a computationally efficient alternative to bagging: training an ensemble of subnetworks by randomly dropping units.
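A sketch of inverted dropout, the common practical variant that folds the book's test-time weight scaling into the training pass (the `p_keep` parameter name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_keep=0.8, train=True):
    """Inverted dropout: zero each unit with probability 1 - p_keep and
    rescale survivors by 1/p_keep, so no rescaling is needed at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) < p_keep
    return h * mask / p_keep

h = np.ones((2, 5))
print(dropout(h))               # each entry is 0 or 1/0.8 = 1.25
print(dropout(h, train=False))  # identity at test time
```

Each random mask selects one subnetwork from the exponentially large ensemble; the rescaling keeps the expected activation unchanged.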
Chapter 7.13: Adversarial Training. How training on adversarial examples improves model robustness by reducing sensitivity to imperceptible perturbations.
Chapter 7.14: Tangent Distance, Tangent Prop and Manifold Tangent Classifier. Enforcing invariance along manifold tangent directions to regularize models against meaningful transformations.
Chapter 8: Optimization for Training Deep Models
Chapter 8.1: How Learning Differs from Pure Optimization. Why machine learning optimization is fundamentally different from pure optimization, and why mini-batch methods work.
Chapter 8.2: Challenges in Deep Learning Optimization. Why deep learning optimization is hard: ill-conditioning, local minima, saddle points, exploding gradients, and the theoretical limits of optimization.
Chapter 8.3: Basic Algorithms. SGD, momentum, and Nesterov momentum: the foundational algorithms for training deep neural networks.
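The momentum update can be sketched in a few lines (the book's rule: the velocity accumulates an exponentially decaying average of past gradients; parameter names are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    """One step of SGD with momentum:
    v <- alpha * v - lr * grad;  theta <- theta + v."""
    v = alpha * v - lr * grad
    return theta + v, v

theta, v = np.zeros(2), np.zeros(2)
grad = np.array([1.0, -2.0])
theta, v = sgd_momentum_step(theta, v, grad)
print(theta, v)  # first step equals plain SGD: [-0.01  0.02]
```

If the gradient stays constant, the velocity converges to lr * grad / (1 - alpha): with alpha = 0.9, a 10x effective step size over plain SGD.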
Chapter 8.4: Parameter Initialization Strategies. Why initialization matters: breaking symmetry, avoiding null spaces, and finding the right balance between convergence and generalization.
Chapter 8.5: Algorithms with Adaptive Learning Rates. From AdaGrad to Adam: how adaptive learning rates automatically tune optimization for each parameter.
Chapter 8.6: Second-Order Optimization Methods. Newton's method, Conjugate Gradient, and BFGS: elegant methods that use curvature information, but are rarely used in deep learning due to computational cost.
Chapter 8.7: Optimization Strategies and Meta-Algorithms. Batch normalization, coordinate descent, Polyak averaging, supervised pretraining, and continuation methods that enhance training efficiency and stability.
Chapter 9: Convolutional Networks
Chapter 9.1: Convolution Computation. The mathematical foundation of CNNs: from continuous convolution to discrete 2D operations. Why deep learning uses cross-correlation (not true convolution), and how parameter sharing and translation equivariance make CNNs powerful for spatial data.
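The cross-correlation vs. convolution distinction can be sketched directly: a naive valid-mode cross-correlation, with true convolution recovered by flipping the kernel in both spatial axes (`cross_correlate2d` is a hypothetical helper for illustration, not framework code):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: what deep learning frameworks
    actually compute under the name 'convolution' (no kernel flip)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
k = np.array([[1.0, 2.0], [3.0, 4.0]])
print(cross_correlate2d(img, k)[0, 0])              # 34.0: kernel applied as-is
print(cross_correlate2d(img, k[::-1, ::-1])[0, 0])  # 16.0: true convolution flips the kernel
```

Since the kernel entries are learned anyway, the flip is irrelevant in practice, which is why libraries skip it.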
Chapter 9.2: Motivation for Convolutional Networks. Why CNNs dominate computer vision: sparse interactions reduce parameters from O(m·n) to O(k·n), parameter sharing reuses one kernel across all positions, and the resulting translation equivariance ensures patterns are detected anywhere, yielding a 30,000× speedup over dense layers.
Chapter 9.3: Pooling. Downsampling through local aggregation: max pooling provides translation invariance by selecting the strongest activations, reducing a 28×28 feature map to 14×14 (4× fewer activations). Compares three architectures: strided convolutions, max pooling networks, and global average pooling, which eliminates fully connected layers.
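The 28×28 to 14×14 reduction can be sketched with a reshape-based, non-overlapping 2×2 max pool (a minimal sketch; real libraries also handle strides, padding, and batches):

```python
import numpy as np

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling via reshaping: each output unit
    keeps the strongest activation in its 2x2 window."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.random.default_rng(0).random((28, 28))
pooled = max_pool2x2(fmap)
print(fmap.shape, '->', pooled.shape)  # (28, 28) -> (14, 14): 4x fewer activations
```

Because each window reports only its maximum, shifting the input by one pixel usually leaves most pooled values unchanged, which is the source of the (local) translation invariance.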
Chapter 9.4: Convolution and Pooling as an Infinitely Strong Prior. Why CNNs work on images but not everywhere: architectural constraints (local connectivity + weight sharing) act as infinitely strong Bayesian priors, assigning probability 1 to translation-equivariant functions and 0 to all others. The bias-variance trade-off: strong priors reduce sample complexity, but only when the assumptions match the data structure.
Chapter 9.5: Convolutional Functions. Mathematical details of convolution operations: deep learning uses cross-correlation (not true convolution), the multi-channel formula \(Z_{l,x,y} = \sum_{i,j,k} V_{i,x+j-1,y+k-1} K_{l,i,j,k}\), stride for downsampling, three padding strategies (valid/same/full), and gradient computation: kernel gradients via correlation with the input, input gradients via convolution with the flipped kernel.
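The multi-channel formula translates directly into code. A deliberately loop-based, unoptimized sketch with 0-based indices, valid padding, and stride 1, where `V[i]` is input channel i and `K[l, i]` the kernel connecting it to output channel l:

```python
import numpy as np

def multichannel_conv(V, K):
    """Direct implementation of Z[l,x,y] = sum_{i,j,k} V[i, x+j, y+k] * K[l,i,j,k]
    (0-based indices, 'valid' padding, stride 1)."""
    C_in, H, W = V.shape
    C_out, _, kh, kw = K.shape
    Z = np.zeros((C_out, H - kh + 1, W - kw + 1))
    for l in range(C_out):
        for x in range(Z.shape[1]):
            for y in range(Z.shape[2]):
                Z[l, x, y] = np.sum(V[:, x:x + kh, y:y + kw] * K[l])
    return Z

V = np.random.default_rng(0).random((3, 5, 5))     # 3 input channels, 5x5
K = np.random.default_rng(1).random((8, 3, 2, 2))  # 8 output channels, 2x2 kernels
print(multichannel_conv(V, K).shape)  # (8, 4, 4)
```

Each output value sums over all input channels and kernel offsets at once, which is why a single conv layer mixes channels even though its spatial receptive field is tiny.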
Chapter 9.6: Structured Outputs. CNNs can generate high-dimensional structured objects through pixel-level predictions. Preserving spatial dimensions (no pooling, no stride > 1, SAME padding) enables full-resolution outputs. Recurrent convolution refines predictions iteratively, \(H^{(t)} = U * X + W * H^{(t-1)}\), producing dense predictions for segmentation, depth estimation, and flow prediction.