Deep Learning Book
All Chapters
My notes and implementations while studying the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Chapter 6: Deep Feedforward Networks
Chapter 6.1: XOR Problem & ReLU Networks How ReLU solves problems that linear models cannot handle. (A hand-built XOR forward pass is sketched after this chapter's entries.)
Chapter 6.2: Likelihood-Based Loss Functions The mathematical connection between probabilistic models and loss functions.
Chapter 6.3: Hidden Units and Activation Functions Exploring activation functions and their impact on neural network learning.
Chapter 6.4: Architecture Design - Depth vs Width How depth enables hierarchical feature reuse and exponential expressiveness.
Chapter 6.5: Back-Propagation and Other Differentiation Algorithms The algorithm that makes training deep networks computationally feasible.
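To make 6.1 concrete, here is a minimal NumPy sketch of a hand-built ReLU network that computes XOR, using the well-known textbook construction W = [[1,1],[1,1]], c = [0,-1], w = [1,-2], b = 0; nothing is trained, and the NumPy layout is my own illustrative choice.

```python
import numpy as np

# Hand-built XOR network from 6.1: f(x) = w^T max(0, x^T W + c) + b.
# The weights are fixed by hand; nothing is learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # all four XOR inputs
W = np.array([[1, 1], [1, 1]])                  # hidden-layer weights
c = np.array([0, -1])                           # hidden-layer biases
w = np.array([1, -2])                           # output weights
b = 0                                           # output bias

h = np.maximum(0, X @ W + c)  # ReLU hidden layer bends the space so XOR becomes separable
y = h @ w + b
print(y)                      # -> [0 1 1 0], the XOR truth table
```

The hidden layer maps (0,1) and (1,0) to the same point (1,0), after which a single linear output unit can separate the positive and negative cases.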
Chapter 7: Regularization for Deep Learning
Chapter 7 Prerequisites: Hessian Matrix, Definiteness, and Curvature Essential second-order calculus concepts needed before Chapter 7.
Chapter 7.1.1: L2 Regularization How L2 regularization shrinks weights based on Hessian eigenvalues.
Chapter 7.1.2: L1 Regularization L1 regularization uses soft thresholding to create sparse solutions.
Chapter 7.2: Constrained Optimization View of Regularization Regularization as constrained optimization with KKT conditions.
Chapter 7.3: Regularization and Under-Constrained Problems Why regularization is mathematically necessary for under-determined problems and how it makes otherwise singular matrices (e.g., \(X^\top X\)) invertible.
Chapter 7.4: Dataset Augmentation How transforming existing data improves generalization.
Chapter 7.5: Noise Robustness How adding Gaussian noise to weights is equivalent to penalizing large gradients.
Chapter 7.6: Semi-Supervised Learning Leveraging unlabeled data to improve model performance when labeled data is scarce.
Chapter 7.7: Multi-Task Learning Training a single model on multiple related tasks to improve generalization.
Chapter 7.8: Early Stopping Early stopping as implicit L2 regularization: fewer training steps correspond to stronger regularization.
Chapter 7.9: Parameter Tying and Parameter Sharing Two strategies for reducing parameters: encouraging similarity vs. enforcing identity.
Chapter 7.10: Sparse Representations Regularizing learned representations rather than model parameters: the difference between parameter sparsity and representation sparsity.
Chapter 7.11: Bagging and Other Ensemble Methods How training multiple models on bootstrap samples and averaging their predictions reduces variance.
Chapter 7.12: Dropout Dropout as a computationally efficient alternative to bagging: training an ensemble of subnetworks by randomly dropping units. (An inverted-dropout sketch follows this chapter's entries.)
Chapter 7.13: Adversarial Training How training on adversarial examples improves model robustness by reducing sensitivity to imperceptible perturbations.
Chapter 7.14: Tangent Distance, Tangent Prop and Manifold Tangent Classifier Enforcing invariance along manifold tangent directions to regularize models against meaningful transformations.
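As a companion to 7.12, a minimal NumPy sketch of inverted dropout; the keep probability of 0.8, the toy activations, and the seed are arbitrary choices, not anything prescribed by the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, keep_prob=0.8, train=True):
    """Inverted dropout: zero units with probability 1 - keep_prob and rescale
    survivors by 1/keep_prob so the expected activation is unchanged."""
    if not train:
        return h                      # at test time the full network is used as-is
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob       # each forward pass samples a different subnetwork

h = np.ones((2, 5))                   # toy hidden activations
print(dropout(h))                     # roughly 20% of entries zeroed, survivors scaled to 1.25
print(dropout(h, train=False))        # unchanged at inference
```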
Chapter 8: Optimization for Training Deep Models
Chapter 8.1: How Learning Differs from Pure Optimization Why machine learning optimization is fundamentally different from pure optimization and why mini-batch methods work.
Chapter 8.2: Challenges in Deep Learning Optimization Understanding why deep learning optimization is hard: ill-conditioning, local minima, saddle points, exploding gradients, and the theoretical limits of optimization.
Chapter 8.3: Basic Algorithms SGD, momentum, and Nesterov momentum: the foundational algorithms for training deep neural networks. (A momentum-update sketch follows this chapter's entries.)
Chapter 8.4: Parameter Initialization Strategies Why initialization matters: breaking symmetry, avoiding null spaces, and finding the right balance for convergence and generalization.
Chapter 8.5: Algorithms with Adaptive Learning Rates From AdaGrad to Adam: how adaptive learning rates automatically tune optimization for each parameter.
Chapter 8.6: Second-Order Optimization Methods Newton’s method, Conjugate Gradient, and BFGS: elegant methods using curvature information, but rarely used in deep learning due to computational cost.
Chapter 8.7: Optimization Strategies and Meta-Algorithms Advanced optimization strategies: batch normalization, coordinate descent, Polyak averaging, supervised pretraining, and continuation methods that enhance training efficiency and stability.
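To accompany 8.3, a sketch of the classical momentum update on a deliberately ill-conditioned toy quadratic; the matrix, learning rate, and momentum coefficient are hand-picked for illustration, not values from the book.

```python
import numpy as np

# Classical momentum (8.3): v <- alpha * v - lr * grad;  theta <- theta + v.
# Toy quadratic objective f(theta) = 0.5 * theta^T A theta with an ill-conditioned A.
A = np.diag([1.0, 100.0])          # condition number 100 makes plain SGD oscillate
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
lr, alpha = 0.009, 0.9             # step size and momentum coefficient (hand-picked)

for _ in range(200):
    grad = A @ theta               # gradient of the quadratic
    v = alpha * v - lr * grad      # velocity: exponentially decaying average of gradients
    theta = theta + v

print(theta)                       # close to the minimum at the origin
```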
Chapter 9: Convolutional Networks
Chapter 9.1: Convolution Computation The mathematical foundation of CNNs: from continuous convolution to discrete 2D operations. Understand why deep learning uses cross-correlation (not true convolution), and how parameter sharing and translation equivariance make CNNs powerful for spatial data. (A minimal cross-correlation sketch follows this chapter's entries.)
Chapter 9.2: Motivation for Convolutional Networks Why CNNs dominate computer vision: sparse interactions reduce parameters from O(m·n) to O(k·n), parameter sharing drastically cuts storage, and translation equivariance ensures patterns are detected anywhere—achieving 30,000× speedup over dense layers.
Chapter 9.3: Pooling Downsampling through local aggregation: max pooling provides translation invariance by selecting strongest activations, reducing 28×28 feature maps to 14×14 (4× fewer activations). Comparing three architectures—strided convolutions, max pooling networks, and global average pooling that eliminates fully connected layers.
Chapter 9.4: Convolution and Pooling as an Infinitely Strong Prior Why CNNs work on images but not everywhere: architectural constraints (local connectivity + weight sharing) act as infinitely strong Bayesian priors, assigning probability 1 to translation-equivariant functions and 0 to all others. The bias-variance trade-off—strong priors reduce sample complexity but only when assumptions match the data structure.
Chapter 9.5: Convolutional Functions Mathematical details of convolution operations: deep learning uses cross-correlation (not true convolution), multi-channel formula \(Z_{l,x,y} = \sum_{i,j,k} V_{i,x+j-1,y+k-1} K_{l,i,j,k}\), stride for downsampling, three padding strategies (valid/same/full), and gradient computation—kernel gradients via correlation with input, input gradients via convolution with flipped kernel.
Chapter 9.6: Structured Outputs CNNs can generate high-dimensional structured objects through pixel-level predictions. Preserving spatial dimensions (no pooling, no stride > 1, SAME padding) enables full-resolution outputs. Recurrent convolution refines predictions iteratively: \(H^{(t)} = U * X + W * H^{(t-1)}\), producing dense predictions for segmentation, depth estimation, and flow prediction.
Chapter 9.7: Data Types CNNs can operate on different data types: 1D (audio, time series), 2D (images), and 3D (videos, CT scans) with varying channel counts. Unlike fully connected networks, convolutional kernels handle variable-sized inputs by sliding across spatial dimensions, producing outputs that scale accordingly—a unique flexibility for diverse domains.
Chapter 9.8: Efficient Convolution Algorithms Separable convolution reduces computational cost from \(O(HWk^2)\) to \(O(HWk)\) by decomposing a 2D kernel into two 1D filters (vertical and horizontal). Parameter storage shrinks from \(k^2\) to \(2k\). This factorization enables faster, more memory-efficient models without sacrificing accuracy—foundational for architectures like MobileNet.
Chapter 9.9: Unsupervised or Semi-Supervised Feature Learning Before CNNs, computer vision relied on hand-crafted kernels (Sobel, Laplacian, Gaussian) and unsupervised methods (sparse coding, autoencoders, k-means). While these captured simple patterns, they couldn’t match CNNs’ hierarchical, end-to-end feature learning. Modern systems use CNNs to learn features from edges to semantic concepts—making hand-crafted filters largely obsolete.
Chapter 9.10: Neuroscientific Basis for Convolutional Networks V1 simple cells detect oriented edges (modeled by Gabor filters), complex cells pool over simple cells for translation invariance (like CNN pooling). But CNNs lack key biological features: saccadic attention, multisensory integration, top-down feedback, and dynamic receptive fields. While CNNs excel at feed-forward recognition, biological vision is holistic, context-aware, and adaptive.
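For 9.1 and 9.5, a naive single-channel "valid" cross-correlation in NumPy, i.e., what deep learning frameworks call convolution; the loop-based layout and the toy input and kernel are my own, and real frameworks use far more efficient implementations.

```python
import numpy as np

def cross_correlate2d(V, K):
    """'Valid' 2D cross-correlation:
    Z[x, y] = sum_{j,k} V[x + j, y + k] * K[j, k]  (the kernel is NOT flipped)."""
    H, W = V.shape
    kh, kw = K.shape
    Z = np.zeros((H - kh + 1, W - kw + 1))
    for x in range(Z.shape[0]):
        for y in range(Z.shape[1]):
            Z[x, y] = np.sum(V[x:x + kh, y:y + kw] * K)
    return Z

V = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 single-channel input
K = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
print(cross_correlate2d(V, K))                 # 3x3 output ("valid" padding)
```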
Chapter 10: Sequence Modeling
Chapter 10.1: Unfold Computation Graph Unfolding computation graphs in RNNs enables parameter sharing across time steps. The same function with fixed parameters processes sequences of any length, compressing input history into fixed-size hidden states that retain only task-relevant information for predictions.
Chapter 10.2: Recurrent Neural Networks RNN architecture with hidden-to-hidden connections, teacher forcing for parallel training, back-propagation through time (BPTT), RNN as directed graphical models with O(τ) parameter efficiency, and context-based sequence-to-sequence models.
Chapter 10.3: Bidirectional RNN Bidirectional RNNs process sequences in both forward and backward directions, allowing predictions to use information from the entire input sequence. Essential for tasks like speech recognition and handwriting recognition where future context matters.
Chapter 10.4: Encoder-Decoder Sequence-to-Sequence Architecture The seq2seq architecture handles variable-length input and output sequences by compressing the input into a fixed context vector C, then decoding it step-by-step. This enables machine translation, summarization, and dialogue generation where input and output lengths differ.
Chapter 10.5: Deep Recurrent Networks Three architectural patterns for adding depth to RNNs: hierarchical hidden states (vertical stacking), deep transition RNNs (an MLP replaces the hidden-to-hidden transition), and deep transitions with skip connections (residual paths for gradient flow).
Chapter 10.6: Recursive Neural Network Recursive neural networks compute over tree structures rather than linear chains, applying shared composition functions at internal nodes to build hierarchical representations bottom-up. This reduces computation depth from O(τ) to O(log τ), but requires the tree structure to be specified externally.
Chapter 10.7: The Challenge of Long-Term Dependencies The fundamental challenge of long-term dependencies in RNNs is training difficulty: gradients propagated across many time steps either vanish exponentially (common) or explode (rare but severe). Eigenvalue analysis shows how powers of the transition matrix govern this instability.
Chapter 10.8: Echo State Networks ESNs fix recurrent weights and train only output weights, viewing the network as a dynamical reservoir. Setting the spectral radius near one enables long-term memory retention. Learning reduces to linear regression on hidden states, avoiding backpropagation through time—showing that carefully designed dynamics can capture temporal structure.
Chapter 10.9: Leaky Units and Multiple Time Scales Leaky units separate instantaneous state from long-term integration using \(u^t = \alpha u^{t-1}+(1-\alpha)v^t\). Multiple time scale strategies include temporal skip connections (direct pathways across time steps) and removing short connections (forcing coarser time scales) to address long-term dependencies.
Chapter 10.10: LSTM and GRU LSTM uses learned gates (forget, input, output) to control information flow through explicit cell state paths, enabling adaptive long-term memory retention. GRU simplifies this design by merging forget and input into a single update gate, reducing parameters while maintaining the ability to capture long-range dependencies through gating mechanisms.
Chapter 10.11: Optimizing Long-Term Dependencies Gradient clipping prevents training instability by rescaling gradients when their norm exceeds a threshold, protecting against sudden jumps across steep gradient cliffs. Regularizing information flow aims to maintain \(\|\partial h^t/\partial h^{t-1}\| \approx 1\), though this is rarely used in practice due to computational cost. (A gradient-clipping sketch follows this chapter's entries.)
Chapter 10.12: Explicit Memory Explicit memory separates storage from computation by introducing addressable memory outside network parameters. Classic architectures (Memory Networks, Neural Turing Machines) are computationally expensive but inspired modern mechanisms—attention, Transformers, and retrieval-augmented models are successful, scalable realizations of this idea.
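For 10.11, a sketch of gradient clipping by global norm; the threshold and the toy gradient tensors are arbitrary examples.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Gradient clipping (10.11): if the global norm of all gradients exceeds
    the threshold, rescale them to have norm exactly `threshold`,
    preserving the update direction."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

# Toy gradients for two parameter tensors; the values are arbitrary.
grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, threshold=5.0)
print(clipped)   # same direction, global norm rescaled to 5
```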
Chapter 11: Practical Methodology
Chapter 11: Practical Methodology Define performance metrics aligned with application goals. Build baseline systems quickly using architecture patterns matched to data structure. Diagnose by comparing training and test error. Tune learning rate first, then other hyperparameters via random search. Debug systematically by isolating components, checking gradients, and monitoring numerical behavior.
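A sketch of the random-search step described above, sampling the learning rate log-uniformly because its scale matters more than its exact value; the particular ranges and hyperparameter names are illustrative assumptions, not recommendations from the book.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_config():
    """One random-search draw: learning rate log-uniform, other hyperparameters
    from plausible (hand-chosen) ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # between 1e-5 and 1e-1
        "momentum": rng.uniform(0.5, 0.99),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }

# Draw a handful of configurations; each would be trained and scored on validation data.
for cfg in (sample_config() for _ in range(5)):
    print(cfg)
```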
Chapter 12: Applications
Chapter 12.1: Large-Scale Deep Learning Scaling deep learning requires specialized hardware (GPUs, TPUs, ASICs), distributed training strategies (data/model parallelism, asynchronous SGD), and efficiency optimizations (model compression, quantization, pruning). Dynamic computation enables conditional execution for computational efficiency. Specialized accelerators exploit reduced precision and massive parallelism for deployment on resource-constrained devices.
Chapter 12.2: Image Preprocessing and Normalization Preprocessing images through normalization (scaling to [0,1] or [-1,1]) and data augmentation improves training stability and generalization. Global Contrast Normalization (GCN) removes global lighting variations by centering and L2-normalizing images. Local Contrast Normalization (LCN) enhances local structures by normalizing within spatial neighborhoods. Modern networks rely on batch normalization, but explicit contrast normalization remains valuable for challenging datasets. (A GCN sketch follows this chapter's entries.)
Chapter 12.3: Automatic Speech Recognition ASR evolution from GMM-HMM (classical statistical approach) through DNN-HMM (~30% error reduction with deep feedforward networks) to end-to-end systems using RNNs/LSTMs with CTC. CNNs treat spectrograms as 2D structures for frequency-invariant modeling. Modern systems learn direct acoustic-to-text mappings without forced alignment, integrating joint acoustic-phonetic modeling and hierarchical representations.
Chapter 12.4: NLP Applications N-gram models compute conditional probabilities over fixed contexts but suffer from sparsity and exponential growth. Neural language models use word embeddings to map discrete tokens into continuous space, enabling generalization across semantically similar words. High-dimensional vocabulary outputs require optimization: short lists partition frequent/rare words, hierarchical softmax reduces complexity to O(log|V|), and importance sampling approximates gradients. Attention mechanisms dynamically focus on relevant input positions, forming weighted context vectors that alleviate fixed-size representation bottlenecks in seq2seq tasks.
Chapter 12.5: Other Applications Collaborative filtering uses matrix factorization to learn latent user and item embeddings, decomposing ratings into user bias, item bias, and personalized interaction. Cold-start problems require side information. Recommendation systems face exploration-exploitation tradeoffs modeled as contextual bandits. Knowledge graphs represent facts as (subject, relation, object) triples; deep learning maps entities and relations to continuous embeddings for link prediction and reasoning. Evaluation challenges arise from open-world assumptions where unseen facts may be missing rather than false.
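For 12.2, a sketch of Global Contrast Normalization along the lines of the book's formula: per-image mean subtraction, then division by a regularized contrast. The constants s, lam, and eps, the channel layout, and the random toy batch are placeholder choices.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=10.0, eps=1e-8):
    """GCN (12.2): subtract each image's mean, then rescale so its contrast
    (root mean squared deviation) is roughly s. lam regularizes near-constant
    images; eps avoids division by zero. X has shape (batch, H, W, channels)."""
    X = X.astype(float)
    mean = X.mean(axis=(1, 2, 3), keepdims=True)             # per-image mean
    X = X - mean
    contrast = np.sqrt(lam + (X ** 2).mean(axis=(1, 2, 3), keepdims=True))
    return s * X / np.maximum(contrast, eps)

batch = np.random.default_rng(0).integers(0, 256, size=(2, 32, 32, 3))
print(global_contrast_normalize(batch).std(axis=(1, 2, 3)))  # per-image contrasts now comparable
```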
Chapter 13: Unsupervised Learning
Chapter 13: Linear Factor Models Linear factor models decompose observed data into latent factors: \(x = Wh + b + \text{noise}\). PCA uses Gaussian priors for dimensionality reduction. ICA recovers statistically independent non-Gaussian sources for signal separation. SFA learns slowly-varying features via temporal coherence. Sparse coding enforces L1 sparsity for interpretable representations. These models form the foundation for modern unsupervised learning and generative models.
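A quick NumPy check of the linear-factor-model picture: data generated as x = Wh + b + noise with two latent factors is handed to PCA (via SVD of the centered data), whose leading components recover the factor subspace. The dimensions, noise level, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate data from a linear factor model x = W h + b + noise (Chapter 13),
# with 2 latent factors embedded in 5 observed dimensions.
n, d, k = 1000, 5, 2
W = rng.normal(size=(d, k))            # true factor loadings (normally unknown)
b = rng.normal(size=d)
H = rng.normal(size=(n, k))            # latent factors h ~ N(0, I)
X = H @ W.T + b + 0.05 * rng.normal(size=(n, d))   # small isotropic noise

# PCA via SVD of the centered data: the top right singular vectors span
# (approximately) the same subspace as the columns of W.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(S ** 2 / n)                      # two large eigenvalues (factors), three near the noise floor
```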
Chapter 14: Autoencoders Autoencoders learn compressed representations by training encoder \(h=f(x)\) and decoder \(r=g(h)\) to reconstruct inputs through a bottleneck. Undercomplete autoencoders constrain \(\dim(h)<\dim(x)\) to learn meaningful features—linear versions recover PCA subspace. Regularized variants include sparse autoencoders (L1 penalty interpreted as Laplace prior on latent codes), denoising autoencoders (learn manifold structure by reconstructing from corrupted inputs \(\tilde{x}\)), and contractive autoencoders (penalize Jacobian \(\|\partial f/\partial x\|_F^2\) to encourage local invariance). Two competing forces—reconstruction accuracy vs regularization—drive autoencoders to become sensitive along data manifolds while contracting orthogonally. Applications include dimensionality reduction, semantic hashing for fast retrieval, and manifold learning via parametric coordinate systems.
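A minimal undercomplete linear autoencoder trained with hand-derived gradients in NumPy; the dimensions, learning rate, step count, and toy low-rank data are arbitrary, and a real implementation would use an autodiff framework and a nonlinear encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Undercomplete linear autoencoder (Chapter 14): encoder h = x W1, decoder r = h W2,
# with a bottleneck of k = 3 units. Trained by gradient descent on mean squared
# reconstruction error; the data is constructed to lie in a 3-D subspace.
n, d, k = 500, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
W1 = 0.1 * rng.normal(size=(d, k))
W2 = 0.1 * rng.normal(size=(k, d))
lr = 0.02

def loss():
    return np.mean((X - X @ W1 @ W2) ** 2)

print("before:", loss())
for _ in range(3000):
    H = X @ W1                        # encode
    R = H @ W2                        # decode
    dR = -2.0 * (X - R) / X.size      # gradient of mean squared error w.r.t. R
    dW2 = H.T @ dR                    # backprop through the decoder
    dW1 = X.T @ (dR @ W2.T)           # backprop through the encoder
    W1 -= lr * dW1
    W2 -= lr * dW2
print("after:", loss())               # reconstruction error drops by orders of magnitude
```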
Chapter 15: Representation Learning Greedy layer-wise pretraining learns meaningful representations through unsupervised learning of hierarchical features, providing better initialization than random weights. Transfer learning enables knowledge sharing across tasks by reusing learned representations—generic early-layer features transfer well while late layers adapt to task-specific patterns. Semi-supervised learning leverages both labeled and unlabeled data to discover disentangled causal factors that generate observations. Distributed representations enable exponential gains in capacity through shared features rather than symbolic codes. Depth provides exponential advantages via compositional hierarchies that match natural data structure. Multiple inductive biases (smoothness, sparsity, temporal coherence, manifolds) guide networks toward discovering meaningful underlying causes.
Chapter 16: Structured Probabilistic Models for Deep Learning Structured probabilistic models use graphs to factorize high-dimensional distributions into tractable components by encoding conditional independence. Directed models (Bayesian networks) represent causal relationships via \(p(x)=\prod_i p(x_i|Pa(x_i))\) and support efficient ancestral sampling. Undirected models (Markov random fields) capture symmetric dependencies through clique potentials \(\tilde{p}(x)=\prod_C \phi_C(x_C)\), normalized by partition function Z. Energy-based models express potentials as \(\exp(-E(x))\) via Boltzmann distribution. Converting between directed and undirected models requires moralization or triangulation, typically adding edges and losing independence. Factor graphs explicitly represent factorization structure for message-passing inference. Deep learning embraces approximate inference with large latent variable models, prioritizing scalability over exact computations through distributed representations and parameter sharing.
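To illustrate ancestral sampling in a directed model, a toy chain p(a, b, c) = p(a) p(b|a) p(c|b) with made-up conditional probability tables; sampling follows the topological order a, then b, then c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from a toy directed model p(a, b, c) = p(a) p(b|a) p(c|b):
# each variable is sampled after its parents, in topological order.
def sample_joint():
    a = rng.random() < 0.5                      # p(a=1) = 0.5
    b = rng.random() < (0.9 if a else 0.2)      # p(b=1|a): made-up conditional table
    c = rng.random() < (0.7 if b else 0.1)      # p(c=1|b): made-up conditional table
    return a, b, c

samples = np.array([sample_joint() for _ in range(100_000)])
# Monte Carlo check of a marginal: p(c=1) = sum_{a,b} p(a) p(b|a) p(c|b) = 0.43 exactly.
print(samples[:, 2].mean())                     # close to 0.43
```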
Chapter 17: Monte Carlo Methods
Chapter 17: Monte Carlo Methods Monte Carlo estimation approximates expectations with samples; importance sampling reweights proposals to reduce variance, and MCMC methods like Gibbs sampling generate dependent samples when direct sampling is impossible. Tempering improves mixing across multimodal landscapes.
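A small importance-sampling example: estimating E_p[x^2] under a standard normal p using samples from a wider proposal q = N(0, 2^2), reweighted by p(x)/q(x); the choice of p, q, and the integrand is mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Importance sampling (Chapter 17): E_p[f(x)] = E_q[(p(x)/q(x)) f(x)].
# Here p = N(0, 1), q = N(0, 2^2), f(x) = x^2, so the exact answer is Var_p(x) = 1.
n = 100_000
x = rng.normal(0.0, 2.0, size=n)                # samples from the proposal q

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

w = np.exp(log_normal_pdf(x, 0, 1) - log_normal_pdf(x, 0, 2))   # importance weights p/q
print(np.mean(w * x ** 2))                      # close to 1, the exact E_p[x^2]
```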