Chapter 7.10: Sparse Representations
Overview
Previous chapters focused on parameter regularization — constraining the weights \(W\) of a model (see Chapter 7.1.2: L1 Regularization and Chapter 7.1.1: L2 Regularization).
This chapter introduces representation regularization — constraining the activations \(h\) (the learned representations).
Key distinction:
- Parameter sparsity: Makes the weight matrix sparse (few connections)
- Representation sparsity: Makes the activation vector sparse (few active neurons)
Parameter Regularization
Equation 7.46 - Linear system with parameter matrix:
\[ \begin{array}{c} \begin{bmatrix} 18\\ 5\\ 15\\ -9\\ -3 \end{bmatrix} = \begin{bmatrix} 4 & 0 & 0 & -2 & 0\\ 0 & 0 & -1 & 0 & 3\\ 0 & 5 & 0 & 0 & 0\\ -2 & 0 & 0 & 1 & 0\\ 0 & -1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 2\\ 3\\ -2\\ -5\\ 1 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m,\quad A \in \mathbb{R}^{m\times n},\quad x \in \mathbb{R}^n \end{array} \]
L1 regularization on parameters:
\[ \Omega(A) = \|A\|_1 = \sum_{i,j} |A_{ij}| \]
Effect: Encourages sparsity in the parameter matrix \(A\) itself, making many weights zero.
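A minimal numpy sketch of what the penalty measures, using a hypothetical sparse weight matrix in the spirit of Eq. 7.46 (the values are illustrative, not prescribed by the text):

```python
import numpy as np

# Hypothetical 5x5 parameter matrix with many exactly-zero entries
# (parameter sparsity: few connections).
A = np.array([
    [ 4,  0,  0, -2, 0],
    [ 0,  0, -1,  0, 3],
    [ 0,  5,  0,  0, 0],
    [-2,  0,  0,  1, 0],
    [ 0, -1,  0,  0, 0],
], dtype=float)

# L1 penalty on the parameters: Omega(A) = sum_ij |A_ij|.
omega_A = np.abs(A).sum()

# Fraction of exactly-zero weights -- the sparsity the penalty encourages.
sparsity = (A == 0).mean()
```

Minimizing `omega_A` as part of the training loss pushes individual entries of `A` toward exactly zero, which is why L1 (unlike L2) produces true zeros rather than merely small weights.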
Representation Sparsity
Equation 7.47 - Linear system with sparse representation:
\[ \begin{array}{c} \begin{bmatrix} -14\\ 1\\ 19\\ 2\\ 23 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 2 & -5 & 4 & 1\\ 4 & 2 & -3 & -1 & 1 & 3\\ -1 & 5 & 4 & 2 & -3 & -2\\ 3 & 1 & 2 & -3 & 0 & -3\\ -5 & 4 & -2 & 2 & -5 & -1 \end{bmatrix} \begin{bmatrix} 0\\ 2\\ 0\\ 0\\ -3\\ 0 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m, \quad B \in \mathbb{R}^{m\times n}, \quad h \in \mathbb{R}^n \end{array} \]
L1 regularization on representation:
\[ \Omega(h) = \|h\|_1 = \sum_{i} |h_i| \]
Equation 7.48 - Loss function with representation regularization:
\[ \tilde{J}(\theta; x, y) = J(\theta; x, y) + \alpha \Omega(h) \]
where \(\alpha \in [0, \infty)\) controls the contribution of the norm penalty to the total loss.
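A sketch of Eq. 7.48 for a tiny one-layer model (all names here — `W`, `v`, the ReLU activation, the squared-error loss — are illustrative assumptions, not specified by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-layer model: representation h = relu(W x),
# scalar prediction v . h, squared-error loss.
W = rng.normal(size=(8, 4))
v = rng.normal(size=8)
x = rng.normal(size=4)
y = 1.0

h = np.maximum(W @ x, 0.0)           # the representation h
J = 0.5 * (v @ h - y) ** 2           # unregularized loss J(theta; x, y)

alpha = 0.1                          # strength of the norm penalty
omega_h = np.abs(h).sum()            # Omega(h) = ||h||_1
J_tilde = J + alpha * omega_h        # Eq. 7.48: regularized loss
```

Note the penalty is a function of the activations `h`, which depend on the input `x`, so unlike weight decay it is recomputed for every example in the batch.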
Key Difference: Parameter vs Representation Regularization
| Type | What is Regularized | Effect | Example |
|---|---|---|---|
| Parameter regularization | Weight matrix \(A\) or \(W\) | Makes weights sparse | L1/L2 weight decay |
| Representation regularization | Activation vector \(h\) | Makes activations sparse | Sparse autoencoders |
Interpretation:
- Parameter sparsity: Few connections between layers (network pruning)
- Representation sparsity: Few neurons active at once (sparse coding)
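The two bullets above can be contrasted concretely (values are illustrative only):

```python
import numpy as np

# Parameter sparsity: the weight matrix itself has many zero entries,
# i.e. few connections between layers.
W_sparse = np.array([[0., 2.,  0.],
                     [0., 0., -1.],
                     [3., 0.,  0.]])

# Representation sparsity: the weights are dense, but for this input
# only one unit is active after the ReLU.
W_dense = np.array([[ 1., -2.,  0.5],
                    [-1.,  1., -0.5],
                    [ 2.,  1.,  1.0]])
x = np.array([1., 1., 0.])
h = np.maximum(W_dense @ x, 0.0)   # -> only the last unit fires
```

The first kind of sparsity is a property of the model alone; the second is a joint property of the model and each particular input.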
Orthogonal Matching Pursuit
Equation 7.49 - Sparse approximation problem:
\[ \arg\min_h \|x - Wh\|^2 \quad \text{subject to} \quad \|h\|_0 < k \]
Goal: Find both \(W\) and \(h\) such that:
- \(Wh\) closely approximates \(x\) (reconstruction accuracy)
- \(h\) is sparse — the number of nonzero elements in \(h\) (denoted \(\|h\|_0\)) is less than \(k\)
Interpretation: Use \(Wh\) to represent \(x\) with only a few active components.
When the columns of \(W\) are orthogonal, this problem can be solved efficiently with the orthogonal matching pursuit (OMP) algorithm.
Key insight: Orthogonality of basis vectors (columns of \(W\)) enables efficient sparse decomposition.
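A minimal numpy sketch of the standard OMP greedy loop (textbook form, not taken from the chapter): repeatedly pick the dictionary column most correlated with the residual, then re-fit the coefficients on the selected columns by least squares.

```python
import numpy as np

def omp(W, x, k):
    """Greedy orthogonal matching pursuit: select at most k columns of W
    whose span best reconstructs x."""
    residual = x.astype(float).copy()
    support = []
    for _ in range(k):
        # Atom most correlated with the current residual.
        j = int(np.argmax(np.abs(W.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares refit on all selected atoms.
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        residual = x - W[:, support] @ coef
    h = np.zeros(W.shape[1])
    h[support] = coef
    return h

# With an orthonormal dictionary, OMP recovers a k-sparse code exactly.
W = np.linalg.qr(np.random.default_rng(1).normal(size=(6, 6)))[0]
h_true = np.zeros(6)
h_true[[1, 4]] = [2.0, -3.0]
x = W @ h_true
h_est = omp(W, x, k=2)
```

When `W` is orthonormal, the correlations `W.T @ residual` are exactly the remaining coefficients, so each iteration identifies one true atom and the residual shrinks to zero after `k` steps — this is the efficiency the orthogonality assumption buys.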
Real-World Applications
Note: The following table was generated by ChatGPT.
| Category | Technique / Model | Representation Type | How it Uses Sparsity / Orthogonality | Real-World Example / Effect |
|---|---|---|---|---|
| 🧠 Neuroscience & Vision | Sparse coding (Olshausen & Field, 1996) | Sparse | The brain represents visual inputs by activating only a few neurons for each stimulus. | Explains receptive fields in V1 visual cortex — neurons respond only to specific edges or orientations. |
| 🧮 Machine Learning | LASSO Regression | Sparse | Adds \(\lambda \|w\|_1\) penalty to make weights sparse, selecting only key features. | Feature selection in predictive models (finance, genomics, etc.). |
| 🧩 Autoencoders | Sparse Autoencoder | Sparse | Adds regularizer \(\Omega(h) = \|h\|_1\) or KL divergence to force sparse activations. | Used in image compression and pretraining deep networks (unsupervised learning). |
| 🖼️ Image Processing / Compression | Dictionary Learning / K-SVD | Sparse | Represent an image as a combination of few learned basis patches (atoms). | JPEG-like compression, denoising, super-resolution. |
| 🔊 Speech Processing | Non-negative Matrix Factorization (NMF) | Sparse + Nonnegative | Decomposes sound spectrograms into few additive components. | Source separation (e.g., separating voice from music). |
| 🧠 Transform-based Compression | DCT (Discrete Cosine Transform) | Orthogonal | Projects signals onto orthogonal cosine basis vectors. | JPEG compression — energy concentrated in few coefficients. |
| 📶 Signal Processing | Fourier Transform | Orthogonal | Represents time-domain signals in orthogonal sinusoidal basis. | Audio, RF, vibration analysis. |
| 🖥️ Dimensionality Reduction | Principal Component Analysis (PCA) | Orthogonal | Finds orthogonal axes (principal components) that maximize variance. | Dimensionality reduction, data visualization. |
| 🎙️ Source Separation | Independent Component Analysis (ICA) | Sparse-like / Orthogonal | Seeks statistically independent (often sparse) components. | Blind source separation (“cocktail party problem”). |
| 🤖 CNN Filters | Orthogonal Regularization | Orthogonal | Forces convolution filters to be orthogonal to prevent redundancy. | Improves training stability and generalization. |
| 💾 Transformers | Embedding Orthogonalization | Orthogonal or Near-orthogonal | Orthogonalizes embedding vectors for better separation in latent space. | Improves language model expressiveness and prevents collapse. |
Source: Deep Learning Book, Chapter 7.10