Chapter 7.10: Sparse Representations
Overview
Previous chapters focused on parameter regularization — constraining the weights \(W\) of a model (see Chapter 7.1.2: L1 Regularization and Chapter 7.1.1: L2 Regularization).
This chapter introduces representation regularization — constraining the activations \(h\) (the learned representations).
Key distinction:
- Parameter sparsity: Makes the weight matrix sparse (few connections)
- Representation sparsity: Makes the activation vector sparse (few active neurons)
Parameter Regularization
Equation 7.46 - Linear system with parameter matrix:
\[ \begin{array}{c} \begin{bmatrix} 18\\ 5\\ 15\\ -9\\ -3 \end{bmatrix} = \begin{bmatrix} 4 & 0 & 0 & -2 & 0\\ 0 & 0 & -1 & 0 & 3\\ 0 & 5 & 0 & 0 & 0\\ -2 & 0 & 0 & 1 & 0\\ 0 & -1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 2\\ 3\\ -2\\ -5\\ 1 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m,\quad A \in \mathbb{R}^{m\times n},\quad x \in \mathbb{R}^n \end{array} \]
L1 regularization on parameters:
\[ \Omega(A) = \|A\|_1 = \sum_{i,j} |A_{ij}| \]
Effect: Encourages sparsity in the parameter matrix \(A\) itself, making many weights zero.
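A minimal numpy sketch of what the penalty measures, using a hypothetical sparse weight matrix in the spirit of Eq. 7.46 (the values are illustrative, not prescribed by the text):

```python
import numpy as np

# Hypothetical 5x5 parameter matrix with many exactly-zero entries
# (parameter sparsity: few connections).
A = np.array([
    [ 4,  0,  0, -2, 0],
    [ 0,  0, -1,  0, 3],
    [ 0,  5,  0,  0, 0],
    [-2,  0,  0,  1, 0],
    [ 0, -1,  0,  0, 0],
], dtype=float)

# L1 penalty on the parameters: Omega(A) = sum_ij |A_ij|.
omega_A = np.abs(A).sum()

# Fraction of exactly-zero weights -- the sparsity the penalty encourages.
sparsity = (A == 0).mean()
```

Minimizing `omega_A` as part of the training loss pushes individual entries of `A` toward exactly zero, which is why L1 (unlike L2) produces true zeros rather than merely small weights.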
Representation Sparsity
Equation 7.47 - Linear system with sparse representation:
\[ \begin{array}{c} \begin{bmatrix} -14\\ 1\\ 19\\ 2\\ 23 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 2 & -5 & 4 & 1\\ 4 & 2 & -3 & -1 & 1 & 3\\ -1 & 5 & 4 & 2 & -3 & -2\\ 3 & 1 & 2 & -3 & 0 & -3\\ -5 & 4 & -2 & 2 & -5 & -1 \end{bmatrix} \begin{bmatrix} 0\\ 2\\ 0\\ 0\\ -3\\ 0 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m, \quad B \in \mathbb{R}^{m\times n}, \quad h \in \mathbb{R}^n \end{array} \]
L1 regularization on representation:
\[ \Omega(h) = \|h\|_1 = \sum_{i} |h_i| \]
Equation 7.48 - Loss function with representation regularization:
\[ \tilde{J}(\theta; x, y) = J(\theta; x, y) + \alpha \Omega(h) \]
where \(\alpha \in [0, \infty)\) controls the contribution of the norm penalty to the total loss.
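A sketch of Eq. 7.48 for a tiny one-layer model (all names here — `W`, `v`, the ReLU activation, the squared-error loss — are illustrative assumptions, not specified by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-layer model: representation h = relu(W x),
# scalar prediction v . h, squared-error loss.
W = rng.normal(size=(8, 4))
v = rng.normal(size=8)
x = rng.normal(size=4)
y = 1.0

h = np.maximum(W @ x, 0.0)           # the representation h
J = 0.5 * (v @ h - y) ** 2           # unregularized loss J(theta; x, y)

alpha = 0.1                          # strength of the norm penalty
omega_h = np.abs(h).sum()            # Omega(h) = ||h||_1
J_tilde = J + alpha * omega_h        # Eq. 7.48: regularized loss
```

Note the penalty is a function of the activations `h`, which depend on the input `x`, so unlike weight decay it is recomputed for every example in the batch.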
Key Difference: Parameter vs Representation Regularization
| Type | What is Regularized | Effect | Example |
|---|---|---|---|
| Parameter regularization | Weight matrix \(A\) or \(W\) | Makes weights sparse | L1/L2 weight decay |
| Representation regularization | Activation vector \(h\) | Makes activations sparse | Sparse autoencoders |
Interpretation:
- Parameter sparsity: Few connections between layers (network pruning)
- Representation sparsity: Few neurons active at once (sparse coding)
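The two bullets above can be contrasted concretely (values are illustrative only):

```python
import numpy as np

# Parameter sparsity: the weight matrix itself has many zero entries,
# i.e. few connections between layers.
W_sparse = np.array([[0., 2.,  0.],
                     [0., 0., -1.],
                     [3., 0.,  0.]])

# Representation sparsity: the weights are dense, but for this input
# only one unit is active after the ReLU.
W_dense = np.array([[ 1., -2.,  0.5],
                    [-1.,  1., -0.5],
                    [ 2.,  1.,  1.0]])
x = np.array([1., 1., 0.])
h = np.maximum(W_dense @ x, 0.0)   # -> only the last unit fires
```

The first kind of sparsity is a property of the model alone; the second is a joint property of the model and each particular input.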
Orthogonal Matching Pursuit
Equation 7.49 - Sparse approximation problem:
\[ \arg\min_h \|x - Wh\|^2 \quad \text{subject to} \quad \|h\|_0 < k \]
Goal: Find both \(W\) and \(h\) such that:
- \(Wh\) closely approximates \(x\) (reconstruction accuracy)
- \(h\) is sparse — the number of nonzero elements in \(h\) (denoted \(\|h\|_0\)) is less than \(k\)
Interpretation: Use \(Wh\) to represent \(x\) with only a few active components.
When the columns of \(W\) are orthogonal, this problem can be solved efficiently with the orthogonal matching pursuit (OMP) algorithm.
Key insight: Orthogonality of basis vectors (columns of \(W\)) enables efficient sparse decomposition.
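A minimal numpy sketch of the standard OMP greedy loop (textbook form, not taken from the chapter): repeatedly pick the dictionary column most correlated with the residual, then re-fit the coefficients on the selected columns by least squares.

```python
import numpy as np

def omp(W, x, k):
    """Greedy orthogonal matching pursuit: select at most k columns of W
    whose span best reconstructs x."""
    residual = x.astype(float).copy()
    support = []
    for _ in range(k):
        # Atom most correlated with the current residual.
        j = int(np.argmax(np.abs(W.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares refit on all selected atoms.
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        residual = x - W[:, support] @ coef
    h = np.zeros(W.shape[1])
    h[support] = coef
    return h

# With an orthonormal dictionary, OMP recovers a k-sparse code exactly.
W = np.linalg.qr(np.random.default_rng(1).normal(size=(6, 6)))[0]
h_true = np.zeros(6)
h_true[[1, 4]] = [2.0, -3.0]
x = W @ h_true
h_est = omp(W, x, k=2)
```

When `W` is orthonormal, the correlations `W.T @ residual` are exactly the remaining coefficients, so each iteration identifies one true atom and the residual shrinks to zero after `k` steps — this is the efficiency the orthogonality assumption buys.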
Real-World Applications
Note: The following table was generated by ChatGPT.
| Category | Technique / Model | Representation Type | How it Uses Sparsity / Orthogonality | Real-World Example / Effect |
|---|---|---|---|---|
| 🧠 Neuroscience & Vision | Sparse coding (Olshausen & Field, 1996) | Sparse | The brain represents visual inputs by activating only a few neurons for each stimulus. | Explains receptive fields in V1 visual cortex — neurons respond only to specific edges or orientations. |
| 🧮 Machine Learning | LASSO Regression | Sparse | Adds \(\lambda \|w\|_1\) penalty to make weights sparse, selecting only key features. | Feature selection in predictive models (finance, genomics, etc.). |
| 🧩 Autoencoders | Sparse Autoencoder | Sparse | Adds regularizer \(\Omega(h) = \|h\|_1\) or KL divergence to force sparse activations. | Used in image compression and pretraining deep networks (unsupervised learning). |
| 🖼️ Image Processing / Compression | Dictionary Learning / K-SVD | Sparse | Represent an image as a combination of few learned basis patches (atoms). | JPEG-like compression, denoising, super-resolution. |
| 🔊 Speech Processing | Non-negative Matrix Factorization (NMF) | Sparse + Nonnegative | Decomposes sound spectrograms into few additive components. | Source separation (e.g., separating voice from music). |
| 🧠 Transform-based Compression | DCT (Discrete Cosine Transform) | Orthogonal | Projects signals onto orthogonal cosine basis vectors. | JPEG compression — energy concentrated in few coefficients. |
| 📶 Signal Processing | Fourier Transform | Orthogonal | Represents time-domain signals in orthogonal sinusoidal basis. | Audio, RF, vibration analysis. |
| 🖥️ Dimensionality Reduction | Principal Component Analysis (PCA) | Orthogonal | Finds orthogonal axes (principal components) that maximize variance. | Dimensionality reduction, data visualization. |
| 🎙️ Source Separation | Independent Component Analysis (ICA) | Sparse-like / Orthogonal | Seeks statistically independent (often sparse) components. | Blind source separation (“cocktail party problem”). |
| 🤖 CNN Filters | Orthogonal Regularization | Orthogonal | Forces convolution filters to be orthogonal to prevent redundancy. | Improves training stability and generalization. |
| 💾 Transformers | Embedding Orthogonalization | Orthogonal or Near-orthogonal | Orthogonalizes embedding vectors for better separation in latent space. | Improves language model expressiveness and prevents collapse. |
Source: Deep Learning Book, Chapter 7.10