Chapter 7.10: Sparse Representations

deep learning
regularization
sparsity
sparse coding
Regularizing learned representations rather than model parameters: the difference between parameter sparsity and representation sparsity
Author

Chao Ma

Published

October 27, 2025

Overview

Previous chapters focused on parameter regularization — constraining the weights \(W\) of a model (see Chapter 7.1.2: L1 Regularization and Chapter 7.1.1: L2 Regularization).

This chapter introduces representation regularization — constraining the activations \(h\) (the learned representations).

Key distinction:

  • Parameter sparsity: Makes the weight matrix sparse (few connections)
  • Representation sparsity: Makes the activation vector sparse (few active neurons)

Parameter Regularization

Equation 7.46 - Linear system with parameter matrix:

\[ \begin{array}{c} \begin{bmatrix} 18\\ 5\\ 15\\ -9\\ -3 \end{bmatrix} = \begin{bmatrix} 4 & 0 & 0 & -2 & 0\\ 0 & 0 & -1 & 0 & 3\\ 0 & 5 & 0 & 0 & 0\\ 0 & 0 & 2 & 1 & 0\\ 1 & 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 2\\ 3\\ -2\\ -5\\ 1 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m,\quad A \in \mathbb{R}^{m\times n},\quad x \in \mathbb{R}^n \end{array} \]

Here the parameter matrix \(A\) is sparse (most entries are zero), while the input \(x\) is dense.

L1 regularization on parameters:

\[ \Omega(A) = \|A\|_1 = \sum_{i,j} |A_{ij}| \]

Effect: Encourages sparsity in the parameter matrix \(A\) itself, making many weights zero.
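To make this concrete, here is a minimal NumPy sketch of how an L1 penalty on parameters produces exact zeros. The function names (`l1_penalty`, `soft_threshold`) are mine, not from the book; the soft-thresholding operator is the proximal step that L1-regularized optimizers use to shrink small weights all the way to zero.

```python
import numpy as np

def l1_penalty(A, alpha=0.1):
    """Omega(A) = alpha * sum_{i,j} |A_ij| -- the L1 penalty on parameters."""
    return alpha * np.abs(A).sum()

def soft_threshold(A, t):
    """Proximal step for the L1 penalty: shrink every entry toward zero by t,
    setting entries with |A_ij| <= t exactly to zero. This is the mechanism
    that makes the parameter matrix sparse."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)
```

For example, `soft_threshold(np.array([[0.05, -0.8], [1.2, -0.02]]), 0.1)` zeroes out the two small entries while only slightly shrinking the large ones.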

Representation Sparsity

Equation 7.47 - Linear system with sparse representation:

\[ \begin{array}{c} \begin{bmatrix} -14\\ 1\\ 19\\ 2\\ 23 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 2 & -5 & 4 & 1\\ 4 & 2 & -3 & -1 & 1 & 3\\ -1 & 5 & 4 & 2 & -3 & -2\\ 3 & 1 & 2 & -3 & 0 & -3\\ -5 & 4 & -2 & 2 & -5 & -1 \end{bmatrix} \begin{bmatrix} 0\\ 2\\ 0\\ 0\\ -3\\ 0 \end{bmatrix} \\[6pt] y \in \mathbb{R}^m, \quad B \in \mathbb{R}^{m\times n}, \quad h \in \mathbb{R}^n \end{array} \]

Here the roles are reversed: the matrix \(B\) is dense, while the representation \(h\) is sparse (only two of its six entries are nonzero).

L1 regularization on representation:

\[ \Omega(h) = \|h\|_1 = \sum_{i} |h_i| \]

Equation 7.48 - Loss function with representation regularization:

\[ \tilde{J}(\theta; x, y) = J(\theta; x, y) + \alpha \Omega(h) \]

where \(\alpha \in [0, \infty)\) controls the contribution of the norm penalty to the total loss.
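As a sketch of how Equation 7.48 is used in practice, the snippet below computes the penalized objective for a tiny linear autoencoder: a reconstruction term \(J\) plus \(\alpha \Omega(h)\) on the activations. The names `W_enc`, `W_dec`, and `penalized_loss` are illustrative, not from the book; a ReLU keeps \(h\) nonnegative so the L1 penalty can drive units exactly to zero.

```python
import numpy as np

def penalized_loss(x, W_enc, W_dec, alpha=0.1):
    """Eq. 7.48-style objective: J(theta; x) + alpha * Omega(h),
    where Omega(h) = ||h||_1 penalizes the activations, not the weights."""
    h = np.maximum(W_enc @ x, 0.0)        # representation (activations)
    x_hat = W_dec @ h                     # reconstruction of the input
    J = 0.5 * np.sum((x - x_hat) ** 2)    # data-fit term
    omega = np.sum(np.abs(h))             # Omega(h) = ||h||_1
    return J + alpha * omega, h
```

Note that the gradient of the penalty flows into \(h\) (and through it into the encoder weights), which is exactly what distinguishes representation regularization from weight decay.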

Key Difference: Parameter vs Representation Regularization

| Type | What is regularized | Effect | Example |
|---|---|---|---|
| Parameter regularization | Weight matrix \(A\) or \(W\) | Makes weights sparse | L1/L2 weight decay |
| Representation regularization | Activation vector \(h\) | Makes activations sparse | Sparse autoencoders |

Interpretation:

  • Parameter sparsity: Few connections between layers (network pruning)
  • Representation sparsity: Few neurons active at once (sparse coding)

Orthogonal Matching Pursuit

Equation 7.49 - Sparse approximation problem:

\[ \arg\min_h \|x - Wh\|^2 \quad \text{subject to} \quad \|h\|_0 < k \]

Goal: Find both \(W\) and \(h\) such that:

  1. \(Wh\) closely approximates \(x\) (reconstruction accuracy)
  2. \(h\) is sparse — the number of nonzero elements in \(h\) (denoted \(\|h\|_0\)) is less than \(k\)

Interpretation: Use \(Wh\) to represent \(x\) with only a few active components.

Note: Orthogonal Matching Pursuit

When the columns of \(W\) are constrained to be orthogonal, this problem can be solved efficiently using the orthogonal matching pursuit (OMP) algorithm.

Key insight: Orthogonality of basis vectors (columns of \(W\)) enables efficient sparse decomposition.
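The greedy procedure can be sketched in NumPy as below (the function name `omp` and its interface are mine): at each step, pick the column of \(W\) most correlated with the current residual, refit the coefficients on the selected columns by least squares, and repeat until \(k\) atoms are used.

```python
import numpy as np

def omp(W, x, k):
    """Greedy orthogonal matching pursuit sketch: select at most k columns
    of W to approximate x, refitting coefficients by least squares."""
    residual = x.copy()
    support = []
    for _ in range(k):
        # Column most correlated with the current residual
        j = int(np.argmax(np.abs(W.T @ residual)))
        if j in support:          # no new atom helps; stop early
            break
        support.append(j)
        # Least-squares refit on the selected columns only
        coef, *_ = np.linalg.lstsq(W[:, support], x, rcond=None)
        residual = x - W[:, support] @ coef
    h = np.zeros(W.shape[1])
    h[support] = coef
    return h
```

With orthonormal columns (e.g. `Q` from `np.linalg.qr`), a signal built from two atoms is recovered exactly in two iterations, illustrating why orthogonality makes the decomposition efficient.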

Real-World Applications

Note: The following table is generated by ChatGPT.

| Category | Technique / Model | Representation Type | How it Uses Sparsity / Orthogonality | Real-World Example / Effect |
|---|---|---|---|---|
| 🧠 Neuroscience & Vision | Sparse coding (Olshausen & Field, 1996) | Sparse | The brain represents visual inputs by activating only a few neurons for each stimulus. | Explains receptive fields in V1 visual cortex — neurons respond only to specific edges or orientations. |
| 🧮 Machine Learning | LASSO Regression | Sparse | Adds \(\lambda \|w\|_1\) penalty to make weights sparse, selecting only key features. | Feature selection in predictive models (finance, genomics, etc.). |
| 🧩 Autoencoders | Sparse Autoencoder | Sparse | Adds regularizer \(\Omega(h) = \|h\|_1\) or KL divergence to force sparse activations. | Used in image compression and pretraining deep networks (unsupervised learning). |
| 🖼️ Image Processing / Compression | Dictionary Learning / K-SVD | Sparse | Represents an image as a combination of few learned basis patches (atoms). | JPEG-like compression, denoising, super-resolution. |
| 🔊 Speech Processing | Non-negative Matrix Factorization (NMF) | Sparse + Nonnegative | Decomposes sound spectrograms into few additive components. | Source separation (e.g., separating voice from music). |
| 🧠 Transform-based Compression | DCT (Discrete Cosine Transform) | Orthogonal | Projects signals onto orthogonal cosine basis vectors. | JPEG compression — energy concentrated in few coefficients. |
| 📶 Signal Processing | Fourier Transform | Orthogonal | Represents time-domain signals in an orthogonal sinusoidal basis. | Audio, RF, vibration analysis. |
| 🖥️ Dimensionality Reduction | Principal Component Analysis (PCA) | Orthogonal | Finds orthogonal axes (principal components) that maximize variance. | Dimensionality reduction, data visualization. |
| 🎙️ Source Separation | Independent Component Analysis (ICA) | Sparse-like / Orthogonal | Seeks statistically independent (often sparse) components. | Blind source separation ("cocktail party problem"). |
| 🤖 CNN Filters | Orthogonal Regularization | Orthogonal | Forces convolution filters to be orthogonal to prevent redundancy. | Improves training stability and generalization. |
| 💾 Transformers | Embedding Orthogonalization | Orthogonal or Near-orthogonal | Orthogonalizes embedding vectors for better separation in latent space. | Improves language model expressiveness and prevents collapse. |

Source: Deep Learning Book, Chapter 7.10