Chapter 7.9: Parameter Tying and Parameter Sharing
Overview
Neural networks can benefit from constraints on parameters in two distinct ways:
- Parameter Tying: Encourages parameters to be similar through regularization
- Parameter Sharing: Forces parameters to be identical by design
Both approaches reduce effective model capacity and improve generalization, but they differ fundamentally in how strictly they enforce parameter relationships.
Parameter Tying
Definition
Parameter tying constrains parameters of different models or layers to be similar by adding a penalty term to the loss function.
Mathematical Formulation
\[ \Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2 \]
How it works:
- Add this penalty term (typically scaled by a coefficient \(\lambda\)) to the loss function
- Encourages two models (or layers) to learn similar parameters
- The parameters remain independent; regularization only nudges them toward each other
This is a soft constraint that allows some deviation while encouraging parameter alignment.
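As a minimal NumPy sketch of this soft constraint (the function name `tying_penalty` and coefficient `lam` are illustrative, not from the book), the penalty is just the squared L2 distance between the two weight vectors, added to whatever task loss is being minimized:

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.1):
    """Soft tying penalty: lam * ||w_a - w_b||_2^2."""
    return lam * np.sum((np.asarray(w_a) - np.asarray(w_b)) ** 2)

# The two parameter sets stay independent; the penalty only nudges them together.
w_A = np.array([1.0, 2.0, 3.0])
w_B = np.array([1.5, 2.0, 2.5])

total_loss = 0.0  # the task loss (e.g., cross-entropy) would go here
total_loss += tying_penalty(w_A, w_B, lam=0.1)
```

In a gradient-based framework the gradient of this term pulls \(w^{(A)}\) and \(w^{(B)}\) toward each other at each update, without ever forcing exact equality.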
Real-World Applications
Note: The following table is generated by ChatGPT.
| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| Word Embedding Tying | Input embedding matrix and output softmax matrix tied: \(W_{\text{out}} = E^T\) | Reduce parameters; consistent embedding space | Press & Wolf, 2017 |
| Autoencoder | Decoder weights tied to encoder transpose: \(W_{\text{dec}} = W_{\text{enc}}^T\) | Regularize; stabilize training; mimic PCA | Hinton & Salakhutdinov, 2006 |
| Multi-task Learning | Different tasks’ parameters constrained to be similar: \(\Omega = \|w^{(A)} - w^{(B)}\|^2\) | Encourage knowledge sharing between tasks | Caruana, 1997 |
| Knowledge Distillation | Student layers tied to teacher via loss constraint: \(\|h_s^{(l)} - h_t^{(l)}\|^2\) | Transfer intermediate representations | Sanh et al., 2019 — DistilBERT |
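The embedding-tying entry above can be sketched in a few lines of NumPy (shapes and variable names are illustrative): the output softmax projection is literally the transpose of the input embedding matrix, so the softmax layer adds no new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5, 3
E = rng.normal(size=(vocab, dim))  # input embedding matrix (vocab x dim)

# Tie the output projection to the embedding matrix (Press & Wolf, 2017).
W_out = E.T                        # a view of E: no extra parameters stored

h = rng.normal(size=(dim,))        # a hidden state from the model
logits = h @ W_out                 # one logit per vocabulary word
```

Because `E.T` is a NumPy view rather than a copy, updating `E` during training automatically updates the output projection as well.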
Parameter Sharing
Definition
Parameter sharing uses the exact same set of parameters across multiple locations or time steps.
CNN Example
The same kernel (set of weights) is applied across all spatial locations of the input:
\[ y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v} \]
Benefits:
- Pattern detection: Detects the same pattern (e.g., edge or texture) anywhere in the image
- Parameter reduction: Dramatically reduces the number of parameters
- Translation equivariance: Shifting the input shifts the output by the same amount
This is a hard constraint where parameters are identical by design, not just similar.
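The convolution formula above can be sketched directly as a valid cross-correlation (a deliberately naive loop implementation, not an efficient one): the same kernel `w` is reused at every spatial position, so a 2×2 kernel covers the whole image with only 4 parameters.

```python
import numpy as np

def conv2d(x, w):
    """Valid cross-correlation: y[i,j] = sum_{u,v} w[u,v] * x[i+u, j+v].

    The same kernel w is applied at every (i, j) — hard parameter sharing.
    """
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + kH, j:j + kW])
    return out

x = np.arange(16.0).reshape(4, 4)          # a toy 4x4 "image"
w = np.array([[1.0, 0.0], [0.0, -1.0]])    # a tiny diagonal-difference kernel
y = conv2d(x, w)                           # 3x3 output from just 4 shared weights
```

On this particular input every diagonal difference is the same, so every output entry equals −5.0; the point is that all 9 output values were produced by the same 4 weights.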
Real-World Applications
Note: The following table is generated by ChatGPT.
| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| CNN | Same kernel slides across all spatial positions: \(y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v}\) | Detect same pattern anywhere; reduce parameters; translation equivariance | LeCun et al., 1998 — LeNet |
| RNN / LSTM / GRU | Same weights used at each time step: \(h_t = f(W_h h_{t-1} + W_x x_t)\) | Temporal consistency; handle variable-length sequences | Hochreiter & Schmidhuber, 1997 |
| Transformer (ALBERT) | All encoder layers share parameters: \(\theta_1 = \theta_2 = \dots = \theta_L\) | Reduce memory; efficient deep sharing | Lan et al., 2020 — ALBERT |
| Siamese / Twin Networks | Two (or more) branches share all parameters: \(f_\theta(x_1), f_\theta(x_2)\) | Compare similarity; representation consistency | Bromley et al., 1993 — Siamese Nets |
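The RNN row above illustrates sharing over time rather than space. A minimal sketch (function and variable names are illustrative) shows that the same `W_h` and `W_x` are applied at every step, so the parameter count is independent of sequence length:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """Vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t), same weights at every step."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

rng = np.random.default_rng(1)
d_h, d_x = 4, 3
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
h0 = np.zeros(d_h)

# The same weight matrices handle sequences of any length.
short_seq = [rng.normal(size=d_x) for _ in range(2)]
long_seq = [rng.normal(size=d_x) for _ in range(10)]
h_short = rnn_forward(short_seq, W_h, W_x, h0)
h_long = rnn_forward(long_seq, W_h, W_x, h0)
```

This is exactly why RNNs can process variable-length sequences: no new parameters are introduced as the sequence grows, unlike a feedforward network with one weight matrix per position.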
Source: Deep Learning Book, Chapter 7.9