Chapter 7.9: Parameter Tying and Parameter Sharing

deep learning
regularization
parameter sharing
CNN
RNN
Two strategies for reducing parameters and improving generalization: encouraging similarity vs. enforcing identity
Author

Chao Ma

Published

October 27, 2025

Overview

Neural networks can benefit from constraints on parameters in two distinct ways:

  1. Parameter Tying: Encourages parameters to be similar through regularization
  2. Parameter Sharing: Forces parameters to be identical by design

Both approaches reduce effective model capacity and improve generalization, but they differ fundamentally in how strictly they enforce parameter relationships.

Parameter Tying

Definition

Parameter tying constrains parameters of different models or layers to be similar by adding a penalty term to the loss function.

Mathematical Formulation

\[ \Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2 \]

How it works:

  • Add this penalty term to the loss function
  • Forces two models (or layers) to learn similar parameters
  • The parameters are still independent, but regularization encourages similarity
Note: Nature of Constraint

This is a soft constraint that allows some deviation while encouraging parameter alignment.
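As a concrete sketch of the penalty above (using NumPy; the function names `tying_penalty` and `tying_grad` are illustrative, not from the text), note that the gradient of the penalty pulls \(w^{(A)}\) toward \(w^{(B)}\) without ever forcing them to be equal:

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.1):
    """Scaled squared L2 distance between two parameter vectors:
    lam * ||w_a - w_b||^2, added to the task loss."""
    diff = w_a - w_b
    return lam * np.sum(diff ** 2)

def tying_grad(w_a, w_b, lam=0.1):
    """Gradient of the penalty w.r.t. w_a: a pull toward w_b."""
    return 2.0 * lam * (w_a - w_b)

w_a = np.array([1.0, 2.0, 3.0])
w_b = np.array([1.0, 2.5, 2.0])
print(tying_penalty(w_a, w_b))  # 0.1 * (0.0 + 0.25 + 1.0) = 0.125
```

During training, `tying_grad` would simply be added to the task-loss gradient for each model, so the two parameter sets drift toward each other while still fitting their own data.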

Real-World Applications

Note: The following table is generated by ChatGPT.

| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| Word Embedding Tying | Input embedding matrix and output softmax matrix tied: \(W_{\text{out}} = E^T\) | Reduce parameters; consistent embedding space | Press & Wolf, 2017 |
| Autoencoder | Decoder weights tied to encoder transpose: \(W_{\text{dec}} = W_{\text{enc}}^T\) | Regularize; stabilize training; mimic PCA | Hinton & Salakhutdinov, 2006 |
| Multi-task Learning | Different tasks' parameters constrained to be similar: \(\Omega = \|w^{(A)} - w^{(B)}\|^2\) | Encourage knowledge sharing between tasks | Caruana, 1997 |
| Knowledge Distillation | Student layers tied to teacher via loss constraint: \(\|h_s^{(l)} - h_t^{(l)}\|^2\) | Transfer intermediate representations | Sanh et al., 2019 — DistilBERT |
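The tied-autoencoder row can be sketched in a few lines (NumPy; the shapes and the name `tied_autoencoder` are illustrative). Because a single matrix `W` is reused for both the encoder and the transposed decoder, gradients from both passes accumulate in the same parameters:

```python
import numpy as np

def tied_autoencoder(x, W):
    """Autoencoder with tied weights: the decoder is the transpose
    of the encoder (W_dec = W_enc^T), so only one matrix is stored."""
    z = np.tanh(W.T @ x)   # encode: input dim -> code dim
    return W @ z           # decode with the same W, transposed back

rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((8, 3))  # 8-dim input, 3-dim code
x = rng.standard_normal(8)
x_hat = tied_autoencoder(x, W)         # reconstruction, same shape as x
```

Storing one matrix instead of two roughly halves the parameter count of the autoencoder and acts as a regularizer, consistent with the table's "mimic PCA" note.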

Parameter Sharing

Definition

Parameter sharing uses the exact same set of parameters across multiple locations or time steps.

CNN Example

The same kernel (set of weights) is applied across all spatial locations of the input:

\[ y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v} \]

Benefits:

  1. Pattern detection: Detects the same pattern (e.g., edge or texture) anywhere in the image
  2. Parameter reduction: Dramatically reduces the number of parameters
  3. Translation equivariance: Shifting the input produces a correspondingly shifted output
Important: Nature of Constraint

This is a hard constraint where parameters are identical by design, not just similar.
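The convolution formula and the equivariance claim above can be checked directly (NumPy sketch; `conv2d_valid` is an illustrative helper, not a library function). The same 3×3 kernel is applied at every location, and shifting the input rows shifts the output rows by the same amount:

```python
import numpy as np

def conv2d_valid(x, w):
    """y[i,j] = sum_{u,v} w[u,v] * x[i+u, j+v], valid padding.
    The SAME kernel w is reused at every spatial position."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + kH, j:j + kW])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))

# Shift the input down by one row; away from the wrapped border,
# the output is the original output shifted down by one row.
x_shift = np.roll(x, 1, axis=0)
y = conv2d_valid(x, w)
y_shift = conv2d_valid(x_shift, w)
assert np.allclose(y_shift[1:], y[:-1])  # translation equivariance
```

Note the parameter count: the 3×3 kernel has 9 weights regardless of image size, whereas a fully connected layer mapping a 6×6 input to the 4×4 output would need 6·6·4·4 = 576.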

Real-World Applications

Note: The following table is generated by ChatGPT.

| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| CNN | Same kernel slides across all spatial positions: \(y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v}\) | Detect same pattern anywhere; reduce parameters; translation equivariance | LeCun et al., 1998 — LeNet |
| RNN / LSTM / GRU | Same weights used at each time step: \(h_t = f(W_h h_{t-1} + W_x x_t)\) | Temporal consistency; handle variable-length sequences | Hochreiter & Schmidhuber, 1997 |
| Transformer (ALBERT) | All encoder layers share parameters: \(\theta_1 = \theta_2 = \dots = \theta_L\) | Reduce memory; efficient deep sharing | Lan et al., 2020 — ALBERT |
| Siamese / Twin Networks | Two (or more) branches share all parameters: \(f_\theta(x_1), f_\theta(x_2)\) | Compare similarity; representation consistency | Bromley et al., 1993 — Siamese Nets |
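The RNN row illustrates why sharing enables variable-length inputs: the recurrence \(h_t = f(W_h h_{t-1} + W_x x_t)\) reuses the same \(W_h, W_x\) at every step, so the unrolled network has a fixed parameter count for any sequence length. A minimal sketch (NumPy; shapes and the name `rnn_forward` are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """Unroll a vanilla RNN. The SAME W_h and W_x are applied at
    every time step, so any sequence length works."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)
        hs.append(h)
    return hs

rng = np.random.default_rng(1)
W_h = 0.1 * rng.standard_normal((4, 4))   # hidden-to-hidden, reused each step
W_x = 0.1 * rng.standard_normal((4, 3))   # input-to-hidden, reused each step
h0 = np.zeros(4)

seq_short = [rng.standard_normal(3) for _ in range(5)]
seq_long = seq_short + [rng.standard_normal(3) for _ in range(5)]

# The same parameters process both sequences; shared prefixes give
# identical hidden states step for step.
hs_short = rnn_forward(seq_short, W_h, W_x, h0)
hs_long = rnn_forward(seq_long, W_h, W_x, h0)
assert all(np.allclose(a, b) for a, b in zip(hs_short, hs_long))
```

A per-timestep weight matrix would instead make the parameter count grow with the longest sequence seen and rule out generalization to longer inputs.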

Source: Deep Learning Book, Chapter 7.9