Chapter 7.9: Parameter Tying and Parameter Sharing
Overview
Neural networks can benefit from constraints on parameters in two distinct ways:
- Parameter Tying: Encourages parameters to be similar through regularization
- Parameter Sharing: Forces parameters to be identical by design
Both approaches reduce effective model capacity and improve generalization, but they differ fundamentally in how strictly they enforce parameter relationships.
Parameter Tying
Definition
Parameter tying constrains parameters of different models or layers to be similar by adding a penalty term to the loss function.
Mathematical Formulation
\[ \Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2 \]
How it works:
- Add this penalty term (typically scaled by a coefficient \(\lambda\)) to the loss function
- Encourages two models (or layers) to learn similar parameters
- The parameters remain independent; regularization only nudges them toward each other
This is a soft constraint that allows some deviation while encouraging parameter alignment.
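As a minimal NumPy sketch of this soft constraint (the function name `tying_penalty` and coefficient `lam` are illustrative, not from the book), the penalty is just the squared L2 distance between the two weight vectors, added to whatever task loss is being minimized:

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.1):
    """Soft tying penalty: lam * ||w_a - w_b||_2^2."""
    return lam * np.sum((np.asarray(w_a) - np.asarray(w_b)) ** 2)

# The two parameter sets stay independent; the penalty only nudges them together.
w_A = np.array([1.0, 2.0, 3.0])
w_B = np.array([1.5, 2.0, 2.5])

total_loss = 0.0  # the task loss (e.g., cross-entropy) would go here
total_loss += tying_penalty(w_A, w_B, lam=0.1)
```

In a gradient-based framework the gradient of this term pulls \(w^{(A)}\) and \(w^{(B)}\) toward each other at each update, without ever forcing exact equality.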
Real-World Applications
Note: The following table is generated by ChatGPT.
| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| Word Embedding Tying | Input embedding matrix and output softmax matrix tied: \(W_{\text{out}} = E^T\) | Reduce parameters; consistent embedding space | Press & Wolf, 2017 |
| Autoencoder | Decoder weights tied to encoder transpose: \(W_{\text{dec}} = W_{\text{enc}}^T\) | Regularize; stabilize training; mimic PCA | Hinton & Salakhutdinov, 2006 |
| Multi-task Learning | Different tasks’ parameters constrained to be similar: \(\Omega = \|w^{(A)} - w^{(B)}\|^2\) | Encourage knowledge sharing between tasks | Caruana, 1997 |
| Knowledge Distillation | Student layers tied to teacher via loss constraint: \(\|h_s^{(l)} - h_t^{(l)}\|^2\) | Transfer intermediate representations | Sanh et al., 2019 — DistilBERT |
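The embedding-tying entry above can be sketched in a few lines of NumPy (shapes and variable names are illustrative): the output softmax projection is literally the transpose of the input embedding matrix, so the softmax layer adds no new parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5, 3
E = rng.normal(size=(vocab, dim))  # input embedding matrix (vocab x dim)

# Tie the output projection to the embedding matrix (Press & Wolf, 2017).
W_out = E.T                        # a view of E: no extra parameters stored

h = rng.normal(size=(dim,))        # a hidden state from the model
logits = h @ W_out                 # one logit per vocabulary word
```

Because `E.T` is a NumPy view rather than a copy, updating `E` during training automatically updates the output projection as well.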
Parameter Sharing
Definition
Parameter sharing uses the exact same set of parameters across multiple locations or time steps.
CNN Example
The same kernel (set of weights) is applied across all spatial locations of the input:
\[ y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v} \]
Benefits:
- Pattern detection: Detects the same pattern (e.g., edge or texture) anywhere in the image
- Parameter reduction: Dramatically reduces the number of parameters
- Translation equivariance: Shifting the input shifts the output by the same amount
This is a hard constraint where parameters are identical by design, not just similar.
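The convolution formula above can be sketched directly as a valid cross-correlation (a deliberately naive loop implementation, not an efficient one): the same kernel `w` is reused at every spatial position, so a 2×2 kernel covers the whole image with only 4 parameters.

```python
import numpy as np

def conv2d(x, w):
    """Valid cross-correlation: y[i,j] = sum_{u,v} w[u,v] * x[i+u, j+v].

    The same kernel w is applied at every (i, j) — hard parameter sharing.
    """
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + kH, j:j + kW])
    return out

x = np.arange(16.0).reshape(4, 4)          # a toy 4x4 "image"
w = np.array([[1.0, 0.0], [0.0, -1.0]])    # a tiny diagonal-difference kernel
y = conv2d(x, w)                           # 3x3 output from just 4 shared weights
```

On this particular input every diagonal difference is the same, so every output entry equals −5.0; the point is that all 9 output values were produced by the same 4 weights.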
Real-World Applications
Note: The following table is generated by ChatGPT.
| Application | Mechanism | Purpose | Reference |
|---|---|---|---|
| CNN | Same kernel slides across all spatial positions: \(y_{i,j} = \sum_{u,v} w_{u,v} x_{i+u, j+v}\) | Detect same pattern anywhere; reduce parameters; translation equivariance | LeCun et al., 1998 — LeNet |
| RNN / LSTM / GRU | Same weights used at each time step: \(h_t = f(W_h h_{t-1} + W_x x_t)\) | Temporal consistency; handle variable-length sequences | Hochreiter & Schmidhuber, 1997 |
| Transformer (ALBERT) | All encoder layers share parameters: \(\theta_1 = \theta_2 = \dots = \theta_L\) | Reduce memory; efficient deep sharing | Lan et al., 2020 — ALBERT |
| Siamese / Twin Networks | Two (or more) branches share all parameters: \(f_\theta(x_1), f_\theta(x_2)\) | Compare similarity; representation consistency | Bromley et al., 1993 — Siamese Nets |
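The RNN row above illustrates sharing over time rather than space. A minimal sketch (function and variable names are illustrative) shows that the same `W_h` and `W_x` are applied at every step, so the parameter count is independent of sequence length:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    """Vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t), same weights at every step."""
    h = h0
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

rng = np.random.default_rng(1)
d_h, d_x = 4, 3
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
h0 = np.zeros(d_h)

# The same weight matrices handle sequences of any length.
short_seq = [rng.normal(size=d_x) for _ in range(2)]
long_seq = [rng.normal(size=d_x) for _ in range(10)]
h_short = rnn_forward(short_seq, W_h, W_x, h0)
h_long = rnn_forward(long_seq, W_h, W_x, h0)
```

This is exactly why RNNs can process variable-length sequences: no new parameters are introduced as the sequence grows, unlike a feedforward network with one weight matrix per position.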
Source: Deep Learning Book, Chapter 7.9