Chapter 10.7: The Challenge of Long-Term Dependencies

Deep Learning
RNN
Vanishing Gradients
Exploding Gradients
Author: Chao Ma

Published: December 10, 2025

The fundamental challenge of long-term dependencies is not representational capacity but the difficulty of training recurrent networks.

During backpropagation through time, gradients must be propagated across many time steps, so they tend to either shrink toward zero (vanishing gradients) or grow without bound (exploding gradients).

Figure: Nonlinear function composition effects

The figure shows how repeatedly applying the same nonlinear function drives the composed function into saturation over most of its input range while amplifying it steeply in a few narrow regions. Although the x-axis is not time, these compositions mirror the effect of many RNN time steps and explain why gradients vanish or explode when learning long-term dependencies.
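
As a rough numerical sketch of this effect (not taken from the chapter), the snippet below repeatedly composes the scalar map \(f(x) = \tanh(w x)\) and tracks the slope of the composition via the chain rule; the weight \(w = 2.5\) and the input grid are arbitrary choices. At the unstable fixed point \(x = 0\) the slope grows like \(w^{t}\), while inputs that saturate the tanh end up with slopes collapsing toward zero.

```python
import numpy as np

w = 2.5                                  # assumed weight; large enough to saturate tanh
xs = np.linspace(-1.0, 1.0, 5)           # a few sample inputs (the axis is not time)

for t in (1, 5, 20):
    y = xs.copy()
    slope = np.ones_like(xs)
    for _ in range(t):
        # chain rule for one more application of f(x) = tanh(w * x)
        slope *= w * (1.0 - np.tanh(w * y) ** 2)
        y = np.tanh(w * y)
    print(f"depth {t:2d}: values {np.round(y, 3)}, slopes {np.round(slope, 3)}")
```

After 20 compositions the values are pinned near the saturation plateaus with near-zero slopes, except at \(x = 0\), where the slope has grown to roughly \(2.5^{20}\).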


Mathematical Analysis

Consider a simplified recurrence with no inputs and no nonlinearity:

\[ h^{t} = W h^{t-1} \tag{10.36} \]

The hidden state after \(t\) steps is then the result of applying the same linear transformation \(t\) times:

\[ h^{t} = W^{t} h^{0} \tag{10.37} \]

showing that long-term behavior is governed by powers of the transition matrix.
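
A minimal sketch of equations (10.36)-(10.37), assuming nothing beyond a random matrix rescaled by hand: it iterates the recurrence and prints the norm of the state after many steps for a spectral radius just below and just above one.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_norm(W, h0, steps=100):
    """Apply h_t = W h_{t-1} for `steps` iterations and return ||h_steps||."""
    h = h0
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

A = rng.standard_normal((8, 8))
A /= np.abs(np.linalg.eigvals(A)).max()   # rescale so the spectral radius is exactly 1
h0 = rng.standard_normal(8)

print("radius 0.9:", final_norm(0.9 * A, h0))   # norm decays toward zero
print("radius 1.1:", final_norm(1.1 * A, h0))   # norm grows very large
```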

Suppose \(W\) admits an eigendecomposition with orthogonal \(Q\):

\[ W = Q\Lambda Q^{\top} \tag{10.38} \]

In the eigenbasis, the repeated dynamics become

\[ h^{t} = Q \Lambda^{t} Q^{\top} h^{0} \tag{10.39} \]

Each eigenvalue raised to the \(t\)-th power shrinks toward zero when its magnitude is below one (vanishing) and grows without bound when it is above one (exploding), which explains the instability over long time horizons.

Figure: Eigenvalue dynamics over time
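
The sketch below makes the eigenvalue picture concrete with a hand-built symmetric \(W\); the eigenvalues 0.5, 0.9, and 1.1 are assumptions chosen for illustration. It verifies that \(W^{t} h^{0}\) matches \(Q \Lambda^{t} Q^{\top} h^{0}\) and prints how each \(\lambda^{t}\) evolves as \(t\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

eigvals = np.array([0.5, 0.9, 1.1])               # assumed eigenvalues, for illustration
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal matrix
W = Q @ np.diag(eigvals) @ Q.T                    # W = Q Lambda Q^T, as in (10.38)

h0 = rng.standard_normal(3)
for t in (1, 10, 50):
    direct = np.linalg.matrix_power(W, t) @ h0      # W^t h_0, as in (10.37)
    via_eig = Q @ np.diag(eigvals ** t) @ Q.T @ h0  # Q Lambda^t Q^T h_0, as in (10.39)
    assert np.allclose(direct, via_eig)
    print(f"t={t:2d}  lambda^t={np.round(eigvals ** t, 4)}  ||h_t||={np.linalg.norm(direct):.4f}")
```

By \(t = 50\), the components along the eigenvectors with \(\lambda = 0.5\) and \(\lambda = 0.9\) have effectively vanished, while the component along \(\lambda = 1.1\) dominates the norm.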