Chapter 10.5: Deep Recurrent Networks

Categories: Deep Learning, RNN, Sequence Modeling, Deep RNN

Author: Chao Ma

Published: December 9, 2025

Deep recurrent networks extend basic RNNs by introducing depth through multiple layers of computation. The literature on deep RNNs describes three main architectural patterns for adding depth:


1. Hierarchical Hidden States

The hidden state at each time step is decomposed into multiple hierarchical layers, forming a vertical stack of RNN cells.

At time step \(t\), the computation proceeds through layers:

\[ \begin{align*} h_t^{(1)} &= f_1(h_{t-1}^{(1)}, x_t) \\ h_t^{(2)} &= f_2(h_{t-1}^{(2)}, h_t^{(1)}) \\ &\vdots \\ h_t^{(L)} &= f_L(h_{t-1}^{(L)}, h_t^{(L-1)}) \\ \end{align*} \]

Each layer \(\ell\) receives:

- the previous hidden state from the same layer, \(h_{t-1}^{(\ell)}\)
- the current hidden state from the layer below, \(h_t^{(\ell-1)}\)

This creates a 2D grid of computations over time (horizontal) and depth (vertical).

Key benefit: Multiple levels of temporal abstraction—lower layers capture fine-grained patterns, upper layers capture longer-range dependencies.

Figure: Hierarchical hidden states.
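Below is a minimal sketch of this pattern, assuming PyTorch; the class name `StackedRNN`, the plain `nn.RNNCell`, and all sizes are illustrative choices rather than a prescribed implementation. The stack is written out explicitly so the time (horizontal) and depth (vertical) loops are visible.

```python
import torch
import torch.nn as nn

class StackedRNN(nn.Module):
    """Hierarchical hidden states: a vertical stack of RNN cells (sketch)."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        # Layer 0 reads the input x_t; every higher layer reads the layer below.
        self.cells = nn.ModuleList([
            nn.RNNCell(input_size if l == 0 else hidden_size, hidden_size)
            for l in range(num_layers)
        ])

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h = [x.new_zeros(x.size(1), cell.hidden_size) for cell in self.cells]
        outputs = []
        for x_t in x:                                # horizontal axis: time
            inp = x_t
            for l, cell in enumerate(self.cells):    # vertical axis: depth
                h[l] = cell(inp, h[l])               # uses h_{t-1}^{(l)} and h_t^{(l-1)}
                inp = h[l]
            outputs.append(h[-1])                    # top-layer state at each step
        return torch.stack(outputs), h
```

In practice the same vertical stacking is available directly from `nn.RNN` (or `nn.LSTM`/`nn.GRU`) through the `num_layers` argument.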

2. Deep Transition RNN

Instead of a single affine transformation followed by an activation function, each of the core RNN transformations is implemented as a multilayer perceptron (MLP).

The three key transformations become:

  1. Input-to-hidden: \(\text{MLP}_x(x_t)\)
  2. Hidden-to-hidden (recurrent): \(\text{MLP}_h(h_{t-1})\)
  3. Hidden-to-output: \(\text{MLP}_o(h_t)\)

Each MLP is a small feedforward network with its own hidden layers, adding depth within each time step.

Key benefit: Richer transformations at each step—capable of learning more complex nonlinear mappings between successive hidden states.

Figure: Deep transition RNN.
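A minimal sketch of a deep transition cell follows, again assuming PyTorch; the helper `mlp`, its two-layer depth, and the `Tanh` nonlinearities are illustrative assumptions, not a fixed recipe.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, width=128):
    """Small feedforward block used for each core transformation (illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, width), nn.Tanh(),
                         nn.Linear(width, out_dim))

class DeepTransitionCell(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.mlp_x = mlp(input_size, hidden_size)    # input-to-hidden
        self.mlp_h = mlp(hidden_size, hidden_size)   # hidden-to-hidden (recurrent)
        self.mlp_o = mlp(hidden_size, output_size)   # hidden-to-output

    def forward(self, x_t, h_prev):
        # Depth is added *within* the time step by the MLPs above.
        h_t = torch.tanh(self.mlp_x(x_t) + self.mlp_h(h_prev))
        return h_t, self.mlp_o(h_t)
```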

3. Deep Transition RNN with Skip Connections

This architecture extends the deep transition RNN by adding skip connections (residual connections) around the deep MLP blocks.

At each time step, the input to an MLP is added directly to its output:

\[ h_t = \text{MLP}_h(h_{t-1}) + h_{t-1} \]

This follows the residual learning principle popularized by ResNets.

Key benefit: Skip connections enable gradient flow through very deep networks by providing shortcut paths that bypass multiple nonlinear transformations. This mitigates vanishing gradients and allows training of much deeper RNN architectures.

Figure: Deep transition RNN with skip connections.
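The sketch below, assuming PyTorch, wraps each MLP block of the transition in a residual connection as in the equation above; folding the input in through a linear projection and the number of blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualTransitionCell(nn.Module):
    """Deep transition step with skip connections around each MLP block (sketch)."""

    def __init__(self, input_size, hidden_size, num_blocks=2):
        super().__init__()
        self.in_proj = nn.Linear(input_size, hidden_size)   # illustrative way to fold in x_t
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                          nn.Linear(hidden_size, hidden_size))
            for _ in range(num_blocks)
        ])

    def forward(self, x_t, h_prev):
        h = h_prev + self.in_proj(x_t)
        for block in self.blocks:
            h = block(h) + h        # h <- MLP_h(h) + h, the shortcut path
        return h
```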

Comparison and Trade-offs

| Architecture | Depth Location | Parameters | Gradient Flow | Use Case |
|---|---|---|---|---|
| Hierarchical | Vertical stacking | High | Standard BPTT | Multi-scale temporal patterns |
| Deep Transition | Within transformations | Very high | Challenging | Complex state transitions |
| Deep Transition + Skip | Within transformations | Very high | Improved | Very deep networks |

General principle: Adding depth to RNNs increases expressiveness but also increases the risk of optimization difficulties. Skip connections are essential for training very deep recurrent architectures.


These three patterns can be combined—e.g., hierarchical RNNs where each layer uses deep transitions with skip connections—to build highly expressive sequence models at the cost of increased computational requirements.
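As a rough illustration of such a combination, the following sketch (assuming PyTorch, with all names and sizes illustrative) stacks layers vertically while giving each layer a residual MLP transition.

```python
import torch
import torch.nn as nn

class DeepResidualStackedRNN(nn.Module):
    """Vertical stack (pattern 1) with residual MLP transitions (patterns 2 and 3)."""

    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.in_projs = nn.ModuleList([
            nn.Linear(input_size if l == 0 else hidden_size, hidden_size)
            for l in range(num_layers)
        ])
        self.transitions = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                          nn.Linear(hidden_size, hidden_size))
            for _ in range(num_layers)
        ])

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h = [x.new_zeros(x.size(1), p.out_features) for p in self.in_projs]
        outputs = []
        for x_t in x:
            inp = x_t
            for l in range(len(h)):
                # residual deep transition for layer l, fed by the layer below
                h[l] = h[l] + torch.tanh(self.transitions[l](h[l]) + self.in_projs[l](inp))
                inp = h[l]
            outputs.append(h[-1])
        return torch.stack(outputs)
```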