Deep Learning Book 6.4: Architecture Design - Depth vs Width

Categories: Deep Learning, Neural Networks, Architecture Design

Author: Chao Ma

Published: September 30, 2025

This recap of Deep Learning Chapter 6.4 explores how network architecture—depth versus width—fundamentally shapes what neural networks can learn and how efficiently they learn it.

📓 For a deeper dive with additional exercises and analysis, see the complete notebook on GitHub: https://github.com/ickma2311/foundations/blob/main/deep_learning/chapter6/6.4/exercises.ipynb

The Architecture Question: Deep or Wide?

When designing a neural network, one of the most fundamental decisions is choosing between depth (many layers) and width (many units per layer). Should you build a shallow network with many units, or a deep network with fewer units per layer?

The answer reveals something profound about how neural networks represent functions: for some families of functions, deep networks can be exponentially more expressive than shallow networks with the same number of parameters. This is not just theoretical; it has practical implications for model efficiency and performance.

Quick Reference: Understanding Depth vs Width

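For context on the fundamental concepts of network architecture, see the Architecture Design summary: https://github.com/ickma2311/foundations/blob/main/deep_learning/chapter6/6.4/architecture_design.md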

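Key insight: A deep ReLU network with $n$ units per layer and depth $L$ can carve the input space into a number of linear regions that grows exponentially with depth, on the order of $\mathcal{O}(n^L)$ for a one-dimensional input. A network with a single hidden layer of $m$ units produces at most $m + 1$ linear regions on a one-dimensional input, so it would need $\mathcal{O}(n^L)$ units in that one layer to achieve the same expressiveness.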
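The exponential growth can be checked numerically with a small construction. The sketch below is not from the book or the notebook; it assumes a standard "tent" layer built from two ReLU units, composes it `depth` times on the interval [0, 1], and counts the linear pieces of the result. The names `tent` and `count_linear_pieces` are hypothetical helpers chosen for this illustration.

```python
def relu(z):
    return max(z, 0.0)

def tent(x):
    # One layer with two ReLU units: equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(depth, grid=4096):
    """Count linear pieces of tent composed `depth` times on [0, 1].

    Assumes depth <= 12 so that every breakpoint (a multiple of 1 / 2**depth)
    lies exactly on the dyadic grid; all arithmetic is then exact in floats.
    """
    ys = []
    for i in range(grid + 1):
        y = i / grid
        for _ in range(depth):
            y = tent(y)
        ys.append(y)
    # Slope on each grid cell; a new linear piece starts wherever the slope changes.
    slopes = [(ys[i + 1] - ys[i]) * grid for i in range(grid)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1

for depth in range(1, 6):
    print(f"depth={depth}: {count_linear_pieces(depth)} linear pieces")
```

With two units per layer, each extra layer doubles the number of pieces, so depth $L$ gives $2^L$ pieces from only $2L$ hidden units. A single hidden layer would need about $2^L$ units to produce as many pieces on the same interval.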

| Architecture | Characteristic | Advantage | Challenge |
|--------------|----------------|-----------|-----------|
| Deep (many layers) | Hierarchical feature reuse | Exponential expressiveness with fewer parameters | Harder to optimize (vanishing/exploding gradients) |
| Wide (many units per layer) | Increased capacity per layer | Easier optimization | Parameter inefficient; requires exponentially more units |
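The "harder to optimize" entry for deep networks can be made concrete with a toy calculation. The sketch below assumes a deliberately simplified chain of scalar linear layers, each multiplying its input by the same weight, so the end-to-end gradient is just that weight raised to the depth. `chain_gradient` is a hypothetical helper, not part of the original experiment.

```python
def chain_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked scalar layers h -> weight * h."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # chain rule: multiply by each layer's local derivative
    return grad

for depth in (1, 10, 50):
    small = chain_gradient(0.5, depth)   # factors below 1 shrink the gradient
    large = chain_gradient(1.5, depth)   # factors above 1 blow it up
    print(f"depth={depth:>2}  w=0.5 -> {small:.3e}   w=1.5 -> {large:.3e}")
```

Real networks have matrices and nonlinearities rather than a single repeated scalar, but the same multiplicative effect is what makes gradients vanish or explode as depth grows.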

🔬 Experiment: Shallow vs Deep Network Comparison

Let’s explore whether depth provides an advantage in practice by comparing two networks:

- Shallow Network: 1 hidden layer with 128 units
- Deep Network: 2 hidden layers (16 → 8 units), followed by a linear output layer

Both networks are trained on the same regression task: $y = \sin^2(x) + x^3$
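Because the input here is one-dimensional, there is a simple upper bound on how many linear pieces each ReLU architecture can represent: every hidden unit can add at most one new breakpoint inside each existing piece, so the bound is the product of (width + 1) over the hidden layers. The sketch below applies that bound to hidden widths 128 and 16 → 8; `max_linear_pieces_1d` is a hypothetical helper, and the bound is a ceiling, not what a trained network actually reaches.

```python
from math import prod

def max_linear_pieces_1d(hidden_widths):
    """Upper bound on linear pieces of a ReLU MLP with scalar input."""
    return prod(width + 1 for width in hidden_widths)

shallow_bound = max_linear_pieces_1d([128])    # one hidden layer of 128 units
deep_bound = max_linear_pieces_1d([16, 8])     # hidden layers of 16 and 8 units

print(f"Shallow (128):  at most {shallow_bound} pieces")
print(f"Deep (16 -> 8): at most {deep_bound} pieces")
```

Under this bound the deep model can represent at least as many pieces as the shallow one (153 versus 129) while using far fewer hidden units, which is the trade-off the experiment probes.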

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Configure plotting
plt.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor'] = 'white'
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("✓ Setup complete")

✓ Setup complete

Step 1: Generate Training and Test Data

# Training data
x_train = np.random.rand(200, 1)
y_train = np.square(np.sin(x_train)) + np.power(x_train, 3)

# Test data
x_test = np.random.rand(100, 1)
y_test = np.square(np.sin(x_test)) + np.power(x_test, 3)

# Convert to float32 PyTorch tensors
x_train_tensor = torch.tensor(x_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
x_test_tensor = torch.tensor(x_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")
print(f"Input range: [{x_train.min():.2f}, {x_train.max():.2f}]")
print(f"Target range: [{y_train.min():.2f}, {y_train.max():.2f}]")

Training samples: 200
Test samples: 100
Input range: [0.01, 0.99]
Target range: [0.00, 1.66]

Step 2: Define Model Architectures

# Shallow model: 1 hidden layer with 128 units
shallow_model = nn.Sequential(
    nn.Linear(1, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)

# Deep model: 2 hidden layers (16 → 8 units), then a linear output layer
deep_model = nn.Sequential(
    nn.Linear(1, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1)
)

print("✓ Models created")
print("\nShallow model architecture:")
print(shallow_model)
print("\nDeep model architecture:")
print(deep_model)

✓ Models created

Shallow model architecture:
Sequential(
  (0): Linear(in_features=1, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=1, bias=True)
)

Deep model architecture:
Sequential(
  (0): Linear(in_features=1, out_features=16, bias=True)
  (1): ReLU()
  (2): Linear(in_features=16, out_features=8, bias=True)
  (3): ReLU()
  (4): Linear(in_features=8, out_features=1, bias=True)
)

Step 3: Count Parameters

How many trainable parameters does each architecture use?

def count_parameters(model):
    """Count total trainable parameters in a model"""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

shallow_params = count_parameters(shallow_model)
deep_params = count_parameters(deep_model)

print("Parameter Counts:")
print("-" * 50)
print(f"Shallow model (1 layer × 128 units): {shallow_params:,} parameters")
print(f"Deep model (3 layers):                {deep_params:,} parameters")
print("-" * 50)
print(f"Ratio (shallow/deep): {shallow_params/deep_params:.2f}x")

# Visualize parameter counts
fig, ax = plt.subplots(figsize=(8, 5))
# Tick labels, kept separate from the `models` dict defined in Step 4
labels = ['Shallow\n(1×128)', 'Deep\n(16→8)']
params = [shallow_params, deep_params]
colors = ['#ff7f0e', '#1f77b4']

bars = ax.bar(labels, params, color=colors, alpha=0.7, edgecolor='black')
ax.set_ylabel('Number of Parameters', fontsize=12)
ax.set_title('Model Parameter Comparison', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, param in zip(bars, params):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{param:,}',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

Parameter Counts:
--------------------------------------------------
Shallow model (1 layer × 128 units): 385 parameters
Deep model (3 layers):                177 parameters
--------------------------------------------------
Ratio (shallow/deep): 2.18x
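The counts above can be reproduced by hand: a fully connected layer has one weight per input-output pair plus one bias per output. The sketch below is a plain-Python cross-check that does not need PyTorch; `linear_params` and `mlp_params` are hypothetical helper names introduced only for this illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def mlp_params(layer_sizes):
    """Total parameters for consecutive fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# Shallow: (1*128 + 128) + (128*1 + 1) = 256 + 129
shallow_by_hand = mlp_params([1, 128, 1])
# Deep: (1*16 + 16) + (16*8 + 8) + (8*1 + 1) = 32 + 136 + 9
deep_by_hand = mlp_params([1, 16, 8, 1])

print(f"Shallow: {shallow_by_hand}, Deep: {deep_by_hand}")
```

Both totals match the values reported by `count_parameters` (385 and 177). Most of the deep model's parameters sit in the 16 → 8 layer.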

Step 4: Train Both Models

# Training configuration
n_epochs = 500
learning_rate = 0.01
loss_fn = nn.MSELoss()

# Track training history
history = {
    'Shallow': {'train_loss': [], 'test_loss': []},
    'Deep': {'train_loss': [], 'test_loss': []}
}

models = {
    'Shallow': shallow_model,
    'Deep': deep_model
}

# Train each model
for name, model in models.items():
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(n_epochs):
        # Training
        model.train()
        y_pred = model(x_train_tensor)
        loss = loss_fn(y_pred, y_train_tensor)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        history[name]['train_loss'].append(loss.item())

        # Evaluation on test set
        model.eval()
        with torch.no_grad():
            y_test_pred = model(x_test_tensor)
            test_loss = loss_fn(y_test_pred, y_test_tensor).item()
            history[name]['test_loss'].append(test_loss)

    print(f"✓ {name} model trained")

print("\n✓ Training complete")

✓ Shallow model trained
✓ Deep model trained

✓ Training complete

Step 5: Compare Model Performance

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = {'Shallow': '#ff7f0e', 'Deep': '#1f77b4'}

# Plot training loss
for name, data in history.items():
    axes[0].plot(data['train_loss'], label=name, linewidth=2, color=colors[name])

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Training Loss (MSE)', fontsize=12)
axes[0].set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Plot test loss
for name, data in history.items():
    axes[1].plot(data['test_loss'], label=name, linewidth=2, color=colors[name])

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Test Loss (MSE)', fontsize=12)
axes[1].set_title('Test Loss Comparison', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

# Print final metrics
print(f"\nFinal Performance (after {n_epochs} epochs):")
print("-" * 70)
print(f"{'Model':<15} {'Parameters':<15} {'Train Loss':<15} {'Test Loss':<15}")
print("-" * 70)
for name in models.keys():
    params = count_parameters(models[name])
    train_loss = history[name]['train_loss'][-1]
    test_loss = history[name]['test_loss'][-1]
    print(f"{name:<15} {params:<15,} {train_loss:<15.6f} {test_loss:<15.6f}")


Final Performance (after 500 epochs):
----------------------------------------------------------------------
Model           Parameters      Train Loss      Test Loss      
----------------------------------------------------------------------
Shallow         385             0.000017        0.000022       
Deep            177             0.000012        0.000013

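In this run, the deep model reached a slightly lower test loss than the shallow model (0.000013 vs. 0.000022 MSE) with fewer than half the parameters (177 vs. 385). The result comes from a single seed on a simple one-dimensional task, so it illustrates the practical side of depth versus width: deeper networks can achieve competitive performance with fewer parameters.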
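To put the reported numbers side by side, the sketch below computes the two ratios using only the values printed in the final table. The dictionaries are hypothetical containers for those values. Because the losses are rounded to six decimals, the loss ratio is approximate.

```python
# Values copied from the final performance table of this run
shallow = {"params": 385, "test_mse": 0.000022}
deep = {"params": 177, "test_mse": 0.000013}

param_ratio = shallow["params"] / deep["params"]
loss_ratio = shallow["test_mse"] / deep["test_mse"]

print(f"Parameter ratio (shallow/deep): {param_ratio:.2f}x")
print(f"Test-loss ratio (shallow/deep): {loss_ratio:.2f}x")
```

The shallow model uses about 2.18 times as many parameters and ends with a test loss about 1.69 times higher in this run.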
