Batch Normalization: Accelerating Deep Network Training

Deep Learning
Optimization
Normalization
Author

Chao Ma

Published

February 5, 2026

Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Problem

Deep networks train slowly because the distribution of each layer's inputs keeps shifting as the parameters of earlier layers change (the paper calls this internal covariate shift). The shifting distributions can push activations into saturated regions (e.g., of a sigmoid), where gradients vanish and optimization stalls.

For a sigmoid \(g(x)=\frac{1}{1+e^{-x}}\), as \(|x|\) grows, \(g'(x)\to 0\). If the pre-activation \(x=Wu+b\) drifts, gradients to \(u\) shrink and learning slows.

Figure: sigmoid saturation.
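A quick numeric check (PyTorch; the sample points are arbitrary) of how the sigmoid gradient collapses away from zero:

```python
import torch

# g'(x) = g(x) * (1 - g(x)) shrinks rapidly as |x| grows.
x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # ≈ [0.25, 0.105, 0.0066, 4.5e-05]
```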

Key idea

Normalize activations within each mini-batch to stabilize their distribution, then allow the model to restore scale and shift with learnable parameters.

Intuition: BN keeps each feature in a predictable range so later layers do not chase a moving target, while \(\gamma,\beta\) preserve expressiveness by letting the network undo normalization if needed.

Why normalization must be differentiable

If we normalize using batch statistics but treat them as constants during backpropagation, gradient updates can be canceled by the next normalization step.

Example: take \(x = u + b\) and \(\hat{x} = x - \mathbb{E}[x]\). If we update \(b\) while ignoring that \(\mathbb{E}[x]\) depends on \(b\), the recomputed mean absorbs the change: \(\hat{x}\) and the loss stay the same while \(b\) grows without bound. The normalization must therefore be part of the computation graph.
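A minimal sketch of this cancellation with PyTorch autograd; the setup \(x = u + b\), the targets, and the squared-error loss are illustrative choices, not from the paper:

```python
import torch

torch.manual_seed(0)
u = torch.randn(8)
t = torch.randn(8)                      # arbitrary targets
b = torch.zeros(1, requires_grad=True)

def loss_fn(x_hat):
    return ((x_hat - t) ** 2).sum()

# Mean inside the graph: shifting b cannot change x - mean(x),
# so the gradient w.r.t. b is (numerically) zero and b is left alone.
x = u + b
loss_fn(x - x.mean()).backward()
print(b.grad)   # ≈ 0

# Mean treated as a constant: b receives a nonzero gradient, but the update
# it implies is wiped out by the next recomputed mean, so b just grows.
b.grad = None
x = u + b
loss_fn(x - x.mean().detach()).backward()
print(b.grad)   # nonzero
```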

Where BN is applied

For a layer pre-activation \(z = Wx + b\), the paper applies BN to \(z\) before the nonlinearity:

\[ \tilde{z} = \text{BN}(z), \quad y = \phi(\tilde{z}) \]

Because \(\beta\) subsumes the bias, \(b\) is often dropped from the affine layer when BN follows it. Some modern implementations instead apply BN after the activation; the key is to normalize each feature (or channel) consistently.
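A sketch of this placement using PyTorch module names (the layer widths are arbitrary):

```python
import torch.nn as nn

# Affine transform -> BatchNorm -> nonlinearity.
# bias=False because BN's beta plays the role of the bias here.
block = nn.Sequential(
    nn.Linear(128, 64, bias=False),
    nn.BatchNorm1d(64),
    nn.ReLU(),
)
```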

BatchNorm for a vector layer

For a mini-batch \(\mathcal{B} = \{x_1,\dots,x_m\}\), each feature dimension is normalized independently using the batch mean and variance:

\[ \mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m x_i,\quad \sigma^2_\mathcal{B} = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_\mathcal{B})^2 \]

Normalize and re-scale:

\[ \hat{x}_i = \frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma^2_\mathcal{B}+\epsilon}},\quad y_i = \gamma\hat{x}_i + \beta \]

  • \(\gamma,\beta\) are learnable, so the layer can recover the original activation scale if needed.
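A minimal forward pass implementing these equations (PyTorch; the shapes and \(\epsilon\) are our choices):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Per-feature BatchNorm for x of shape (m, d); gamma and beta of shape (d,)."""
    mu = x.mean(dim=0)                          # mu_B
    var = x.var(dim=0, unbiased=False)          # sigma^2_B (biased, as in the paper)
    x_hat = (x - mu) / torch.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta                 # scale and shift

x = torch.randn(32, 4) * 3.0 + 2.0              # features far from zero mean / unit variance
y = batchnorm_forward(x, torch.ones(4), torch.zeros(4))
print(y.mean(dim=0))                            # ≈ 0 per feature
print(y.std(dim=0, unbiased=False))             # ≈ 1 per feature
```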

Figure: the BatchNorm transform.

Gradients (core idea)

BatchNorm is fully differentiable:

  • \(\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\cdot \gamma\)
  • \(\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i}\hat{x}_i\)
  • \(\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}\)
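The full gradient with respect to \(x_i\) also carries chain-rule terms through \(\mu_\mathcal{B}\) and \(\sigma^2_\mathcal{B}\) (spelled out in the paper), which is exactly why the normalization must sit inside the computation graph. A quick autograd check of the \(\gamma\) and \(\beta\) formulas above (shapes are arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 4)
gamma = torch.ones(4, requires_grad=True)
beta = torch.zeros(4, requires_grad=True)

x_hat = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + 1e-5)
y = gamma * x_hat + beta
upstream = torch.randn_like(y)      # stand-in for dL/dy
y.backward(upstream)

# dL/dgamma = sum_i (dL/dy_i) * x_hat_i   and   dL/dbeta = sum_i dL/dy_i
print(torch.allclose(gamma.grad, (upstream * x_hat).sum(0)))  # True
print(torch.allclose(beta.grad, upstream.sum(0)))             # True
```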

2D BatchNorm (CNNs)

For \(x\in\mathbb{R}^{N\times C\times H\times W}\), normalize per channel:

\[ \mu_c = \frac{1}{NHW}\sum_{i,h,w} x_{i,c,h,w},\quad \sigma_c^2 = \frac{1}{NHW}\sum_{i,h,w} (x_{i,c,h,w}-\mu_c)^2 \]

\[ \hat{x}_{i,c,h,w} = \frac{x_{i,c,h,w}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}},\quad y_{i,c,h,w} = \gamma_c\hat{x}_{i,c,h,w}+\beta_c \]
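A sketch comparing this per-channel computation against torch.nn.BatchNorm2d in training mode (default \(\epsilon = 10^{-5}\); \(\gamma_c = 1\), \(\beta_c = 0\) at initialization):

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3, 5, 5)                                  # (N, C, H, W)

mu_c = x.mean(dim=(0, 2, 3), keepdim=True)                   # per-channel mean over N, H, W
var_c = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # per-channel biased variance
x_hat = (x - mu_c) / torch.sqrt(var_c + 1e-5)

bn = torch.nn.BatchNorm2d(3).train()
print(torch.allclose(bn(x), x_hat, atol=1e-6))               # True
```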

Training vs inference

  • Training uses batch statistics.
  • Inference uses running averages of mean/variance accumulated during training.

A typical update for running stats:

\[ \mu_{\text{run}} \leftarrow \rho\,\mu_{\text{run}} + (1-\rho)\,\mu_{\mathcal{B}},\quad \sigma^2_{\text{run}} \leftarrow \rho\,\sigma^2_{\text{run}} + (1-\rho)\,\sigma^2_{\mathcal{B}} \]

This makes predictions deterministic and independent of batch composition.
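A sketch of the train/eval split with torch.nn.BatchNorm1d. Note that PyTorch parameterizes the update as momentum \(= 1-\rho\) (default 0.1, i.e. \(\rho = 0.9\)) and tracks the unbiased batch variance in its running estimate:

```python
import torch

bn = torch.nn.BatchNorm1d(4, momentum=0.1)      # rho = 0.9 in the notation above

# Training mode: each forward pass updates the running statistics from batch statistics.
for _ in range(200):
    bn(torch.randn(32, 4) * 2.0 + 1.0)          # features with mean ~1, std ~2

print(bn.running_mean)                          # ≈ 1 per feature
print(bn.running_var)                           # ≈ 4 per feature

# Inference mode: the running statistics are used, so an example's output
# no longer depends on which other examples share its batch.
bn.eval()
x = torch.randn(5, 4)
print(torch.allclose(bn(x)[:2], bn(x[:2])))     # True
```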

Effects and practical notes

  • Enables larger learning rates and faster convergence.
  • Acts as a form of regularization (mini-batch noise).
  • Reduces sensitivity to initialization.
  • Improves conditioning by keeping feature scales comparable.

Takeaway. BatchNorm stabilizes activation distributions by normalizing per mini-batch and then re-introducing scale and shift. This makes deep optimization easier and faster while adding mild regularization.