Layer Norm

Paper: Layer Normalization

Layer normalization differs from batch normalization: instead of normalizing across the batch dimension, each token is normalized across its feature dimension, $x_{\text{normalized}} = \frac{x - \text{mean}(x)}{\text{std}(x)}$, and then shifted and scaled by the learnable parameters $\gamma$ (scale) and $\beta$ (shift): $\gamma \, x_{\text{normalized}} + \beta$.
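A minimal sketch of this per-token computation in PyTorch (the names and dimensions here, such as d_model, are illustrative; eps is the usual small constant added for numerical stability):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token over its feature (last) dimension,
    # then scale by gamma and shift by beta.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_normalized = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_normalized + beta

# Example: a batch of 2 sequences, 4 tokens each, hidden size 8.
d_model = 8
x = torch.randn(2, 4, d_model)
gamma = torch.ones(d_model)   # learnable scale, usually initialized to 1
beta = torch.zeros(d_model)   # learnable shift, usually initialized to 0

out = layer_norm(x, gamma, beta)
# Matches PyTorch's built-in layer norm (same eps placement inside the sqrt).
print(torch.allclose(out, torch.nn.functional.layer_norm(x, (d_model,), gamma, beta, eps=1e-5)))
```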

The original Transformer paper applied LayerNorm after the residual connection around each MHSA and FF sublayer (post-LN). Nowadays, most implementations apply LayerNorm before the MHSA and FF layers (pre-LN), which stabilizes gradients during training. The two orderings are contrasted in the sketch below.
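A sketch of the two placements, assuming PyTorch-style blocks (the GELU activation and the exact module layout are illustrative choices, not prescribed by the paper):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-LN: normalize *before* each sublayer, then add the residual.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x + MHSA(LN(x))
        x = x + self.ff(self.ln2(x))                        # x + FF(LN(x))
        return x

class PostLNBlock(nn.Module):
    # Post-LN (original Transformer): apply the sublayer, add the residual,
    # then normalize the sum.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # LN(x + MHSA(x))
        x = self.ln2(x + self.ff(x))                                  # LN(x + FF(x))
        return x
```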
