Layer Norm

Paper: Layer Normalization

Layer normalization differs from batch normalization in that the statistics are computed over the feature dimension of each token rather than over the batch. Each token is first normalized $$ x_{\text{normalized}} = \frac{x - \text{mean}(x)}{\text{std}(x)} $$ and then shifted and scaled by the learnable parameters $\gamma$ (scale) and $\beta$ (shift): $$ \gamma x_{\text{normalized}} + \beta. $$
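
A minimal NumPy sketch of the formula above (the epsilon term and tensor shapes are illustrative assumptions; frameworks provide this ready-made, e.g. `torch.nn.LayerNorm`):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token (last axis) to zero mean and unit variance,
    # then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    x_normalized = (x - mean) / (std + eps)
    return gamma * x_normalized + beta

# Example: batch of 2 sequences, 3 tokens each, model dimension 4
x = np.random.randn(2, 3, 4)
gamma = np.ones(4)   # scale, typically initialized to 1
beta = np.zeros(4)   # shift, typically initialized to 0
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1))  # ~0 for every token
```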

The original Transformer paper applied LayerNorm after the residual connection around each sub-layer (post-LN). Nowadays, most implementations apply LayerNorm before the MHSA and feed-forward layers (pre-LN), which stabilizes gradients during training; a sketch of both orderings follows.
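
A runnable sketch contrasting the two orderings. The `attention` and `feed_forward` functions here are trivial placeholders standing in for the real sub-layers, and the learnable scale/shift are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain normalization (gamma and beta omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Placeholder sub-layers; a real block would use self-attention and an MLP.
attention = lambda x: x @ np.eye(x.shape[-1])
feed_forward = lambda x: np.maximum(x, 0.0)

def post_ln_block(x):
    # Original Transformer ("post-LN"): normalize after the residual add.
    x = layer_norm(x + attention(x))
    return layer_norm(x + feed_forward(x))

def pre_ln_block(x):
    # Modern variant ("pre-LN"): normalize before each sub-layer,
    # so the residual path stays an identity.
    x = x + attention(layer_norm(x))
    return x + feed_forward(layer_norm(x))

x = np.random.randn(3, 4)  # 3 tokens, model dimension 4
print(post_ln_block(x).shape, pre_ln_block(x).shape)
```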
