Layer Norm

Paper: Layer Normalization

Layer normalization differs from batch normalization: instead of normalizing across the batch dimension, each token is normalized across its feature dimension, $x_{\text{normalized}} = \frac{x - \text{mean}(x)}{\text{std}(x)}$, and then shifted and scaled by the learnable parameters $\gamma$ (scale) and $\beta$ (shift): $\gamma \, x_{\text{normalized}} + \beta$.
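A minimal sketch of this per-token computation in PyTorch (the names and dimensions here, such as d_model, are illustrative; eps is the usual small constant added for numerical stability):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token over its feature (last) dimension,
    # then scale by gamma and shift by beta.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_normalized = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_normalized + beta

# Example: a batch of 2 sequences, 4 tokens each, hidden size 8.
d_model = 8
x = torch.randn(2, 4, d_model)
gamma = torch.ones(d_model)   # learnable scale, usually initialized to 1
beta = torch.zeros(d_model)   # learnable shift, usually initialized to 0

out = layer_norm(x, gamma, beta)
# Matches PyTorch's built-in layer norm (same eps placement inside the sqrt).
print(torch.allclose(out, torch.nn.functional.layer_norm(x, (d_model,), gamma, beta, eps=1e-5)))
```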

The original Transformer paper applied LayerNorm after the residual connection around each MHSA and FF sublayer (post-LN). Nowadays, most implementations apply LayerNorm before the MHSA and FF layers (pre-LN), which stabilizes gradients during training. The two orderings are contrasted in the sketch below.
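A sketch of the two placements, assuming PyTorch-style blocks (the GELU activation and the exact module layout are illustrative choices, not prescribed by the paper):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-LN: normalize *before* each sublayer, then add the residual.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x + MHSA(LN(x))
        x = x + self.ff(self.ln2(x))                        # x + FF(LN(x))
        return x

class PostLNBlock(nn.Module):
    # Post-LN (original Transformer): apply the sublayer, add the residual,
    # then normalize the sum.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # LN(x + MHSA(x))
        x = self.ln2(x + self.ff(x))                                  # LN(x + FF(x))
        return x
```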
