Layer Norm
Paper: Layer Normalization
Layer normalization is different from batch normalization. In layer normalization, each token is normalized independently across its hidden (feature) dimension, whereas batch normalization normalizes each feature across the batch. The normalized values are then scaled and shifted by learned parameters.
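A minimal sketch of the per-token computation, assuming an input of shape (batch, seq_len, hidden); the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, seq_len, hidden). Statistics are computed per token,
    # i.e. over the last (hidden) dimension only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each token's features
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(2, 4, 8)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```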
The original Transformer paper applied LayerNorm after the residual connection around each MHSA and FF sub-layer (post-LN). However, nowadays people usually apply LayerNorm before the MHSA and FF layers (pre-LN), which stabilizes gradients during training.
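A sketch of the pre-LN ordering, assuming a standard PyTorch encoder-style block; the class name and layer sizes are illustrative:

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-LN: normalize before each sub-layer, then add the residual.
    # Post-LN (original Transformer) would instead be x = LN(x + sublayer(x)).
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]  # x + MHSA(LN(x))
        x = x + self.ff(self.ln2(x))   # x + FF(LN(x))
        return x
```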