Layer Norm
Paper: Layer Normalization
Layer normalization differs from batch normalization in where the statistics are computed: instead of normalizing each feature across the batch, it normalizes each token across its feature dimension. Each token is first normalized $$ x_{\text{normalized}} = \frac{x - \text{mean}(x)}{\text{std}(x)} $$ (in practice a small $\epsilon$ is added to the denominator for numerical stability) and then shifted and scaled by the learnable parameters $\gamma$ (scale) and $\beta$ (shift): $$ \gamma x_{\text{normalized}} + \beta. $$
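A minimal sketch of this computation in PyTorch, assuming an input of shape `(batch, seq_len, d_model)`; the module name `LayerNorm` and the `eps` default are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize each token over its feature dimension, then apply a
    learnable scale (gamma) and shift (beta)."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # shift
        self.eps = eps                                   # numerical stability

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); statistics are per token, over d_model
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True, unbiased=False)
        x_normalized = (x - mean) / (std + self.eps)
        return self.gamma * x_normalized + self.beta
```

For example, `LayerNorm(512)(torch.randn(2, 10, 512))` returns a tensor of the same shape where each token has roughly zero mean and unit standard deviation before the affine transform.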
The original Transformer paper applied LayerNorm after the residual connection around each sublayer (post-LN), i.e. $\text{LayerNorm}(x + \text{Sublayer}(x))$. Nowadays most implementations apply LayerNorm before the MHSA and feed-forward sublayers (pre-LN), which stabilizes gradients during training; see the sketch below.
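A sketch of the pre-LN placement using standard PyTorch modules; the dimensions (`d_model`, `n_heads`, `d_ff`) are illustrative choices, and the post-LN ordering from the original paper is shown in the comment:

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN transformer block: LayerNorm is applied to the input of each
    sublayer, so the residual path stays an identity mapping."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Pre-LN: normalize, run the sublayer, then add the residual
        h = self.norm1(x)
        attn_out, _ = self.mhsa(h, h, h)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x
        # Post-LN (original paper) would instead compute:
        #   x = norm1(x + mhsa(x, x, x)[0])
        #   x = norm2(x + ff(x))
```

The design difference is where normalization sits relative to the residual addition: in pre-LN the skip connection carries the raw activations through untouched, which is the commonly cited reason training is more stable without a learning-rate warmup.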