Post-LN
The original Transformer paper applied LayerNorm after the Multi-Head Self-Attention (MHSA) sublayer and its residual connection. Suppose