Post-LN

The original Transformer paper applied LayerNorm after the residual connection that follows each sub-layer. If x is the input sequence of tokens, it first passes through Multi-Head Self-Attention (MHSA), the result is added back to x via the residual connection, and the sum is then normalized: y = LayerNorm(MHSA(x) + x). The output y is then fed through a Feed-Forward Neural Network (FFNN) following the same procedure: z = LayerNorm(FFNN(y) + y). This arrangement is known as Post-LN, since the LayerNorm is applied after the layer of interest (MHSA/FFNN) and, importantly, after the residual connection.
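
The sketch below shows one way a Post-LN block could be written in PyTorch, directly mirroring y = LayerNorm(MHSA(x) + x) and z = LayerNorm(FFNN(y) + y); the class name and hyperparameters (d_model, n_heads, d_ff) are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of a Post-LN Transformer block (hyperparameters are assumed).
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # y = LayerNorm(MHSA(x) + x): the norm is applied after the residual add.
        attn_out, _ = self.mhsa(x, x, x)
        y = self.ln1(attn_out + x)
        # z = LayerNorm(FFNN(y) + y): same pattern for the feed-forward sub-layer.
        z = self.ln2(self.ffnn(y) + y)
        return z

# Example: a batch of 2 sequences, 10 tokens each, with d_model = 512.
x = torch.randn(2, 10, 512)
z = PostLNBlock()(x)
print(z.shape)  # torch.Size([2, 10, 512])
```

Note that the LayerNorm sits outside the residual branch, which is exactly what distinguishes Post-LN from the Pre-LN arrangement.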
