Post-LN

The original Transformer paper applied LayerNorm after the residual connection that follows each sub-layer. If x is the input sequence of tokens, it first passes through Multi-Head Self-Attention (MHSA), the result is added back to x via the residual connection, and the sum is then normalized: y = LayerNorm(MHSA(x) + x). The output y is then fed through a Feed-Forward Neural Network (FFNN) following the same procedure: z = LayerNorm(FFNN(y) + y). This arrangement is known as Post-LN, since the LayerNorm is applied after the layer of interest (MHSA/FFNN) and, importantly, after the residual connection.
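
The sketch below shows one way a Post-LN block could be written in PyTorch, directly mirroring y = LayerNorm(MHSA(x) + x) and z = LayerNorm(FFNN(y) + y); the class name and hyperparameters (d_model, n_heads, d_ff) are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of a Post-LN Transformer block (hyperparameters are assumed).
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # y = LayerNorm(MHSA(x) + x): the norm is applied after the residual add.
        attn_out, _ = self.mhsa(x, x, x)
        y = self.ln1(attn_out + x)
        # z = LayerNorm(FFNN(y) + y): same pattern for the feed-forward sub-layer.
        z = self.ln2(self.ffnn(y) + y)
        return z

# Example: a batch of 2 sequences, 10 tokens each, with d_model = 512.
x = torch.randn(2, 10, 512)
z = PostLNBlock()(x)
print(z.shape)  # torch.Size([2, 10, 512])
```

Note that the LayerNorm sits outside the residual branch, which is exactly what distinguishes Post-LN from the Pre-LN arrangement.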
