Paper Summary: An Introduction To Transformers - Turner (2023)
Summary of “An Introduction to Transformers” by Richard E. Turner (2023).
Input to the Transformer
The input to the transformer is a sequence of $N$ tokens $x_n \in \mathbb{R}^{D}$, i.e. $N$ vectors of dimension $D$ (for example, embeddings of words or image patches). These are collected column-wise into a matrix $X^{(0)} \in \mathbb{R}^{D \times N}$ whose $n$-th column is the token $x_n$.
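To make the shape convention concrete, here is a small NumPy sketch (not from the paper) that builds such an $X^{(0)}$ by stacking token embeddings as columns; the vocabulary size, dimensions and random embedding values are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

D, vocab_size = 8, 50                          # feature dimension and toy vocabulary size
embedding = rng.normal(size=(D, vocab_size))   # one D-dimensional column per vocabulary item

token_ids = [3, 17, 42, 7]                     # a toy input sequence of N = 4 tokens
X0 = embedding[:, token_ids]                   # X^(0) has shape (D, N): column n is token x_n

print(X0.shape)                                # (8, 4)
```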
Transformer Block
The output of the $m$-th transformer block is denoted $X^{(m)} \in \mathbb{R}^{D \times N}$, so each block maps $X^{(m-1)}$ to $X^{(m)}$ while preserving the $D \times N$ shape of the representation. Each block consists of two stages (a NumPy sketch of a full block follows the list below).
- Stage 1:
  - Standardisation: Each token is standardised separately before being fed into the MHSA layer to stabilise learning: from each token $x_n^{(m-1)}$ we subtract its mean and divide by its standard deviation (both computed over the $D$ features) to form $\bar{x}_n^{(m-1)}$; the standardised tokens are collected into $\bar{X}^{(m-1)}$. Typically this operation is called LayerNorm, but the author argues TokenNorm would be a more accurate name.
  - MHSA Layer: Consists of $H$ heads, meaning that we perform $H$ attention operations in parallel. Each operation, shown below, consists of multiplying the standardised sequence $\bar{X}^{(m-1)}$ by an attention matrix $A^{(h)} \in \mathbb{R}^{N \times N}$, thus obtaining a $D \times N$ matrix, and then "linearly projecting" this by $V_h \in \mathbb{R}^{D \times D}$:
    $$\text{MHSA}\big(\bar{X}^{(m-1)}\big) = \sum_{h=1}^{H} V_h \, \bar{X}^{(m-1)} A^{(h)}.$$
    The attention matrix is constructed from the input: the vectors $q_n^{(h)} = U_{q,h}\,\bar{x}_n^{(m-1)}$ and $k_n^{(h)} = U_{k,h}\,\bar{x}_n^{(m-1)}$ are known as queries and keys respectively, and we use two different matrices $U_{q,h}$ and $U_{k,h}$ to introduce a notion of asymmetry in the relationship between the various tokens. These matrices have shape $K \times D$, where typically $K < D$. Each element of $A^{(h)}$ is then the softmax of this query–key relationship,
    $$A^{(h)}_{n',n} = \frac{\exp\big(k_{n'}^{(h)\top} q_n^{(h)}\big)}{\sum_{n''=1}^{N} \exp\big(k_{n''}^{(h)\top} q_n^{(h)}\big)},$$
    and we use multiple heads because each head can capture a different type of relationship. The parameters for this layer are $V_h$, $U_{q,h}$ and $U_{k,h}$ for every $h = 1, \dots, H$.
  - Residual Connection: We use residual connections so that each transformation is not too dissimilar to an identity operation: each block only introduces a mild non-linearity, which stabilises training, while a deep enough stack can still learn complex representations. To implement the residual connection we simply compute
    $$Y^{(m)} = X^{(m-1)} + \text{MHSA}\big(\bar{X}^{(m-1)}\big),$$
    and we point out that we add the un-standardised $X^{(m-1)}$. This ends the first stage.
- Stage 2:
  - Standardisation: We do the same as before and standardise each token in $Y^{(m)}$ to obtain $\bar{Y}^{(m)}$.
  - MLP Layer: A simple (usually quite shallow) MLP is applied to each standardised token $\bar{y}_n^{(m)}$ to obtain the corresponding representation $\text{MLP}\big(\bar{y}_n^{(m)}\big)$. Notice that we have a single MLP (with shared parameters) and this same MLP is applied to every token.
  - Residual Connection: Finally, we apply one last residual connection to obtain
    $$X^{(m)} = Y^{(m)} + \text{MLP}\big(\bar{Y}^{(m)}\big),$$
    where $\text{MLP}(\bar{Y}^{(m)})$ denotes the MLP applied column by column, and we notice, once again, that we add the output of the first stage, $Y^{(m)}$, before it is standardised by the second stage.
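To tie the two stages together, below is a minimal NumPy sketch of one transformer block in the notation above. It is an illustrative re-implementation rather than the paper's code: the helper names (`token_norm`, `mhsa`, `mlp`), the ReLU single-hidden-layer MLP, the omission of learned scale/offset parameters in the normalisation, and the random toy parameters are all simplifying assumptions.

```python
import numpy as np

def token_norm(X, eps=1e-5):
    """Standardise each token (column) separately: subtract its mean, divide by its std."""
    return (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + eps)

def softmax_columns(S):
    """Softmax over the first axis, so that every column of the result sums to 1."""
    S = S - S.max(axis=0, keepdims=True)        # subtract the max for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def mhsa(X_bar, U_q, U_k, V):
    """Multi-head self-attention: sum over heads of V_h @ X_bar @ A_h."""
    D, N = X_bar.shape
    Y = np.zeros((D, N))
    for U_qh, U_kh, V_h in zip(U_q, U_k, V):
        Q_h = U_qh @ X_bar                      # queries, shape (K, N)
        K_h = U_kh @ X_bar                      # keys,    shape (K, N)
        A_h = softmax_columns(K_h.T @ Q_h)      # A_h[n', n] proportional to exp(k_{n'}^T q_n)
        Y += V_h @ X_bar @ A_h                  # attend, then linearly project with V_h
    return Y

def mlp(Y_bar, W1, b1, W2, b2):
    """The same shallow (one hidden layer, ReLU) MLP applied to every token/column."""
    H = np.maximum(0.0, W1 @ Y_bar + b1)
    return W2 @ H + b2

def transformer_block(X, p):
    # Stage 1: standardise, multi-head self-attention, residual (add the un-standardised X)
    Y = X + mhsa(token_norm(X), p["U_q"], p["U_k"], p["V"])
    # Stage 2: standardise, per-token MLP, residual (add the un-standardised Y)
    return Y + mlp(token_norm(Y), p["W1"], p["b1"], p["W2"], p["b2"])

# Toy usage with random parameters standing in for learned weights.
rng = np.random.default_rng(0)
D, N, H, K, hidden = 8, 4, 2, 4, 16
p = {
    "U_q": [rng.normal(size=(K, D)) for _ in range(H)],
    "U_k": [rng.normal(size=(K, D)) for _ in range(H)],
    "V":   [rng.normal(size=(D, D)) for _ in range(H)],
    "W1": rng.normal(size=(hidden, D)), "b1": np.zeros((hidden, 1)),
    "W2": rng.normal(size=(D, hidden)), "b2": np.zeros((D, 1)),
}
X1 = transformer_block(rng.normal(size=(D, N)), p)
print(X1.shape)  # (8, 4): the block preserves the D x N shape of the representation
```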
Transformers for Sequential Tasks
To use Transformers for sequential tasks, such as auto-regressive language modelling, we use masking, which means that the attention matrices $A^{(h)}$ are constrained so that $A^{(h)}_{n',n} = 0$ whenever $n' > n$ (the softmax is computed only over the allowed entries). In this way the representation of token $n$ depends only on tokens $1, \dots, n$, so the model cannot peek at future tokens when it is trained to predict them.
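As a rough sketch (under the column-normalised attention convention used above, and not taken from the paper), the mask can be applied to the raw query–key scores before the softmax so that entries with $n' > n$ receive zero weight:

```python
import numpy as np

def masked_attention_matrix(Q, K):
    """Causal attention: A[n', n] weighted by exp(k_{n'}^T q_n) for n' <= n, and A[n', n] = 0 for n' > n."""
    N = Q.shape[1]
    S = K.T @ Q                                      # raw scores, S[n', n] = k_{n'}^T q_n
    allowed = np.triu(np.ones((N, N), dtype=bool))   # True where n' <= n (no peeking at the future)
    S = np.where(allowed, S, -np.inf)                # masked entries get zero weight after the softmax
    S = S - S.max(axis=0, keepdims=True)             # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)          # each column still sums to 1

# Toy check with random queries/keys (placeholders for U_q @ X_bar and U_k @ X_bar).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 5))
K = rng.normal(size=(4, 5))
A = masked_attention_matrix(Q, K)
print(np.allclose(np.tril(A, -1), 0.0), np.allclose(A.sum(axis=0), 1.0))  # True True
```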