Positional Encoding
The attention mechanism is order-invariant. Positional Encodings (PE) inject information about each token's position within the context into the Transformer architecture. Typically, positional encodings are vectors of the same dimension as the embedding vectors, and they are added to them.
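For concreteness, here is a minimal NumPy sketch of the sinusoidal (deterministic absolute) scheme from the original Transformer. The sequence length, model dimension, and random embeddings are arbitrary illustrative choices, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Deterministic absolute PE from the original Transformer paper.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy token embeddings: the PE has the same shape and is simply added.
seq_len, d_model = 16, 64
token_embeddings = np.random.randn(seq_len, d_model)
x = token_embeddings + sinusoidal_pe(seq_len, d_model)
```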
PE can be absolute or relative, encoding either the absolute position of a token within the context or its position relative to other tokens. PE can also be deterministic or learned: either a user-specified function, or vectors initialized randomly and learned during training.
| | Absolute | Relative |
|---|---|---|
| Deterministic | Sinusoidal PE | |
| Learned | Convolutional Sequence to Sequence Learning | Self-Attention with Relative Position Representations; Music Transformer |
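To make the "learned relative" cell more concrete, here is a rough NumPy sketch in the spirit of Self-Attention with Relative Position Representations: a learned embedding per clipped relative distance is added on the key side when computing the attention logits. The function and variable names, the clipping window, and the toy shapes are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def relative_attention_logits(q, k, rel_k, max_dist):
    """Attention logits with learned relative position representations
    (key-side only), in the spirit of Shaw et al. 2018.

    q, k:   (seq_len, d) query and key matrices
    rel_k:  (2 * max_dist + 1, d) learned embeddings, one per clipped
            relative distance in [-max_dist, max_dist]
    """
    seq_len, d = q.shape
    # Clipped relative distance j - i for every query/key pair.
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    rel = np.clip(rel, -max_dist, max_dist) + max_dist     # indices into rel_k
    content = q @ k.T                                       # q_i . k_j
    position = np.einsum("id,ijd->ij", q, rel_k[rel])       # q_i . a^K_{ij}
    return (content + position) / np.sqrt(d)

# Toy usage with random weights (illustrative only).
seq_len, d, max_dist = 8, 16, 4
q = np.random.randn(seq_len, d)
k = np.random.randn(seq_len, d)
rel_k = 0.02 * np.random.randn(2 * max_dist + 1, d)
logits = relative_attention_logits(q, k, rel_k, max_dist)
```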
Additionally, CoPE (Contextual Position Encoding) uses a context-dependent relative PE, computing positions from the content of the tokens rather than from token counts alone.
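As a rough sketch of the CoPE idea: positions are obtained by summing sigmoid gates over the context, so they are fractional and content-dependent, and their embeddings are linearly interpolated between the two nearest integer positions before being added to the attention logits. Shapes, names, and the exact scaling below are assumptions for illustration, not the paper's reference code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cope_logits(q, k, pos_emb):
    """Context-dependent relative positions in the spirit of CoPE.

    q, k:     (seq_len, d) queries and keys
    pos_emb:  (max_pos + 1, d) learned embeddings for integer positions

    The position of key j relative to query i is the sum of the gates
    sigmoid(q_i . k_m) for m in [j, i]; it is fractional, so its embedding
    is interpolated between the two nearest integer positions.
    """
    seq_len, d = q.shape
    max_pos = pos_emb.shape[0] - 1
    scores = q @ k.T                                    # (seq_len, seq_len)
    gates = sigmoid(scores)
    # Causal mask: token i only counts (and attends to) positions j <= i.
    gates = gates * np.tril(np.ones((seq_len, seq_len)))
    # p[i, j] = sum_{m=j}^{i} gates[i, m]  (cumulative sum from the right).
    p = np.cumsum(gates[:, ::-1], axis=1)[:, ::-1]
    p = np.clip(p, 0, max_pos)
    lo = np.floor(p).astype(int)
    hi = np.minimum(lo + 1, max_pos)
    frac = p - lo
    # Interpolated positional logit q_i . e[p_ij].
    logit_lo = np.einsum("id,ijd->ij", q, pos_emb[lo])
    logit_hi = np.einsum("id,ijd->ij", q, pos_emb[hi])
    pos_logits = (1 - frac) * logit_lo + frac * logit_hi
    return (scores + pos_logits) / np.sqrt(d)
```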