Relative positional encoding was introduced in Self-Attention with Relative Position Representations and then improved in Music Transformer.
Self-Attention with Relative Position Representations
They modify the self-attention mechanism to include relative positional information. Suppose the input sequence is $x = (x_1, \dots, x_n)$, meaning that $n$ is the context length, and $x_i \in \mathbb{R}^{d_x}$. In a single head, the standard self-attention mechanism, given query, key and value projection matrices $W^Q$, $W^K$ and $W^V$, computes

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V),$$

where

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}} \quad \text{and} \quad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d_z}}.$$
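To make the notation concrete, here is a minimal PyTorch sketch of this standard single-head, unmasked self-attention; the function name and tensor shapes are my own choices, not code from the paper.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_Q, W_K, W_V):
    """Standard single-head self-attention (no masking).

    x: (n, d_x) input sequence; W_Q, W_K, W_V: (d_x, d_z) projections.
    Returns z: (n, d_z).
    """
    q, k, v = x @ W_Q, x @ W_K, x @ W_V        # each (n, d_z)
    d_z = q.shape[-1]
    e = q @ k.T / d_z ** 0.5                   # (n, n) logits e_ij
    alpha = F.softmax(e, dim=-1)               # (n, n) weights alpha_ij
    return alpha @ v                           # (n, d_z) outputs z_i
```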
They additionally learn two vectors for each pair of positions $(i, j)$ in the context, $a_{ij}^K$ and $a_{ij}^V$. These vectors are $d_z$-dimensional and are shared across heads. Then, their new attention mechanism is

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V + a_{ij}^V), \qquad e_{ij} = \frac{x_i W^Q \, (x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}},$$

with $\alpha_{ij}$ obtained from $e_{ij}$ by the same softmax as before.
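Continuing the sketch above, the relative-position variant only changes the logits and the value aggregation. Here `a_K` and `a_V` are assumed to be precomputed $(n, n, d_z)$ tensors holding $a_{ij}^K$ and $a_{ij}^V$; again, names and shapes are mine rather than the paper's.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, W_Q, W_K, W_V, a_K, a_V):
    """Single-head self-attention with relative position representations.

    x: (n, d_x); W_Q, W_K, W_V: (d_x, d_z);
    a_K, a_V: (n, n, d_z) with a_K[i, j] = a_ij^K and a_V[i, j] = a_ij^V.
    """
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                       # each (n, d_z)
    d_z = q.shape[-1]
    # e_ij = x_i W^Q (x_j W^K + a_ij^K)^T / sqrt(d_z)
    e = (q @ k.T + torch.einsum('id,ijd->ij', q, a_K)) / d_z ** 0.5
    alpha = F.softmax(e, dim=-1)                              # (n, n)
    # z_i = sum_j alpha_ij (x_j W^V + a_ij^V)
    return alpha @ v + torch.einsum('ij,ijd->id', alpha, a_V)
```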
In practice, they don’t learn two vectors for every possible pair of tokens in the context. Instead, they choose a maximum “relative distance” $k$ and clip the relative position, so that they only learn the vectors $w^K_{-k}, \dots, w^K_{k}$ and $w^V_{-k}, \dots, w^V_{k}$ and set $a_{ij}^K = w^K_{\mathrm{clip}(j - i,\, k)}$ and $a_{ij}^V = w^V_{\mathrm{clip}(j - i,\, k)}$, where $\mathrm{clip}(x, k) = \max(-k, \min(k, x))$.
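As a small sketch of how this clipped lookup could be materialised into the $(n, n, d_z)$ tensors used above, the hypothetical helper below takes the learned table $w_{-k}, \dots, w_k$ (for either keys or values) and expands it; it is an illustration, not code from the paper.

```python
import torch

def relative_embedding_table(w, n, k):
    """Expand learned vectors w_{-k..k} into an (n, n, d_z) tensor.

    w: (2k + 1, d_z), where row r corresponds to relative position r - k.
    Returns a with a[i, j] = w_{clip(j - i, k)}.
    """
    i = torch.arange(n).unsqueeze(1)          # (n, 1)
    j = torch.arange(n).unsqueeze(0)          # (1, n)
    rel = torch.clamp(j - i, -k, k) + k       # (n, n) indices shifted into [0, 2k]
    return w[rel]                             # (n, n, d_z)

# e.g. a_K = relative_embedding_table(w_K, n, k); a_V = relative_embedding_table(w_V, n, k)
```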