Relative

Relative positional encoding was introduced in "Self-Attention with Relative Position Representations" and then improved in "Music Transformer".

Self-Attention with Relative Position Representations

They modify the self-attention mechanism to include relative positional information. Suppose the input sequence is $x = (x_1, \dots, x_n)$, meaning that $n$ is the context length, and $x_i \in \mathbb{R}^{d_x}$. In a single head, the standard self-attention mechanism, given query, key and value matrices $W^Q$, $W^K$ and $W^V$, computes

$$
e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d_z}} \quad \text{(similarities)}
$$

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \quad \text{(softmax normalization)}
$$

$$
z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V) \quad \text{(attention and value multiplication)}
$$

They additionally learn two vectors for each pair of tokens in the context, $a_{ij}^V$ and $a_{ij}^K$. These vectors are $d_z$-dimensional and are shared across heads. Their new attention mechanism is then

$$
e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}} \quad \text{(relative similarities)}
$$

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \quad \text{(softmax normalization)}
$$

$$
z_i = \sum_{j=1}^{n} \alpha_{ij} \, (x_j W^V + a_{ij}^V) \quad \text{(attention with relative values)}
$$

In practice, they don't learn two vectors for every possible pair of tokens in the context. Instead, they choose a maximum "relative distance" $k$ and clip the relative position $j - i$ to $[-k, k]$, so that they only learn $a_{ij}^V$ and $a_{ij}^K$ for relative distances from $-k$ to $k$.
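To make this concrete, here is a minimal NumPy sketch of a single head of self-attention with relative position representations, including the distance clipping. The function name, parameter names such as `max_dist`, and the random weights are illustrative assumptions, not code from the paper.

```python
# Minimal sketch (assumed names, not the paper's code) of single-head
# self-attention with relative position representations.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, W_Q, W_K, W_V, a_K, a_V, max_dist):
    """x: (n, d_x); W_*: (d_x, d_z); a_K, a_V: (2*max_dist + 1, d_z)."""
    n = x.shape[0]
    d_z = W_Q.shape[1]

    Q, K, V = x @ W_Q, x @ W_K, x @ W_V            # each (n, d_z)

    # Relative distances j - i, clipped to [-k, k] and shifted to 0 .. 2k
    # so they index into the learned tables a_K and a_V.
    rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                  -max_dist, max_dist) + max_dist   # (n, n)
    A_K = a_K[rel]                                  # (n, n, d_z)
    A_V = a_V[rel]                                  # (n, n, d_z)

    # e_ij = q_i . (k_j + a_ij^K) / sqrt(d_z)
    e = (Q @ K.T + np.einsum("id,ijd->ij", Q, A_K)) / np.sqrt(d_z)
    alpha = softmax(e, axis=-1)                     # (n, n)

    # z_i = sum_j alpha_ij (v_j + a_ij^V)
    return alpha @ V + np.einsum("ij,ijd->id", alpha, A_V)

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
n, d_x, d_z, k = 5, 8, 4, 2
x = rng.normal(size=(n, d_x))
W_Q, W_K, W_V = (rng.normal(size=(d_x, d_z)) for _ in range(3))
a_K, a_V = (rng.normal(size=(2 * k + 1, d_z)) for _ in range(2))
print(relative_self_attention(x, W_Q, W_K, W_V, a_K, a_V, k).shape)  # (5, 4)
```

Note how the clipping shows up only in the indexing step: every pair $(i, j)$ whose distance exceeds $k$ shares the same embedding row, which is what keeps the number of learned vectors at $2k + 1$ per table.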
