Relative

Relative positional encoding was introduced in Self-Attention with Relative Position Representations and then improved in Music Transformer.

Self-Attention with Relative Position Representations

They modify the self-attention mechanism to include relative positional information. Suppose the input sequence is $x = (x_1, \ldots, x_n)$, so $n$ is the context length and $x_i \in \mathbb{R}^{d_x}$. In a single head with query, key, and value projection matrices $W^Q$, $W^K$, and $W^V$ and per-head dimension $d_z$, standard self-attention computes
\begin{align} e_{ij} &= \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d_z}} && \text{similarities} \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^n \exp(e_{ik})} && \text{softmax normalization} \\ z_i &= \sum_{j=1}^n \alpha_{ij} (x_j W^V) && \text{attention-weighted values} \end{align}
They additionally learn two vectors for each pair of positions in the context, $a_{ij}^K$ and $a_{ij}^V$. These vectors are $d_z$-dimensional and shared across heads. The modified attention mechanism is
\begin{align} e_{ij} &= \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}} && \text{relative similarities} \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^n \exp(e_{ik})} && \text{softmax normalization} \\ z_i &= \sum_{j=1}^n \alpha_{ij} (x_j W^V + a_{ij}^V) && \text{attention with relative values} \end{align}
In practice, they don't learn separate vectors for every possible pair of positions. Instead, they choose a maximum relative distance $k$ and clip the relative position $j - i$ to the range $[-k, k]$, so only $2k + 1$ vectors are learned for each of $a^K$ and $a^V$.
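
As a concrete illustration, here is a minimal single-head NumPy sketch of this mechanism with clipped relative embeddings. The function and variable names (`rel_attention`, `max_rel_dist`, etc.) and the random toy weights are illustrative choices, not the paper's released code.

```python
import numpy as np

def rel_attention(x, W_q, W_k, W_v, a_k, a_v, max_rel_dist):
    """x: (n, d_x); W_q/W_k/W_v: (d_x, d_z); a_k/a_v: (2*max_rel_dist + 1, d_z)."""
    n, d_z = x.shape[0], W_q.shape[1]
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                        # (n, d_z) each

    # Clip relative distances j - i to [-k, k], then shift to [0, 2k]
    # to index the learned tables a_k / a_v.
    rel = np.arange(n)[None, :] - np.arange(n)[:, None]        # (n, n), entry j - i
    rel = np.clip(rel, -max_rel_dist, max_rel_dist) + max_rel_dist
    A_k, A_v = a_k[rel], a_v[rel]                              # (n, n, d_z) each

    # e_ij = q_i · (k_j + a_ij^K) / sqrt(d_z)
    e = (Q @ K.T + np.einsum("id,ijd->ij", Q, A_k)) / np.sqrt(d_z)
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)                 # softmax over j

    # z_i = sum_j alpha_ij (v_j + a_ij^V)
    return alpha @ V + np.einsum("ij,ijd->id", alpha, A_v)

# Toy usage with random weights.
rng = np.random.default_rng(0)
n, d_x, d_z, k = 6, 16, 8, 3
x = rng.normal(size=(n, d_x))
W_q, W_k, W_v = (0.1 * rng.normal(size=(d_x, d_z)) for _ in range(3))
a_k, a_v = (0.1 * rng.normal(size=(2 * k + 1, d_z)) for _ in range(2))
print(rel_attention(x, W_q, W_k, W_v, a_k, a_v, k).shape)      # (6, 8)
```

Note that materializing the $(n, n, d_z)$ tensors `A_k` and `A_v` is memory-hungry; reducing that cost is exactly what the Music Transformer formulation improves on.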
