CoPE

Contextual Position Encoding (CoPE) computes context-dependent relative position values by modifying the attention mechanism in the same way as relative PE: the $(i, j)$ entry of the attention matrix becomes $$ a_{ij} = \text{softmax}\left(\mathbf{q}_i^\top(\mathbf{k}_j + W_P[p_{ij}])\right), $$ where $W_P$ is a position embedding matrix of shape $(n_{\text{embedding}}, \text{context-length})$ and $p_{ij}$ is a “fractional” index. Typically $p_{ij}$ is an integer, so one simply plucks out a column of $W_P$. Here, however, $W_P[p_{ij}]$ is a vector obtained by interpolating between columns of $W_P$.
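
As a minimal sketch of this logit for a single $(i, j)$ pair (not the paper's implementation), assuming PyTorch, a $W_P$ of shape $(d, n_{\text{pos}})$ whose columns are position embeddings, and an illustrative function name `cope_logit`:

```python
import math
import torch

def cope_logit(q_i: torch.Tensor, k_j: torch.Tensor, W_P: torch.Tensor, p_ij: float) -> torch.Tensor:
    """Attention logit q_i^T (k_j + W_P[:, p_ij]) for a single (i, j) pair,
    where W_P[:, p_ij] linearly interpolates between the two neighbouring
    columns of W_P when p_ij is fractional.

    q_i, k_j : (d,) query and key vectors.
    W_P      : (d, n_pos) position embedding matrix (columns = positions).
    p_ij     : possibly fractional position value, 0 <= p_ij <= n_pos - 1.
    """
    lo, hi = math.floor(p_ij), math.ceil(p_ij)
    frac = p_ij - lo                                    # interpolation weight
    e = frac * W_P[:, hi] + (1.0 - frac) * W_P[:, lo]   # interpolated "column" W_P[:, p_ij]
    return torch.dot(q_i, k_j + e)
```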

The position of token $j$ with respect to token $i$, where $j \leq i$, is computed as follows. For every intermediate token $k$ with $j\leq k \leq i$ we compute a gate $g_{ik} = \sigma(\mathbf{q}_i^\top \mathbf{k}_k)$, which tells us “how much” token $k$ is counted when measuring the position of token $j$ relative to token $i$. The overall context-dependent position value is $$ p_{ij} = \sum_{k=j}^i g_{ik}. $$ Since each gate lies in $(0, 1)$, these positions are fractional and depend on context. In relative PE the value $p_{ij}$ would simply be $i - j + 1$, an integer, and we would use it to “pluck out” a column of the position embedding matrix $W_P$ of shape $(n_{\text{embedding}}, \text{context-length})$. Here instead $p_{ij}$ is fractional, so we cannot index $W_P$ directly; we take an interpolation of the two neighbouring columns: $$ W_P[:, p_{ij}] = \left(p_{ij} - \lfloor p_{ij}\rfloor\right) W_P[:, \lceil p_{ij}\rceil] + \left(1 - \left(p_{ij} - \lfloor p_{ij} \rfloor\right)\right) W_P[:, \lfloor p_{ij}\rfloor]. $$
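
Putting the pieces together, here is a rough single-head, causal-attention sketch of the formulation above, assuming PyTorch; the name `cope_attention_scores`, the optional `p_max` argument, and the shapes are illustrative, not the paper's API:

```python
import torch
import torch.nn.functional as F

def cope_attention_scores(q, k, W_P, p_max=None):
    """Single-head CoPE attention sketch (illustrative, not optimized).

    q, k : (seq_len, d) query and key matrices for one head.
    W_P  : (d, n_pos) position embedding matrix; columns are position vectors.
    p_max: optional clipping value for the fractional positions.
    Returns the (seq_len, seq_len) causal attention matrix.
    """
    seq_len, d = q.shape
    logits = q @ k.t()                                        # content logits q_i^T k_j
    causal = torch.ones(seq_len, seq_len).tril().bool()       # j <= i mask

    # Gates g_ik = sigma(q_i^T k_k), only for causal positions k <= i.
    gates = torch.sigmoid(logits).masked_fill(~causal, 0.0)

    # p_ij = sum_{k=j}^{i} g_ik  ->  reversed cumulative sum over the key axis.
    p = gates.flip(dims=(-1,)).cumsum(dim=-1).flip(dims=(-1,))
    if p_max is not None:
        p = p.clamp(max=p_max)                                # optional clipping

    # Interpolate between the floor and ceil columns of W_P for each p_ij.
    n_pos = W_P.shape[1]
    p_floor = p.floor().long().clamp(max=n_pos - 1)
    p_ceil = p.ceil().long().clamp(max=n_pos - 1)
    frac = (p - p.floor()).unsqueeze(-1)                      # (seq, seq, 1)
    e = (1 - frac) * W_P.t()[p_floor] + frac * W_P.t()[p_ceil]  # (seq, seq, d)

    # Position term q_i^T W_P[:, p_ij], added to the content logits.
    pos_logits = torch.einsum("id,ijd->ij", q, e)
    scores = (logits + pos_logits).masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1)
```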

Nuances

  1. Each head will do CoPE independently.
  2. The maximum value of $p_{ij}$ is the context length, but it can be clipped as $p_{ij} = \min(p_{ij}, p_{\text{max}})$; in the paper's experiments $p_{\text{max}}$ is chosen to be small (see the usage sketch below).
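
For instance, reusing the `cope_attention_scores` sketch from above, a multi-head usage might look like the following; all shapes and the value of `p_max` are made up for illustration, and whether $W_P$ is shared across heads or kept per head is an implementation choice (per head here):

```python
import torch

# Hypothetical usage: one call per head, each head with its own position
# embedding matrix and a small clipping value p_max.
seq_len, n_heads, d_head, p_max = 8, 4, 16, 6      # made-up values
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)
W_P = torch.randn(n_heads, d_head, p_max + 1)      # p_max + 1 position columns per head

attn = torch.stack([
    cope_attention_scores(q[h], k[h], W_P[h], p_max=p_max)   # each head independently
    for h in range(n_heads)
])
print(attn.shape)   # torch.Size([4, 8, 8])
```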