Contextual Positional Embedding (CoPE) computes context-dependent relative position values, and it does so by modifying the attention mechanism. It modifies the attention mechanism in the same way as relative PE, i.e. the (pre-softmax) entry of the attention matrix becomes

$$a_{ij} = \mathbf{q}_i^\top \mathbf{k}_j + \mathbf{q}_i^\top \mathbf{e}[p_{ij}],$$

where $E \in \mathbb{R}^{d \times (p_{\max}+1)}$ is a position embedding matrix and $p_{ij}$ is a "fractional" index. Typically the index is an integer, and so one plucks out a column of $E$. Here, however, $\mathbf{e}[p_{ij}]$ is a vector that is an interpolation between columns of $E$.
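For contrast, here is a toy relative-PE lookup in PyTorch (my own sketch with made-up shapes, not from the paper; the table is stored row-wise for convenience): since the offset $i - j$ is an integer, the position embedding is a single entry of the table.

```python
import torch

# Toy relative-PE lookup (assumed shapes): the offset i - j is an integer,
# so the position embedding is just one row of the table E.
T, d = 16, 8
E = torch.randn(T, d)                      # one embedding per integer offset 0..T-1
q, k = torch.randn(T, d), torch.randn(T, d)

i, j = 10, 3
logit_ij = q[i] @ k[j] + q[i] @ E[i - j]   # integer index "plucks out" entry i - j
```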
The position of token $j$ with respect to token $i$, where $j \le i$, is computed as follows. For every intermediate token $k$ with $j \le k \le i$ we compute a gate

$$g_{ik} = \sigma\!\left(\mathbf{q}_i^\top \mathbf{k}_k\right),$$

which tells us "how much" token $k$ will be counted in measuring the position of token $j$ relative to token $i$, so that we can have fractional values that depend on context. Overall, the context-dependent position value is

$$p_{ij} = \sum_{k=j}^{i} g_{ik}.$$

In, say, relative PE this value would simply be $i - j$ and take integer values, which we would use to pluck out a column of a position embedding matrix of size $d \times T$. Here instead the value is fractional, so we cannot index $E$ with it directly. Therefore we let $\mathbf{e}[p_{ij}]$ be an interpolation of the two nearest columns:

$$\mathbf{e}[p_{ij}] = \left(p_{ij} - \lfloor p_{ij} \rfloor\right) \mathbf{e}\!\left[\lceil p_{ij} \rceil\right] + \left(1 - p_{ij} + \lfloor p_{ij} \rfloor\right) \mathbf{e}\!\left[\lfloor p_{ij} \rfloor\right].$$
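Putting the pieces together, here is a minimal single-head sketch of the computation in PyTorch (my own illustration, not the authors' code; the names `cope_logits`, `E`, and `p_max` are assumptions, and the clipping line anticipates the nuance listed below):

```python
import torch

def cope_logits(q, k, E, p_max):
    """Sketch of CoPE attention logits for one causal head.
    q, k : (T, d) query/key matrices
    E    : (p_max + 1, d) learned position embeddings (one row per position)
    Returns the (T, T) pre-softmax attention logits."""
    T, _ = q.shape
    causal = torch.tril(torch.ones(T, T)).bool()

    # Context part: standard dot-product logits q_i . k_j
    logits = q @ k.t()                                 # (T, T)

    # Gates g_ik = sigmoid(q_i . k_k), zeroed outside the causal range
    gates = torch.sigmoid(logits).masked_fill(~causal, 0.0)

    # Position p_ij = sum_{k=j}^{i} g_ik: a reversed cumulative sum over k
    p = gates.flip(-1).cumsum(-1).flip(-1)             # p[i, j] = sum_{k >= j} g[i, k]
    p = p.clamp(max=p_max)                             # clip so E stays small

    # Fractional p_ij -> interpolate between the two nearest embeddings
    lo, hi = p.floor().long(), p.ceil().long()
    w = (p - p.floor()).unsqueeze(-1)                  # interpolation weight in [0, 1]
    e = (1.0 - w) * E[lo] + w * E[hi]                  # (T, T, d), e[i, j] = e[p_ij]

    # Position part of the logit: q_i . e[p_ij]
    pos_logits = torch.einsum("id,ijd->ij", q, e)
    return (logits + pos_logits).masked_fill(~causal, float("-inf"))
```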
Nuances
- Each head will do CoPE independently.
- The maximum possible value of $p_{ij}$ is $i - j + 1$ (all gates equal to 1), i.e. up to the context length $T$, but it is possible to clip it to some $p_{\max}$; this is chosen small in their experiments (see the usage sketch after this list).
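As a usage sketch of the two bullets above (again my own illustration, reusing the hypothetical `cope_logits` from the earlier snippet): each head keeps its own embedding table and computes its own gates and positions, and `p_max` can be much smaller than the context length.

```python
import torch

T, d_head, n_heads, p_max = 128, 64, 8, 16   # p_max kept small relative to T

# Per-head tensors: each head has its own queries, keys, and embedding table,
# so gates and position values are computed independently per head.
qs = [torch.randn(T, d_head) for _ in range(n_heads)]
ks = [torch.randn(T, d_head) for _ in range(n_heads)]
Es = [torch.randn(p_max + 1, d_head) for _ in range(n_heads)]

per_head_logits = [cope_logits(q, k, E, p_max) for q, k, E in zip(qs, ks, Es)]
```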