CoPE

Contextual Position Encoding (CoPE) computes context-dependent relative position values by modifying the attention mechanism. The modification has the same form as in relative PE, i.e. the $(i,j)$ entry of the attention matrix becomes $a_{ij} = \mathrm{softmax}\big(q_i^\top (k_j + W_P[:, p_{ij}])\big)$, where $W_P$ is a position embedding matrix of shape $(n_{\text{embedding}}, \text{context-length})$ and $p_{ij}$ is a "fractional" index. Typically $p_{ij}$ is an integer, so one simply plucks out a column of $W_P$. Here, however, $W_P[:, p_{ij}]$ is a vector obtained by interpolating between columns of $W_P$.
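To make the contrast concrete, here is a minimal PyTorch sketch (not the paper's code) of the logit for a single query/key pair; the dimensions and the integer position value are made-up placeholders.

```python
import torch

# Assumed shapes: q_i, k_j are (d,) vectors and W_P is (d, context_length),
# so W_P[:, p] is the embedding of position value p.
d, context_length = 64, 512
q_i, k_j = torch.randn(d), torch.randn(d)
W_P = torch.randn(d, context_length)

# Relative PE: the position is an integer, so we just pick a column.
p_int = 5                                    # e.g. i - j + 1
logit = q_i @ (k_j + W_P[:, p_int])

# CoPE replaces p_int with a context-dependent fractional value p_ij, so the
# column W_P[:, p_ij] has to be interpolated (defined in the next paragraph).
```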

The position of token $j$ with respect to token $i$, where $j < i$, is computed as follows. For every intermediate token $k$ with $j \le k \le i$ we compute a gate $g_{ik} = \sigma(q_i^\top k_k)$, which tells us "how much" token $k$ counts when measuring the position of token $j$ relative to token $i$; this is what makes the position values fractional and context-dependent. The overall context-dependent position value is $p_{ij} = \sum_{k=j}^{i} g_{ik}$. In, say, relative PE the value $p_{ij}$ would simply be $i - j + 1$, take integer values, and be used to pluck out a column of the position embedding matrix $W_P$ of shape $(n_{\text{embedding}}, \text{context-length})$. Here instead $p_{ij}$ is fractional, so we cannot index into $W_P$ directly. We therefore use an interpolation of the two relevant columns:

$$W_P[:, p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, W_P[:, \lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, W_P[:, \lfloor p_{ij} \rfloor].$$
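Below is a minimal PyTorch sketch of the whole computation for one head (gates, fractional positions, interpolation, logits); it is not the authors' implementation, and the shapes of `q`, `k`, `W_P` as well as the clipping to `p_max` are assumptions.

```python
import torch

def cope_logits(q, k, W_P):
    """Pre-softmax CoPE logits for one head.

    Assumed shapes: q, k are (T, d); W_P is (d, p_max + 1), column p being
    the embedding of position value p. Returns a (T, T) matrix whose (i, j)
    entry is q_i^T (k_j + W_P[:, p_ij]) for j <= i, and -inf otherwise.
    """
    T, d = q.shape
    p_max = W_P.shape[1] - 1

    # Content part of the logits: q_i^T k_j.
    content = q @ k.T                                    # (T, T)

    # Gates g_ik = sigmoid(q_i^T k_k), kept only for k <= i (causal).
    causal = torch.tril(torch.ones(T, T)).bool()
    gates = torch.sigmoid(content).masked_fill(~causal, 0.0)

    # p_ij = sum_{k=j}^{i} g_ik  -> reversed cumulative sum over k.
    p = gates.flip(-1).cumsum(-1).flip(-1)               # (T, T)
    p = p.clamp(max=p_max)                               # optional clipping

    # Interpolate between the two neighbouring integer positions.
    p_lo, p_hi = p.floor().long(), p.ceil().long()
    w = p - p_lo                                         # fractional part

    # q_i^T W_P[:, p] for every integer p, then interpolate the logits
    # (equivalent to interpolating the embeddings, since the logit is
    # linear in the embedding).
    pos_logits = q @ W_P                                 # (T, p_max + 1)
    pos_part = (1 - w) * pos_logits.gather(1, p_lo) + w * pos_logits.gather(1, p_hi)

    return (content + pos_part).masked_fill(~causal, float("-inf"))


# Usage with made-up sizes:
T, d, p_max = 16, 64, 8
q, k, W_P = torch.randn(T, d), torch.randn(T, d), torch.randn(d, p_max + 1)
attn = torch.softmax(cope_logits(q, k, W_P), dim=-1)     # (T, T) attention
```

Interpolating the logits $q_i^\top W_P[:, p]$ rather than the embedding vectors themselves gives the same result, because the logit is linear in the embedding, and avoids materialising a $(T, T, d)$ tensor.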

Nuances

  1. Each head performs CoPE independently.
  2. The maximum value of $p_{ij}$ is the context length, but it can be clipped as $p_{ij} = \min(p_{ij}, p_{\max})$; $p_{\max}$ is chosen to be small in their experiments. (See the shape sketch after this list.)
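As a small illustration of these two points, here is a sketch of the shapes they imply; the head count, head dimension and $p_{\max}$ are made-up values, not the paper's settings.

```python
import torch

n_heads, d_head, p_max = 8, 64, 64            # hypothetical values

# One independent position-embedding matrix per head (point 1). Clipping
# p_ij at a small p_max (point 2) means each matrix needs only p_max + 1
# columns rather than one column per context position.
W_P_per_head = [torch.randn(d_head, p_max + 1) for _ in range(n_heads)]
```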