This demo shows how the representation of a query token changes under an attention mechanism, depending on its relation to the other tokens and on the chosen value transformation \(v(x)\).

Main update equation (how the representation of a given input token is transformed):

\[ x_q' = \sum_{x_k \in M(x_q, S_{x_k})} a(x_q, x_k; \theta_a)\, v(x_k; \theta_v) \]

Here, \(S_{x_k}\) is the set of all key tokens, i.e. all tokens forming the input. \(M(x_q, S_{x_k})\) is the masking or neighbourhood set of a given query token \(x_q\): the subset of tokens it is allowed to attend to. The affinity function \(a(x_q, x_k; \theta_a)\) produces a similarity score between the query token \(x_q\) and a key token \(x_k\), and is learnable with parameters \(\theta_a\). The value function \(v(x_k; \theta_v)\) maps token \(x_k\) into the transformed value space and is likewise learnable, with parameters \(\theta_v\). The updated query representation \(x_q'\) is therefore a weighted average of transformed value vectors, with weights given by the attention (affinity) values between tokens; when the affinities are normalised to sum to 1, this is a convex combination of the \(v(x_k)\).
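The update equation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual implementation: it assumes a specific affinity \(a\) (softmax over negative squared distances, so closer tokens get more attention) and a specific value map \(v(x) = W_v x + b_v\); the demo's learnable choices may differ.

```python
import numpy as np

def attention_update(x_q, keys, W_v, b_v, temperature=1.0):
    """Compute x_q' = sum_k a(x_q, x_k) * v(x_k) over all key tokens.

    Assumed (not specified by the demo): a is a softmax over negative
    squared distances; v is the affine map v(x) = W_v @ x + b_v.
    """
    # Affinity scores: larger when x_k is closer to x_q.
    scores = -np.sum((keys - x_q) ** 2, axis=1) / temperature
    # Softmax (shifted for numerical stability) -> weights sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Transformed values v(x_k) for every key token.
    values = keys @ W_v.T + b_v
    # Weighted average of value vectors = updated query representation.
    return weights @ values

# Toy 2-D example: five key tokens, v doubles every coordinate.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 2))            # all input tokens (keys)
W_v, b_v = 2.0 * np.eye(2), np.zeros(2)   # hypothetical value map
x_q_new = attention_update(keys[0], keys, W_v, b_v)
```

Here `M(x_q, S_{x_k})` is taken to be the full token set (no masking); restricting `keys` to a subset would implement a neighbourhood or causal mask.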

(c) Fayyaz Minhas
1. Tokens in original feature space
Click anywhere in this panel to place the query token \(x_q\). Black links show the attention strengths to the other tokens.
2. Tokens in the transformed (value) space
This panel shows where the tokens move after applying the value transformation \(v(x)\).
3. Attention as springs in value space
Each transformed token \(v(x_k)\) acts like an anchor. Stronger attention produces a thicker spring pulling the query toward that anchor. The self-spring, connecting the query to its own transformed value, is shown separately in purple.