This demo shows how the representation of a query token is updated by an attention mechanism, based on its affinity to the other tokens and on the specified value transformation \(v(x)\).
Main update equation (how the representation of a given input token is transformed):
\[ x_q' = \sum_{x_k \in M(x_q, S_{x_k})} a(x_q, x_k; \theta_a)\, v(x_k; \theta_v) \]
Here, \(S_{x_k}\) is the set of all key tokens, that is, all tokens forming the input. \(M(x_q, S_{x_k})\) is the masking or neighbourhood set for a given query token \(x_q\): the subset of tokens it is allowed to attend to. The affinity function \(a(x_q, x_k; \theta_a)\) produces a similarity score between the query token \(x_q\) and key token \(x_k\); it is learnable with parameters \(\theta_a\). The function \(v(x_k; \theta_v)\) maps token \(x_k\) into the transformed value space and is also learnable, with parameters \(\theta_v\). The updated query representation \(x_q'\) is therefore a weighted sum of transformed value vectors, with weights given by the affinities between tokens; when the affinities are normalized to sum to one (e.g., via a softmax), this sum is a weighted average.
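The update equation can be sketched in NumPy. The concrete forms chosen below are assumptions, not part of the equation itself: the affinity \(a\) is taken to be a softmax over dot products of learned projections, \(v\) a learned linear map, and the neighbourhood \(M\) a boolean mask over the key tokens.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(x_q, keys, mask, W_q, W_k, W_v):
    """One attention update x_q' = sum_k a(x_q, x_k) v(x_k).

    W_q, W_k play the role of theta_a (affinity parameters),
    W_v the role of theta_v (value parameters). All assumed forms.
    """
    q = W_q @ x_q
    # affinity a(x_q, x_k): scaled dot product of projected tokens
    scores = np.array([q @ (W_k @ x_k) for x_k in keys]) / np.sqrt(len(q))
    # M(x_q, S): tokens outside the neighbourhood get -inf -> zero weight
    scores = np.where(mask, scores, -np.inf)
    a = softmax(scores)
    # weighted average of transformed values v(x_k) = W_v x_k
    return sum(a_k * (W_v @ x_k) for a_k, x_k in zip(a, keys))

rng = np.random.default_rng(0)
d = 4
tokens = rng.normal(size=(3, d))                 # three input tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
mask = np.array([True, True, False])             # query may not attend to token 3
x_q_new = attend(tokens[0], tokens, mask, W_q, W_k, W_v)
print(x_q_new.shape)  # (4,)
```

Masked-out tokens receive a score of \(-\infty\), so the softmax assigns them zero weight and the sum runs effectively only over the neighbourhood set.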