
Attention as soft retrieval

Single-head dot-product attention over a small key-value bank. Each key is a point in 2D; values are scalars rendered as bar heights. Drag the query $(q_x, q_y)$ around the $(k_x, k_y)$ plane to recompute the attention weights, then watch the output bar settle at the attention-weighted average of the value bars. As temperature $\tau$ drops to zero the attention collapses to one-hot retrieval of the key with the largest dot product with the query (this is the nearest key only when all keys have the same norm); as $\tau$ grows the distribution flattens toward uniform.
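For reference, the computation described above can be written out as follows. The symbols $k_i$, $v_i$, $w_i$, and $o$ (keys, values, weights, output) are introduced here for illustration, and dividing the scaled logits by $\tau$ is an assumed convention; the page does not state exactly where the temperature enters.

$$
w_i = \frac{\exp\big(q \cdot k_i / (\tau \sqrt{d})\big)}{\sum_j \exp\big(q \cdot k_j / (\tau \sqrt{d})\big)}, \qquad o = \sum_i w_i \, v_i, \qquad d = 2 .
$$

Under this form, $\tau \to 0$ puts all the weight on the key with the largest $q \cdot k_i$ (assuming no ties), and $\tau \to \infty$ drives every logit to zero, so $w_i \to 1/n$ for a bank of $n$ keys.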

Figure 1. Soft retrieval via scaled dot-product attention with temperature. Method: standard softmax with the $1/\sqrt{d}$ scale ($d = 2$ here).
Controls (initial values): $\tau = 0.50$, $q_x = 0.00$, $q_y = 0.00$
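A minimal Python sketch of what the figure computes, for readers who want to reproduce the numbers offline. The function name `soft_retrieve`, the four-key bank `KEYS`/`VALUES`, and the choice to divide the scaled logits by `tau` are all assumptions made for illustration; they are not taken from the page's actual implementation.

```python
import math

def soft_retrieve(query, keys, values, tau=0.5):
    """Return (weights, output) for single-head dot-product attention.

    Assumed convention: logits are (q . k) / (sqrt(d) * tau).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    d = len(query)
    scale = math.sqrt(d) * tau
    # Scaled dot-product logit for each key.
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted average of the scalar values.
    output = sum(w * v for w, v in zip(weights, values))
    return weights, output

# Hypothetical key-value bank: four unit-norm keys with scalar values.
KEYS = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
VALUES = [1.0, 2.0, 3.0, 4.0]

if __name__ == "__main__":
    for tau in (0.01, 0.5, 1000.0):
        weights, output = soft_retrieve((1.0, 0.0), KEYS, VALUES, tau=tau)
        print(tau, [round(w, 3) for w in weights], round(output, 3))
```

One thing the sketch makes visible: at the initial query $(0, 0)$ every dot product is zero, so under this convention the weights are uniform for any $\tau$. The temperature only has a visible effect once the query moves away from the origin.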

WHAT TO TRY

  • Vary each control ($\tau$, $q_x$, $q_y$) and watch the rail readouts respond.
  • Compare the diagnostic plot against the live scene.