KL divergence asymmetry: mass-covering vs mode-seeking
What you are seeing: two probability distributions on the same axis. The thick blue curve is a fixed bimodal target P (a mixture of two Gaussians). The orange curve is a unimodal approximation Q (a single Gaussian) that you control.
KL divergence is not symmetric: in general KL(P || Q) is not equal to KL(Q || P). Minimizing KL(P || Q) over Q ("forward" KL, also called the M-projection or "moment matching") gives a Q that covers all the mass of P, even if that means putting probability between the two modes, where P has almost none. Minimizing KL(Q || P) over Q ("reverse" KL, used by standard variational inference) gives a Q that collapses onto a single mode and largely ignores the other. The two "optimal" Q's are very different.
Two reference markers in the inset boxes show, for the current P, where each KL is minimized over the parameters of Q. Drag Q to compare.
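The two minimizers can also be reproduced numerically. The sketch below is not the demo's own code. It assumes an equal-weight mixture with unit-variance components at -3 and +3 (the demo's actual mode positions are not given here), approximates both integrals with a Riemann sum, and picks the best Gaussian Q by brute-force search over a coarse grid. The names forward_kl, reverse_kl and argmin_on_grid are hypothetical.

```python
import math

M = 3.0                                   # assumed mode offset: P has modes at -M and +M
DX = 0.05                                 # integration step
XS = [-20.0 + DX * i for i in range(801)] # integration grid on [-20, 20]

def log_normal(x, mu, sigma):
    """Log-density of a Gaussian N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_p(x):
    """Log-density of the assumed target: 0.5 * N(-M, 1) + 0.5 * N(M, 1)."""
    a, b = log_normal(x, -M, 1.0), log_normal(x, M, 1.0)
    hi = max(a, b)                        # log-sum-exp for numerical stability
    return hi + math.log(0.5 * math.exp(a - hi) + 0.5 * math.exp(b - hi))

LOG_P = [log_p(x) for x in XS]

def forward_kl(mu, sigma):
    """KL(P || Q): the integrand is weighted by p, so Q is punished wherever P has mass."""
    return DX * sum(math.exp(lp) * (lp - log_normal(x, mu, sigma))
                    for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """KL(Q || P): the integrand is weighted by q, so Q is punished wherever P has none."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

def argmin_on_grid(kl):
    """Brute-force search: mu in [-4, 4], sigma in [0.5, 4], both in steps of 0.25."""
    candidates = [(-4.0 + 0.25 * i, 0.5 + 0.25 * j)
                  for i in range(33) for j in range(15)]
    return min(candidates, key=lambda ms: kl(*ms))

fwd_best = argmin_on_grid(forward_kl)     # expected to be wide and centered between the modes
rev_best = argmin_on_grid(reverse_kl)     # expected to be narrow and sitting on one mode
print("forward KL minimizer (mu, sigma):", fwd_best)
print("reverse KL minimizer (mu, sigma):", rev_best)
```

Under these assumptions the forward minimizer should land near mu = 0 with a standard deviation close to sqrt(1 + M^2), which is exactly what matching the mean and variance of P gives. The reverse minimizer should land near one of the two modes with a standard deviation close to 1. Because P is symmetric, which mode wins is a numerical tie-break.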
WHAT TO TRY
- Vary each control and watch the rail readouts respond.
- Compare the diagnostic plot against the live scene.