
KL divergence asymmetry: mass-covering vs mode-seeking

What you are seeing: two probability distributions on the same axis. The thick blue curve is a fixed bimodal target $P$ (a mixture of two Gaussians centered at $\pm 2$). The orange curve is a unimodal approximation $Q$ (a single Gaussian) that you control.

KL divergence is not symmetric in general:

$$D(P \| Q) = \int p(x)\,\log \frac{p(x)}{q(x)}\,dx \neq D(Q \| P).$$

Minimizing $D(P \| Q)$ over $Q$ ("forward" KL, also called the M-projection or "moment matching") gives a $Q$ that covers all the mass of $P$, even if that means putting probability between the two modes, where $P$ has very little. Minimizing $D(Q \| P)$ ("reverse" KL, the objective used by standard variational inference) gives a $Q$ that, when the modes are well separated relative to their width, collapses onto a single mode and ignores the other one entirely. The two "optimal" $Q$'s are very different.
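
The "moment matching" label can be checked with a short derivation. As a sketch, write $Q = \mathcal{N}(\mu_q, \sigma_q^2)$ and let $P$ be the equal-weight mixture with modes at $\pm m$. The common component standard deviation $s$ is an assumption here, since the text does not state it. With $H(P)$ the entropy of $P$, which does not depend on $Q$:

$$
\begin{aligned}
D(P \| Q) &= -H(P) - \int p(x)\,\log q(x)\,dx \\
          &= -H(P) + \tfrac{1}{2}\log\!\left(2\pi\sigma_q^2\right) + \frac{\mathbb{E}_P\!\left[(x-\mu_q)^2\right]}{2\sigma_q^2}.
\end{aligned}
$$

Setting the derivatives with respect to $\mu_q$ and $\sigma_q^2$ to zero gives

$$
\mu_q^\star = \mathbb{E}_P[x] = 0, \qquad (\sigma_q^\star)^2 = \operatorname{Var}_P[x] = s^2 + m^2 .
$$

For $m = 2$ this means $\sigma_q^\star = \sqrt{s^2 + 4} > 2$: the forward-KL optimum sits between the two modes and is wide enough to reach both, which is the mass-covering behavior described above.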

Two reference markers in the inset boxes show, for the current $P$, where each KL is minimized over $(\mu_q, \sigma_q)$. Drag $Q$ to compare.
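
One way to locate such minimizers is a brute-force search over $(\mu_q, \sigma_q)$. The sketch below is not the page's implementation: the component width `s = 0.5`, the integration range, and the search ranges are all assumptions chosen for illustration, and the helper names are hypothetical. It works with log densities so that very small tail values do not produce divisions by zero.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log density of N(mu, sigma^2); supports NumPy broadcasting."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def log_mixture(x, m, s):
    """Log density of the equal-weight mixture of N(-m, s^2) and N(+m, s^2)."""
    return np.logaddexp(log_gauss(x, -m, s), log_gauss(x, m, s)) - np.log(2.0)

def kl(log_a, log_b, dx):
    """Riemann-sum estimate of D(A || B), integrating along the last axis."""
    return np.sum(np.exp(log_a) * (log_a - log_b), axis=-1) * dx

# 600-point integration grid; the range [-8, 8] is an assumption.
x = np.linspace(-8.0, 8.0, 600)
dx = x[1] - x[0]

# Target P: modes at +/-2 as in the text; s = 0.5 is an ASSUMED component width.
log_p = log_mixture(x, m=2.0, s=0.5)

# Candidate (mu_q, sigma_q) values; these search ranges are assumptions.
mus = np.linspace(-3.0, 3.0, 61)
sigmas = np.linspace(0.3, 3.0, 55)

# log_q has shape (len(mus), len(sigmas), len(x)).
log_q = log_gauss(x, mus[:, None, None], sigmas[None, :, None])

forward = kl(log_p, log_q, dx)  # D(P || Q) for every candidate Q
reverse = kl(log_q, log_p, dx)  # D(Q || P) for every candidate Q

i_f, j_f = np.unravel_index(np.argmin(forward), forward.shape)
i_r, j_r = np.unravel_index(np.argmin(reverse), reverse.shape)
mu_fwd, sigma_fwd = mus[i_f], sigmas[j_f]
mu_rev, sigma_rev = mus[i_r], sigmas[j_r]

print(f"forward KL minimizer: mu_q={mu_fwd:.2f}, sigma_q={sigma_fwd:.2f}")
print(f"reverse KL minimizer: mu_q={mu_rev:.2f}, sigma_q={sigma_rev:.2f}")
```

Under these assumptions the forward minimizer should land near $\mu_q = 0$ with a wide $\sigma_q$, while the reverse minimizer should sit on one of the two modes with $\sigma_q$ close to the component width. Because $P$ is symmetric, which of the two modes the search reports is arbitrary. With wider components the reverse-KL optimum may stay centered instead of collapsing, so the outcome depends on the assumed `s`.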

Figure 1. Forward KL (mass-covering) vs reverse KL (mode-seeking) on a bimodal target. Method: numerical integration on a 600-point grid.
Q mu: 0.00
Q sigma: 2.50
mode sep: 2.0
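
To see the asymmetry at the control values shown above, both divergences can be evaluated for a single $Q$ by numerical integration on a 600-point grid. This is a minimal sketch, not the page's code: `kl_pair` is a hypothetical helper, the component width `s = 0.5` and the range `[-12, 12]` are assumptions, and "mode sep 2.0" is read as modes at $\pm 2$.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_pair(mu_q, sigma_q, m, s, n=600, half_width=12.0):
    """Return (D(P||Q), D(Q||P)) via a Riemann sum on an n-point grid.

    P is the equal-weight mixture of N(-m, s^2) and N(+m, s^2);
    Q is N(mu_q, sigma_q^2). half_width is an assumed integration range.
    """
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    log_p = np.logaddexp(log_gauss(x, -m, s), log_gauss(x, m, s)) - np.log(2.0)
    log_q = log_gauss(x, mu_q, sigma_q)
    forward = np.sum(np.exp(log_p) * (log_p - log_q)) * dx
    reverse = np.sum(np.exp(log_q) * (log_q - log_p)) * dx
    return float(forward), float(reverse)

# Control values from the readouts; s = 0.5 is an ASSUMED component width.
fwd, rev = kl_pair(mu_q=0.0, sigma_q=2.5, m=2.0, s=0.5)
print(f"D(P||Q) = {fwd:.3f}, D(Q||P) = {rev:.3f}")
```

A wide, centered $Q$ like this one is penalized much more heavily by the reverse direction, because it places mass in the gap and the tails where $P$ is nearly zero.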

WHAT TO TRY

  • Vary each control (Q mu, Q sigma, mode sep) and watch the rail readouts respond.
  • Compare the diagnostic plot against the live scene.