KL Divergence Asymmetry (Mass-Covering vs Mode-Seeking)

What you are seeing: two probability distributions on the same axis. The thick blue curve is a fixed bimodal target $P$ (a mixture of two Gaussians centered at $\pm 2$). The orange curve is a unimodal approximation $Q$ (a single Gaussian) that you control.

KL divergence is not symmetric: in general, $D(P \| Q) = \int p(x)\,\log \frac{p(x)}{q(x)}\,dx \neq D(Q \| P).$ Minimizing $D(P \| Q)$ over $Q$ (forward KL, also called the M-projection or moment projection; for a Gaussian $Q$ it reduces to moment matching) gives a $Q$ that covers all the mass of $P$, even if that means putting probability between the two modes, where $P$ has very little. Minimizing $D(Q \| P)$ (reverse KL, the objective in standard variational inference) gives a $Q$ that, when the modes are well separated, collapses onto a single mode and effectively ignores the other. The two "optimal" $Q$'s are very different.
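A short worked derivation may help show why forward KL amounts to moment matching for a Gaussian $Q$. It assumes, hypothetically, that the two components of $P$ have equal weight and a common standard deviation $s$, with means at $\pm m$; only $m = 2$ is fixed by the description above. Substituting $\log q(x) = -\tfrac{1}{2}\log(2\pi\sigma_q^2) - \frac{(x - \mu_q)^2}{2\sigma_q^2}$ into the definition gives

$$
D(P \| Q) = -H(P) + \frac{1}{2}\log\!\left(2\pi\sigma_q^2\right) + \frac{\mathbb{E}_P\!\left[(x - \mu_q)^2\right]}{2\sigma_q^2},
\qquad H(P) = -\int p(x)\,\log p(x)\,dx .
$$

Only the last two terms depend on $Q$. Setting the derivatives with respect to $\mu_q$ and $\sigma_q^2$ to zero yields

$$
\mu_q^\star = \mathbb{E}_P[x] = 0,
\qquad
(\sigma_q^\star)^2 = \mathrm{Var}_P[x] = s^2 + m^2 .
$$

With $m = 2$ this gives $\sigma_q^\star = \sqrt{s^2 + 4} \ge 2$, so under these assumptions the forward-KL optimum is centered in the valley and is at least as wide as the distance from the center to either mode. Reverse KL has no comparable closed form for a mixture target and is usually minimized numerically.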

Two reference markers in the inset boxes show, for the current $P$, where each KL is minimized over $(\mu_q, \sigma_q)$. Drag $Q$ to compare.

Figure 1. Forward KL (mass-covering) vs reverse KL (mode-seeking) on a bimodal target. Method: numerical integration on a 600-point grid.
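The numerical integration mentioned in the caption can be sketched roughly as follows. This is a minimal illustration, not the figure's actual code: the grid range `[-10, 10]`, the component standard deviation `comp_sigma = 0.5`, the equal mixture weights, and all function names are assumptions made for the example. Working with log-densities avoids `log(0)` where a density underflows.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def log_mixture(x, sep, comp_sigma):
    """Log-density of an equal-weight mixture of N(-sep, .) and N(+sep, .).

    comp_sigma is an assumed value; the source does not state it.
    """
    return np.logaddexp(log_gaussian(x, -sep, comp_sigma),
                        log_gaussian(x, sep, comp_sigma)) - np.log(2.0)

def kl(log_a, log_b, dx):
    """Riemann-sum estimate of KL(A || B) from log-densities on a uniform grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

# 600-point grid, as in the caption; the range is an assumption.
x = np.linspace(-10.0, 10.0, 600)
dx = x[1] - x[0]

log_p = log_mixture(x, sep=2.0, comp_sigma=0.5)   # bimodal target P
log_q = log_gaussian(x, mu=0.0, sigma=2.5)        # single Gaussian Q

forward_kl = kl(log_p, log_q, dx)   # KL(P || Q), mass-covering direction
reverse_kl = kl(log_q, log_p, dx)   # KL(Q || P), mode-seeking direction

print(f"KL(P||Q) = {forward_kl:.3f}")
print(f"KL(Q||P) = {reverse_kl:.3f}")
```

Swapping the two arguments of `kl` is all it takes to change direction, which makes the asymmetry easy to check for any choice of `mu` and `sigma`.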

Controls (values shown): Q mu = 0.00, Q sigma = 2.50, mode sep = 2.0

WHAT TO TRY

Slide the single Gaussian Q across the bimodal target P: KL(Q||P) is minimized by Q sitting on one mode (mode-seeking), while KL(P||Q) prefers Q spanning both (mass-covering). The asymmetry is the lesson.
Widen Q sigma: KL(P||Q) rewards covering both modes even at the cost of placing mass in the valley. KL(Q||P) instead penalizes any Q mass that lands where P is small, which is why reverse-KL variational inference collapses onto one mode.
Increase the mode separation: the two KL directions disagree more strongly, since no single Gaussian can be both narrow on one peak and broad across both.
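The experiments above can also be reproduced offline with a brute-force search over $(\mu_q, \sigma_q)$. The sketch below is one plausible way to locate the two minimizers, not necessarily how the reference markers are computed. The search ranges, the grid range, the component standard deviation `comp_sigma = 0.5`, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def log_mixture(x, sep, comp_sigma):
    """Log-density of an equal-weight mixture of N(-sep, .) and N(+sep, .)."""
    return np.logaddexp(log_gaussian(x, -sep, comp_sigma),
                        log_gaussian(x, sep, comp_sigma)) - np.log(2.0)

def kl(log_a, log_b, dx):
    """Riemann-sum estimate of KL(A || B) from log-densities on a uniform grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

def find_minimizers(sep=2.0, comp_sigma=0.5):
    """Grid-search the (mu, sigma) that minimize each KL direction.

    comp_sigma and the search ranges are assumed values, not from the source.
    Returns {"forward": (kl, mu, sigma), "reverse": (kl, mu, sigma)}.
    """
    x = np.linspace(-10.0, 10.0, 600)
    dx = x[1] - x[0]
    log_p = log_mixture(x, sep, comp_sigma)

    best_fwd = (np.inf, None, None)
    best_rev = (np.inf, None, None)
    for mu in np.linspace(-4.0, 4.0, 81):          # step 0.1
        for sigma in np.linspace(0.3, 4.0, 75):    # step 0.05
            log_q = log_gaussian(x, mu, sigma)
            fwd = kl(log_p, log_q, dx)             # KL(P || Q)
            rev = kl(log_q, log_p, dx)             # KL(Q || P)
            if fwd < best_fwd[0]:
                best_fwd = (fwd, float(mu), float(sigma))
            if rev < best_rev[0]:
                best_rev = (rev, float(mu), float(sigma))
    return {"forward": best_fwd, "reverse": best_rev}

result = find_minimizers()
for name, (value, mu, sigma) in result.items():
    print(f"{name}: KL = {value:.3f} at mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Under these assumptions the forward minimizer should sit near $\mu_q = 0$ with a width close to $\sqrt{\texttt{comp\_sigma}^2 + \texttt{sep}^2}$, while the reverse minimizer should sit on one of the two modes with a width close to `comp_sigma`. Which of the two modes wins is decided by floating-point ties, since the target is symmetric. Calling `find_minimizers` with a larger `sep` is a quick way to watch the two answers move apart.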