How expressive are Bernoulli mixture models?
Bernoulli distributions
Let \(p\in (0, 1)\) be a parameter and \(\mathrm{Ber}(p)\) be the Bernoulli distribution, with \(\mathrm{Ber}(y\mid p) = p^y(1-p)^{1-y}\) for \(y\in \{0, 1\}\). Let’s define a slightly more general family of distributions, \(\overline{\mathrm{Ber}}(p)\), for \(p\in [0, 1]\), as follows: \[ \overline{\mathrm{Ber}}(p) = p\delta_1 + (1-p)\delta_0 = \begin{cases} \delta_0 &\text{ if } p=0\\ \delta_1 &\text{ if } p=1\\ \mathrm{Ber}(p) &\text{otherwise} \end{cases} \]
Namely, the \(\mathrm{Ber}\) distributions model a procedure in which a possibly biased (e.g., bent) two-sided coin is tossed, while \(\overline{\mathrm{Ber}}\) distributions also allow tossing one-sided coins. As we have already discussed, \(\overline{\mathrm{Ber}}\) distributions are the most general probability distributions on the space \(\{0, 1\}\). Similarly, \(\mathrm{Ber}\) distributions are the most general probability distributions on \(\{0, 1\}\) with full support, allowing for both outcomes to happen.
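As a quick illustration, here is a minimal numpy sketch of sampling from \(\overline{\mathrm{Ber}}(p)\); the function name `sample_ber_bar` is just a placeholder. For \(p\in\{0, 1\}\) it reproduces the one-sided coins \(\delta_0\) and \(\delta_1\), and for \(p\in(0, 1)\) it is an ordinary Bernoulli sampler.

```python
import numpy as np

def sample_ber_bar(p: float, size: int, rng=None) -> np.ndarray:
    """Draw samples from Ber-bar(p) = p*delta_1 + (1-p)*delta_0 for p in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    # p = 0 gives all zeros (delta_0), p = 1 gives all ones (delta_1),
    # and p in (0, 1) is the usual Bernoulli coin toss.
    return (rng.uniform(size=size) < p).astype(int)

print(sample_ber_bar(0.0, 5))  # [0 0 0 0 0]
print(sample_ber_bar(0.3, 5))  # e.g. [0 1 0 0 1]
```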
This leads to the question: is it very problematic if we use \(\mathrm{Ber}\) distributions to model a data generating process which, in fact, belongs to the \(\overline{\mathrm{Ber}}\) family? We can expect that \(\mathrm{Ber}(u)\) for \(u \ll 1\) is a rather good model for \(\overline{\mathrm{Ber}}(0)\) (and, similarly, \(\mathrm{Ber}(1-u)\) should be a good model for \(\overline{\mathrm{Ber}}(1)\)).
Sometimes, when we program simulators, we may want to distinguish the atomic distribution \(\delta_0\) from \(\mathrm{Ber}(u)\) with small \(u\), because some constraints have to be enforced explicitly: failing to enforce them can break the simulator in a specific place.
However, if we think about a modelling problem, where we observe some data points \(y_1, \dotsc, y_N\) and want to learn the underlying distribution, we may prefer to leave some probability that the future outcome may be non-zero, even if all the data points seen so far are zeros.
Today we will focus on the second kind of problem, where we use \(\mathrm{Ber}\) distributions, rather than \(\overline{\mathrm{Ber}}\), either for the reason outlined above or for convenience: \(\mathrm{Ber}\) distributions never assign zero probability to any data point.
Approximation bounds
From now on we assume that \(\mathcal Y\) is a finite space and let \(P\) and \(Q\) be two probability distributions on it.
Total variation distance
The total variation distance is defined as \[ \mathrm{TV}(P, Q) = \max_{A \subseteq \mathcal Y} |P(A) - Q(A)|. \]
It’s quite interesting that \(\mathrm{TV}(P, Q)\) is essentially a scaled version of the \(L_1\) norm: \[ \mathrm{TV}(P, Q) = \frac{1}{2} | P-Q |_{1} = \frac 12 \sum_{y\in \mathcal Y} | P(y) - Q(y)|. \]
The proof essentially follows by defining a signed measure \(\mu = P - Q\) and building the Hahn decomposition: let \(A^+ = \{y\in \mathcal Y\mid \mu(y) \ge 0\}\) and \(A^- = \mathcal Y- A^+ = \{y\in \mathcal Y\mid \mu(y) < 0\}\). We have \[ 0 = P(\mathcal Y) - Q(\mathcal Y) = \mu(\mathcal Y) = \mu(A^+) + \mu(A^-), \]
so that \(\mu(A^+) = -\mu(A^-)\).
Now for every \(A\subseteq \mathcal Y\) we have \[ \mu(A) = \mu(A\cap A^+) + \mu(A\cap A^-) \le \mu(A^+). \]
Similarly, \(-\mu(A) \le -\mu(A^-) = \mu(A^+)\). Hence, we see that \(A^+\) can be taken as an optimal \(A\) when maximising \(|P(A) - Q(A)|\) and \[ \mathrm{TV}(P, Q) = \sum_{y : P(y) \ge Q(y) } \left(P(y) - Q(y)\right). \]
We now have \[\begin{align*} \mathrm{TV}(P, Q) &= \mu(A^+) \\ &= \frac{1}{2}( \mu(A^+) - \mu(A^-) ) \\ &= \frac{1}{2} \left( \sum_{y \in A^+ } \mu(y) + \sum_{y\in A^-} (-\mu(y)) \right) \\ &= \frac 12 \sum_{y\in \mathcal Y} |\mu(y)| = \frac 12 |\mu|_1. \end{align*} \]
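Before moving on, here is a quick numerical sanity check of this identity (a sketch assuming numpy; the brute force over all subsets is only feasible because \(\mathcal Y\) is tiny here).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6                                  # a small finite space Y = {0, ..., 5}
P = rng.dirichlet(np.ones(n))          # two random probability vectors on Y
Q = rng.dirichlet(np.ones(n))

# Definition: maximum of |P(A) - Q(A)| over all subsets A of Y.
tv_max = max(
    abs(P[list(A)].sum() - Q[list(A)].sum())
    for r in range(n + 1)
    for A in itertools.combinations(range(n), r)
)

# Identity: half of the L1 distance between the probability vectors.
tv_l1 = 0.5 * np.abs(P - Q).sum()

print(tv_max, tv_l1)                   # the two numbers agree up to float error
```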
Kullback-Leibler divergence
If \(P\ll Q\) (i.e., whenever \(Q(y) = 0\) we also have \(P(y) = 0\); this condition holds trivially when \(Q(y) > 0\) for all \(y\in \mathcal Y\)), we can also define the Kullback-Leibler divergence: \[ \mathrm{KL}(P\parallel Q) = \sum_{y\in \mathcal Y} P(y) \log \frac{P(y)}{Q(y)}, \]
under the convention that \(0\log 0 = 0 \log \frac{0}{0} = 0\). If \(P \not\ll Q\), we define \(\mathrm{KL}(P\parallel Q) = +\infty\).
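In code, both conventions can be handled explicitly. Below is a minimal numpy sketch; `kl_divergence` is just an illustrative name.

```python
import numpy as np

def kl_divergence(P, Q) -> float:
    """KL(P || Q) for probability vectors on a finite space."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    if np.any((Q == 0) & (P > 0)):     # P is not absolutely continuous w.r.t. Q
        return np.inf
    mask = P > 0                       # implements the convention 0 * log 0 = 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

print(kl_divergence([1.0, 0.0], [0.9, 0.1]))  # finite: -log(0.9)
print(kl_divergence([0.9, 0.1], [1.0, 0.0]))  # +inf, since P is not << Q
```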
Pinsker’s inequalities
Pinsker’s inequality says that for arbitrary \(P\) and \(Q\) \[ \mathrm{TV}(P, Q) \le \sqrt{\frac 12 \mathrm{KL}(P\parallel Q)} \]
or, equivalently, \[ \mathrm{TV}(P, Q)^2 \le \frac{1}{2} \mathrm{KL}(P\parallel Q). \]
This inequality also generalises to infinite and not necessarily discrete spaces and has a beautiful proof due to David Pollard.
However, as we work with finite spaces, we also have the inverse Pinsker’s inequality (see Lemma 4.1 here or Lemma 2 there) at our disposal: let \(P\ll Q\) and \(\alpha_Q = \min_{y\colon Q(y) > 0} Q(y)\). Then, we have \[ \mathrm{KL}(P \parallel Q) \le \frac{4}{\alpha_Q} \mathrm{TV}(P, Q)^2. \]
What is important here is that \(\alpha_Q\) depends on \(Q\).
It is perhaps instructive to recall an elementary proof of this result (see Lemma 2): \[ \begin{align*} \mathrm{KL}(P\parallel Q)&\le \chi^2(P\parallel Q) \\ & := \sum_{x} \frac{ (P(x) - Q(x))^2 }{Q(x)} \\ &\le \frac{1}{\alpha_Q} \left(\sum_x | P(x) - Q(x) |\right)^2 \\ &= \frac{1}{\alpha_Q} |P-Q|_1^2, \end{align*} \]
where the inequality between the KL and \(\chi^2\) divergences is well-known (see e.g., here): for \(x > 0\) we have \(\log x \le x-1\) (the concave function \(\log\) lies below its tangent line at \(x=1\)), so that \[ \begin{align*} \mathrm{KL}(P\parallel Q) &= \sum_x P(x)\log \frac{P(x)}{Q(x)} \\&\le \sum_{x} P(x) \left( \frac{P(x)}{Q(x)}-1 \right) \\ &= \sum_x \left(\frac{ P^2(x)}{Q(x)} + Q(x) - 2P(x)\right) \\ &= \sum_x \frac{ (P(x) - Q(x))^2 }{Q(x)} \\ &= \chi^2(P\parallel Q), \end{align*} \] where in the third line we used \(\sum_x P(x) = \sum_x Q(x) = 1\).
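Both Pinsker’s inequality and its inverse (together with the intermediate \(\chi^2\) bound) are easy to check numerically. A small self-contained sketch with numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(8))           # two random fully-supported distributions
Q = rng.dirichlet(np.ones(8))

tv = 0.5 * np.abs(P - Q).sum()
kl = np.sum(P * np.log(P / Q))
chi2 = np.sum((P - Q) ** 2 / Q)
alpha_Q = Q.min()

print(tv <= np.sqrt(0.5 * kl))          # Pinsker's inequality
print(kl <= chi2)                       # KL bounded by the chi-squared divergence
print(kl <= 4.0 / alpha_Q * tv ** 2)    # inverse Pinsker's inequality
```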
Approximating point distributions with Bernoulli distributions
Let’s see how well we can approximate the point distribution \(\delta_0\) with \(\mathrm{Ber}(\epsilon)\) for \(\epsilon > 0\). We have \[ \mathrm{TV}( \mathrm{Ber}(\epsilon), \delta_0 ) = \frac{1}{2}\left( |\epsilon| + |1 - (1-\epsilon)| \right) = \epsilon, \]
meaning that this discrepancy can be made arbitrarily small.
When we look for a maximum likelihood solution (or employ Bayesian inference), we are interested in \(\mathrm{KL}(\delta_0 \parallel \mathrm{Ber}(\epsilon))\), which is finite because \(\mathrm{Ber}(\epsilon)\) has full support. We can calculate this quantity exactly: \[ \mathrm{KL}(\delta_0 \parallel \mathrm{Ber}(\epsilon)) = \log \frac{1}{1-\epsilon} = -\log(1-\epsilon) = \epsilon + \frac{\epsilon^2}2 + \frac{\epsilon^3}{3} + \cdots, \]
which can also be made arbitrarily small. Namely, for every desired \(\ell > 0\) we can find \(\tilde \epsilon > 0\) such that for all \(\epsilon < \tilde \epsilon\) we have \(\mathrm{KL}(\delta_0\parallel \mathrm{Ber}(\epsilon)) < \ell\). This is useful for establishing the KL support condition on the prior, which appears in Schwartz’s theorem.
On the other hand, we see that because \(\mathrm{Ber}(\epsilon)\not\ll \delta_0\), we have \[ \mathrm{KL}(\mathrm{Ber}(\epsilon) \parallel \delta_0 ) = +\infty, \]
meaning that variational inference with arbitrary \(\epsilon\) is always an “equally bad” approximation. Intuitively and very informally speaking, variational inference encourages approximations with lighter tails than the ground-truth distribution and it’s not possible to have lighter “tails” than a point mass!
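A few concrete numbers illustrate the asymmetry (a minimal sketch; the closed-form expressions are the ones derived above):

```python
import numpy as np

for eps in [0.1, 0.01, 0.001]:
    tv = eps                            # TV(Ber(eps), delta_0) = eps
    kl_forward = -np.log(1.0 - eps)     # KL(delta_0 || Ber(eps)) = -log(1 - eps) ~ eps
    print(f"eps={eps}: TV={tv}, KL(delta_0 || Ber(eps))={kl_forward:.6f}")

# KL(Ber(eps) || delta_0) = +inf for every eps, since Ber(eps) puts mass on y = 1
# while delta_0 does not.
```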
Moving to higher dimensions
Let’s now think about probability distributions on the space of binary vectors, \(\mathcal Y= \{0, 1\}^G\). Similarly as above, all distributions with full support on a finite set are categorical distributions (in this case the probability vector has \(2^G\) components, with \(2^G-1\) free parameters due to the usual constraint of summing up to one). Removing the full support requirement we obtain discrete distributions with at most \(2^G\) atoms.
Let’s consider perhaps the simplest distribution we could use to model the data points in the \(\mathcal Y\) space: \[ \mathrm{Ber}^{G}(y \mid p) = \prod_{g=1}^{G} \mathrm{Ber}(y_g \mid p_g). \]
In this case, we assume that each entry is the outcome of a coin toss and that the coin tosses are independent (even though the coins do not have to be identical; for example, we allow \(p_1 \neq p_2\)).
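A minimal numpy sketch of this product likelihood (the name `ber_prod_logpmf` is just illustrative):

```python
import numpy as np

def ber_prod_logpmf(y, p) -> float:
    """log Ber^G(y | p) = sum_g [ y_g log p_g + (1 - y_g) log(1 - p_g) ]."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log1p(-p)))

# Example with G = 3: independent but not identical coins.
print(np.exp(ber_prod_logpmf([1, 0, 1], [0.9, 0.2, 0.5])))  # 0.9 * 0.8 * 0.5 = 0.36
```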
As such, this distribution is not particularly expressive and will not be suitable to model dependencies between different entries, which are very important when modelling spin lattices or genomic data. We generally may prefer to use a more expressive distribution, such as energy-based models (including the Ising model) or mixture models.
Let’s focus on mixture models. In a model with \(K\) components, we have parameters \(p\in (0, 1)^{K\times G}\) and \(u \in \Delta^{K-1}\) resulting in the distribution:
\[ \mathrm{BerMix}(p, u) = \sum_{k=1}^K u_k\mathrm{Ber}^{G}(p_k), \] i.e., \[ \mathrm{BerMix}(y\mid p, u) = \sum_{k=1}^K u_k\mathrm{Ber}^{G}(y\mid p_k). \]
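A sketch of the corresponding likelihood evaluation (using scipy’s `logsumexp` for numerical stability; `bermix_logpmf` is just an illustrative name):

```python
import numpy as np
from scipy.special import logsumexp

def bermix_logpmf(y, p, u) -> float:
    """log BerMix(y | p, u) for component parameters p of shape (K, G) and weights u of shape (K,)."""
    y, p, u = np.asarray(y, dtype=float), np.asarray(p, dtype=float), np.asarray(u, dtype=float)
    # Log-pmf of each of the K components, shape (K,).
    comp = np.sum(y * np.log(p) + (1.0 - y) * np.log1p(-p), axis=1)
    return float(logsumexp(comp + np.log(u)))

# Example: K = 2 components over G = 3 bits.
p = np.array([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]])
u = np.array([0.5, 0.5])
print(np.exp(bermix_logpmf([1, 1, 1], p, u)))  # 0.5 * 0.9**3 + 0.5 * 0.1**3 = 0.365
```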
One usually hopes to find a relatively small number of components \(K\). However, in this blog post we allow arbitrarily large \(K\) and focus on the following questions: how expressive is this family of mixture models? Can it approximate arbitrary distributions well?
Let’s consider the ground-truth data distribution \[ D = \sum_{k=1}^{2^G} \pi_k \delta_{y_k}, \]
where \(y_k \in \mathcal Y\) are all the \(2^G\) possible atoms and \(\pi \in \bar \Delta^{2^G - 1}\) are probabilities of observing different atoms. We allow \(\pi_k = 0\) for some \(k\) (i.e., we work with the closed probability simplex, rather than the open one), which results in a smaller number of observed atoms.
We want to find a Bernoulli mixture which approximates this distribution well. Of course, models with a restriction \(K\ll 2^G\) are more interesting; however, they also seem to be much harder to study. Let’s instead use a model with all possible \(K=2^G\) components of the following form: \[ P_\epsilon = \sum_{k=1}^{K} u(\pi_k, \epsilon) \mathrm{Ber}^{G}( s_\epsilon(y_k) ), \]
where \(s_\epsilon(y) = (1-\epsilon) y + \epsilon(1-y)\), applied entry-wise, gives noisy versions of the atoms (these can be understood as actually tossing coins with biases \(\epsilon\) and \(1-\epsilon\)) and \(u(\pi_k, \epsilon) = \frac{ \pi_k + \epsilon }{ 1 + \epsilon K}\) ensures that the mixing weights belong to the open simplex \(\Delta^{K-1}\).
Informally, we expect that for \(\epsilon \approx 0\) we would have \(P_\epsilon \approx D\), but let’s try to make it precise in terms of the discrepancy measures used earlier.
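Before doing that, a quick numerical sanity check may be reassuring. The sketch below (with an illustrative helper `p_eps_pmf`) enumerates all \(2^G\) atoms, so it is feasible only for tiny \(G\):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
G = 3
atoms = np.array(list(itertools.product([0, 1], repeat=G)))  # all 2^G binary vectors
K = len(atoms)
pi = rng.dirichlet(np.ones(K))                               # ground-truth weights of D

def p_eps_pmf(eps):
    """Probability vector of P_eps over the 2^G atoms."""
    u = (pi + eps) / (1.0 + eps * K)                         # smoothed mixing weights
    s = (1.0 - eps) * atoms + eps * (1.0 - atoms)            # noisy atoms s_eps(y_k), shape (K, G)
    # comp[k, j] = Ber^G(atoms[j] | s_eps(atoms[k]))
    comp = np.prod(
        np.where(atoms[None, :, :] == 1, s[:, None, :], 1.0 - s[:, None, :]),
        axis=2,
    )
    return u @ comp

for eps in [0.1, 0.01, 0.001]:
    tv = 0.5 * np.abs(p_eps_pmf(eps) - pi).sum()
    print(eps, tv)   # TV(D, P_eps) shrinks roughly linearly in eps
```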
Bounding the total variation distance
Recall that \(\mathrm{TV}( D, P_\epsilon )\) is half of the \(L_1\) norm. Let’s focus on \(y_k \in \mathcal Y\). We have \[ |P_\epsilon(y_k) - D(y_k)| \le | u(\pi_k, \epsilon) \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_k) ) - \pi_k | + \sum_{a\neq k} u(\pi_a, \epsilon) \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_a)) \]
We have \[ | u(\pi_k, \epsilon) \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_k) ) - \pi_k | = \left| \frac{\pi_k + \epsilon}{1+\epsilon K} (1-\epsilon)^G - \pi_k \right|, \]
which intuitively can be made arbitrarily small by choosing \(\epsilon\) appropriately. More precisely, we have a Taylor approximation (employing the infinitesimal notation for big \(O\)): \[ \begin{align*} | u(\pi_k, \epsilon) \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_k) ) - \pi_k | &= \left| 1 - (2^G + G) \pi_k \right| \epsilon + O(\epsilon^2) \\ &\le (1 + G + 2^G)\epsilon + O(\epsilon^2), \end{align*} \] where the last step uses \(0 \le \pi_k \le 1\).
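The first-order coefficient can be double-checked symbolically, e.g., with sympy (a sketch; `pi_k` stands for \(\pi_k\) and we take \(G = 4\), so \(K = 2^G = 16\)):

```python
import sympy as sp

eps, pi_k = sp.symbols("epsilon pi_k", positive=True)
G = 4
K = 2**G

# u(pi_k, eps) * Ber^G(y_k | s_eps(y_k)) - pi_k
expr = (pi_k + eps) / (1 + eps * K) * (1 - eps) ** G - pi_k
print(sp.series(expr, eps, 0, 2))
# The epsilon coefficient equals 1 - (G + 2**G) * pi_k, i.e. 1 - 20 * pi_k for G = 4.
```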
Now for \(a\neq k\) and \(\epsilon < 1/2\) we have \[ u(\pi_a, \epsilon) \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_a) ) \le \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_a)) \le \epsilon (1-\epsilon)^{G-1} \le \epsilon, \]
where the bound follows from the following reasoning: if \(y_a = y_k\), we would obtain the right outcome \(y_k\) with probability \((1-\epsilon)^G\), by not encountering any bitflip due to the noise \(\epsilon\). However, since \(y_a \neq y_k\), there are \(m\ge 1\) positions on which a bitflip (probability \(\epsilon\) each) is needed to obtain the desired result \(y_k\). Hence, the probability in this case is \(\epsilon^m(1-\epsilon)^{G-m} \le \epsilon (1-\epsilon)^{G-1}\), as we have \(\epsilon < 1/2\).
To sum up, we have \[ | D(y_k) - P_\epsilon(y_k) | \le C \epsilon + O(\epsilon^2), \]
where \(C\) depends only on \(G\) (and it seems that \(C=O(2^G)\) grows exponentially with \(G\), so \(\epsilon\) has to be very tiny). By summing up over all \(k\) we have \[ \mathrm{TV}(D, P_\epsilon) \le 2^G \cdot C \cdot \epsilon + O(\epsilon^2), \]
so that it can be made arbitrarily small.
Bounding the KL divergences
Let’s think about the variational approximation with \(P_\epsilon\) to \(D\). If \(D\) does not have full support, then we have \(P_\epsilon \not\ll D\) and \(\mathrm{KL}( P_\epsilon \parallel D) = +\infty\).
However, if \(D\) is supported on the whole \(\mathcal Y\), we can use the inverse Pinsker’s inequality with \(\alpha_D = \min_{y} D(y) > 0\) and obtain \[ \mathrm{KL}( P_\epsilon \parallel D) \le \frac{4}{\alpha_D} \mathrm{TV}(P_\epsilon, D)^2 = O(\epsilon^2). \]
Let’s now think about \(\mathrm{KL}(D \parallel P_\epsilon)\). Note that \(D \ll P_\epsilon\) always holds, since \(P_\epsilon\) has full support; however, the constant \(\alpha_\epsilon := \alpha_{P_\epsilon}\) in the inverse Pinsker’s inequality now varies with \(\epsilon\). Let’s bound it from below: for every \(k\) we have \[ P_\epsilon(y_k) \ge u(\pi_k, \epsilon)\, \mathrm{Ber}^{G}(y_k \mid s_\epsilon(y_k)) = \frac{ \pi_k + \epsilon}{1+\epsilon K} (1-\epsilon)^G, \]
so that
\[ \frac{1}{\alpha_\epsilon} \le \max_k \frac{(1-\epsilon)^{-G} (1+ K\epsilon) }{\pi_k + \epsilon}. \]
If \(D\) is fully supported, then \(\min_k \pi_k > 0\), so \(1/\alpha_\epsilon\) stays bounded as \(\epsilon \to 0\) and \(\mathrm{KL}(D \parallel P_\epsilon) \le O(\epsilon^2)\) also decreases at the quadratic rate. However, if \(\pi_k = 0\) for some \(k\), then \(1/\alpha_\epsilon = O(1/\epsilon)\) and we still obtain \(\mathrm{KL}(D \parallel P_\epsilon) \le O(\epsilon)\).
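These rates can again be checked numerically. A self-contained sketch (the helpers `kl` and `p_eps` are illustrative; the construction mirrors \(P_\epsilon\) above):

```python
import itertools
import numpy as np

G, rng = 3, np.random.default_rng(0)
atoms = np.array(list(itertools.product([0, 1], repeat=G)))
K = len(atoms)

def kl(P, Q):
    """KL(P || Q), relying on Q having full support and using 0 log 0 = 0."""
    m = P > 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))

def p_eps(pi, eps):
    """Probability vector of P_eps over the 2^G atoms."""
    u = (pi + eps) / (1.0 + eps * K)
    s = (1.0 - eps) * atoms + eps * (1.0 - atoms)
    comp = np.prod(np.where(atoms[None] == 1, s[:, None], 1.0 - s[:, None]), axis=2)
    return u @ comp

pi_full = rng.dirichlet(np.ones(K))    # full support: expect KL(D || P_eps) = O(eps^2)
pi_sparse = np.zeros(K)                # only two atoms observed: expect O(eps)
pi_sparse[:2] = 0.5

for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, kl(pi_full, p_eps(pi_full, eps)), kl(pi_sparse, p_eps(pi_sparse, eps)))
```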
Hence, it seems to me that Bernoulli mixtures (albeit with too many components to be practically useful) approximate arbitrary distributions well in terms of the total variation distance and the forward KL divergence, while approximation in the backward KL divergence is possible only when the distribution has full support.
However, I am not sure if I did not make a mistake somewhere in the calculations: if you see something suspicious in this derivation, please let me know.