The mutual information saga

Behind the scenes of our mutual information research.
Author

Paweł Czyż

Published

November 20, 2023

Where and when should we start this story? A good origin is probably the Laboratory of Modeling in Biology and Medicine in Warsaw, where Tomasz Lipniacki and Marek Kochańczyk decided to mentor two students who had just finished high school, namely the younger versions of Frederic and myself.

Informally speaking, we tried to model the MAPK pathway as a communication channel. Imagine that the cell is given some input \(x\in \mathcal X\) and we measure the response \(y\in \mathcal Y\). As we vary \(x\) and record different values of \(y\), we may start observing some patterns – perhaps changes in \(y\) are somehow associated with changes in \(x\). To make this more formal, consider a random variable \(X\) representing the given inputs and another random variable \(Y\) representing the outputs. The mutual information \(I(X; Y)\) measures how dependent these variables are: \(I(X; Y) = 0\) if and only if \(X\) and \(Y\) are independent. Unlike correlation, mutual information works for any kind of variables (continuous, discrete, of arbitrary dimension…) and can capture nonlinear dependencies.
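As a tiny illustration of that last point (a toy example using scikit-learn's k-NN estimator, unrelated to the MAPK project): a noisy quadratic dependence has near-zero correlation, yet clearly positive mutual information.

```python
# Correlation misses the quadratic dependence; mutual information does not.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5_000)
y = x**2 + 0.05 * rng.normal(size=5_000)   # Y depends on X, but not linearly

print("Pearson correlation:", np.corrcoef(x, y)[0, 1])   # close to 0
print("Estimated I(X; Y) in nats:",
      mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])  # clearly positive
```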

I very much enjoyed my time on the project and the mentorship I received! We wrote a paper, and – perhaps more importantly – I learned that information theory and biology are great fields to study!

Many years passed, and I thought that perhaps it was time to become a mentor myself. Fortunately, the very first Master's student I supervised, Anej, was so bright and motivated that he wrote a great Master's thesis despite having such an inexperienced supervisor as me! Anej was working on representation learning and extensively used mutual information estimators. But I had not done my homework: when I wrote the project proposal, I happily assumed that mutual information estimators have to work if the space \(\mathcal X\times \mathcal Y\) is of moderate dimension and the number of samples is large. I was wrong. Different mutual information estimators seemed to give very different estimates, and that was concerning.

Beyond normal

I did the only rational thing in this situation: I ran screaming for help to two mutual information experts, Frederic and Alex. We started thinking:

  1. How do we really know when mutual information estimators work? As mutual information is analytically known only for the simplest distributions, estimators are usually evaluated on “simple” low-dimensional distributions (or moderate-dimensional multivariate normal distributions).
  2. Is it possible to construct more expressive distributions with known ground-truth mutual information?
  3. How invariant are the estimators to diffeomorphisms? Namely, if \(f\) and \(g\) are diffeomorphisms, then \(I(X; Y) = I(f(X); g(Y))\). Do the numerical estimates have the same property?

The 1st and 2nd questions are related. But so are the 2nd and 3rd! Suppose that we can easily sample points \((x_1, y_1), \dotsc, (x_n, y_n)\) from the joint distribution \(P_{XY}\). If \(f\colon \mathcal X\to \mathcal X\) and \(g\colon \mathcal Y\to \mathcal Y\) are diffeomorphisms, we can apply them to obtain a sample \((f(x_1), g(y_1)), \dotsc, (f(x_n), g(y_n))\) from \(P_{f(X)g(Y)}\), the joint distribution of the variables \(f(X)\) and \(g(Y)\). As we apply a diffeomorphism1, the mutual information does not change: \(I(X; Y) = I(f(X); g(Y))\).
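In code, this recipe looks roughly as follows (a toy example with an illustrative distribution and transformations, not the actual benchmark):

```python
# Sample from a known joint distribution, then push the coordinates through
# diffeomorphisms f and g; the ground-truth mutual information is unchanged.
import numpy as np

rng = np.random.default_rng(42)
rho = 0.7
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal(np.zeros(2), cov, size=10_000)

# For the bivariate normal, I(X; Y) = -0.5 * log(1 - rho^2) is known exactly.
ground_truth_nats = -0.5 * np.log(1.0 - rho**2)

f = lambda x: x + x**3        # a diffeomorphism of the real line
g = np.tanh                   # a diffeomorphism of R onto (-1, 1)
transformed = np.column_stack([f(xy[:, 0]), g(xy[:, 1])])

# `transformed` is a sample from P_{f(X)g(Y)}; any estimator evaluated on it
# should ideally still recover `ground_truth_nats`.
```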

Frederic and I started programming2 different distributions and transformations, learning new tools along the way, and after five months we had a manuscript ready, titled Are mutual information estimators homeomorphism-invariant?, which argued that the 3rd question was the most important one for a lapsed differential geometer currently trying to be an imposter in the machine learning world – that is, me.

Well, I was wrong: after the manuscript got rejected from ICML, we realized that the most important aspect of our work was actually using the transformed distributions to study the strengths and limitations of existing estimators3. We improved the experiments in the paper and changed the story to Beyond normal: on the evaluation of mutual information estimators4, which was accepted to NeurIPS.

Of course, we were very happy. But there were some important aspects that deserved to be studied a bit more…

Here comes the trouble

Really that expressive?

Our distributions were only “beyond normal”, rather than “just amazing”: we suspected that one cannot construct all the interesting distributions in this way.

Consider \(\mathcal X = \mathbb R^m\), \(\mathcal Y = \mathbb R^n\) and a random vector \(Z \sim \mathcal N(0, I_{m+n})\). Normalizing flows guarantee that there exists a diffeomorphism \(u: \mathcal X\times \mathcal Y \to \mathcal X\times \mathcal Y\) such that \(P_{XY}\) is well-approximated5 by the distribution of \(u(Z)\).

However, the diffeomorphism \(u\) does not have to be of the form \(f\times g\) (recall that \((f\times g)(x, y) = (f(x), g(y))\)), which leaves the mutual information invariant. Thinking geometrically, the product group \(\mathrm{Diff}(\mathcal X)\times \mathrm{Diff}(\mathcal Y)\) is usually a very small subgroup of \(\mathrm{Diff}(\mathcal X\times \mathcal Y)\), the group of all diffeomorphisms of \(\mathcal X\times \mathcal Y\).

We had this intuition quite early on, but we did not have a convincing counterexample showing that our distributions were not expressive enough. Then Frederic started plotting histograms of pointwise mutual information, which let us formalize this intuition.

Consider the pointwise mutual information: \[ i_{XY}(x, y) = \log \frac{ p_{XY}(x, y) }{ p_X(x)\, p_Y(y) },\] where \(p_{XY}\) is the PDF of the joint distribution and \(p_X\) and \(p_Y\) are the PDFs of the marginal distributions. It is easy to prove that if \(f\) and \(g\) are diffeomorphisms, \(x'=f(x)\) and \(y'=g(y)\), then6 \[i_{XY}(x, y) = i_{f(X)g(Y)}(x', y').\] From this it follows that the distribution of the random variable \(i_{XY}(X, Y)\) is the same as that of \(i_{f(X)g(Y)}(f(X), g(Y))\). We termed this distribution the pointwise mutual information profile, although it's more than likely that people had already studied it before us.
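To see why the pointwise mutual information is invariant, apply the change-of-variables formula for densities: \[ p_{f(X)g(Y)}(x', y') = \frac{p_{XY}(x, y)}{|\det J_f(x)|\,|\det J_g(y)|}, \qquad p_{f(X)}(x') = \frac{p_X(x)}{|\det J_f(x)|}, \qquad p_{g(Y)}(y') = \frac{p_Y(y)}{|\det J_g(y)|}, \] where \(J_f\) and \(J_g\) denote the Jacobians; the Jacobian determinants cancel in the ratio, leaving \(i_{f(X)g(Y)}(x', y') = i_{XY}(x, y)\).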

Hence, diffeomorphisms leave invariant not only the mutual information but also the whole pointwise mutual information profile, which limits how expressive the distributions can be; see also this post for more general invariance results. We did not yet have a counterexample, but we had a strong feeling that one should exist: we just needed to find distributions with different profiles but the same mutual information.

Model-based mutual information estimation

We started the project with the idea that if \((A, B)\sim \mathcal N(0, \Sigma)\), then \(I(A; B)\) is analytically known in terms of the covariance matrix \(\Sigma\), and we can obtain more complicated dependencies between \(X=f(A)\) and \(Y=g(B)\) without changing the mutual information: \(I(X; Y) = I(A; B)\).
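For reference: writing the covariance matrix in block form, \[\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix},\] the standard closed form is \[ I(A; B) = \frac{1}{2}\,\log \frac{\det \Sigma_{AA}\, \det \Sigma_{BB}}{\det \Sigma}. \]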

At the same time we asked the inverse question: if we have \(X\) and \(Y\), can we perhaps find a covariance matrix \(\Sigma\) and a normalizing flow \(f\times g\) such that \((X, Y) = (f(A), g(B))\), where \((A, B)\) is distributed according to the multivariate normal distribution with covariance matrix \(\Sigma\)? If \(f\) and \(g\) are identity functions, this construction corresponds to assuming that \((X, Y)\) is multivariate normal and calculating the mutual information by estimating the joint covariance matrix. A particular example of this approach is canonical correlation analysis, which worked remarkably well for multivariate normal distributions, providing more accurate estimates from fewer samples.

However, as discussed above, we generally cannot expect that a normalizing flow of the form \(f\times g\) will transform an arbitrary distribution into a multivariate normal one. So there is some potential in modelling the joint distribution \(P_{XY}\) more explicitly, but we needed to make sure that the model is expressive enough to cover interesting cases.

Do outliers break everything?

There's no real data without real noise, and we wanted distributions which can be used in practice. One source of noise is outliers, which can sometimes be attributed to errors in data collection or recording (e.g., the equipment was apparently switched off or some piece of experimental setup broke), and which are well-known suspects when an estimator behaves badly. In Beyond normal we investigated heavy-tailed distributions (either by applying transformations to multivariate normal distributions to make the tails heavier, or by using multivariate Student distributions), but we felt that this was not enough.

The mixtures (and the critics)

Outliers were the most concerning to me, and I had the feeling that if \(P_{XY}\) is the distribution from which we want to sample and \(P_{\tilde X \tilde Y}\) is a “noise distribution” with \(I(\tilde X; \tilde Y) = 0\), then we could perhaps calculate the information contained in the mixture distribution: \[P_{X'Y'} = (1-\alpha)\, P_{XY} + \alpha \, P_{\tilde X \tilde Y},\] where \(\alpha \ll 1\) is the fraction of outliers.

I spent quite some time with a pen and paper trying to calculate \(I(X'; Y')\) in terms of \(I(X; Y)\), but I could not really derive anything. Even proving the conjectured bound that \(I(X'; Y')\) should not exceed \(I(X; Y)\) was hard…

And it's actually good that I didn't manage to prove it: the conjecture is false. When I asked Frederic about it, he immediately responded with:

  1. an example of two distributions such that each of them encodes 0 bits, but their mixture encodes 1 bit;
  2. an example of two distributions such that each of them encodes 1 bit, but their mixture encodes 0 bits.
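One pair of examples in this spirit (an illustration; the exact constructions may have differed): for the first case, let \(P_1\) put all its mass at \((X, Y) = (0, 0)\) and \(P_2\) at \((X, Y) = (1, 1)\). Each has \(I(X; Y) = 0\), yet in the 50/50 mixture \(X\) is a fair coin and \(Y\) is its exact copy, so the mixture encodes 1 bit. For the second case, let \(P_1\) have \(X\sim\mathrm{Bernoulli}(1/2)\) with \(Y = X\), and \(P_2\) the same \(X\) with \(Y = 1 - X\). Each encodes 1 bit, while in the 50/50 mixture \(X\) and \(Y\) are independent fair coins, so the mixture encodes 0 bits.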

This is disturbing. As Larry Wasserman said: “I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” Fortunately, while I was still struggling to prove the conjecture, I recalled Frederic's histograms approximating the pointwise mutual information profile. For multivariate normal and Student distributions he sampled a lot of data points \((x_1, y_1), \dotsc, (x_n, y_n)\) and then evaluated the pointwise mutual information \(i_{XY}(x_i, y_i)\) – which is easy to do using the log-densities \(\log p_{XY}\), \(\log p_X\) and \(\log p_Y\) – to construct a histogram. The mean of this sample is an estimate of the mutual information \(I(X; Y) = \mathbb E_{(x, y)\sim P_{XY} }[i_{XY}(x, y)]\).

This works for multivariate normal and Student distributions, so why wouldn't it work for mixture distributions? In the end, this is simply a Monte Carlo estimator: we only need to sample a lot of data points (sampling from a mixture distribution is trivial if one can sample from the component distributions) and evaluate the pointwise mutual information (which can be calculated from the PDFs of the involved variables; the PDF of a mixture can be evaluated using the PDFs of its components).
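A minimal sketch of this estimator for a toy two-component Gaussian mixture (illustrative parameters; using SciPy rather than the TensorFlow Probability prototype):

```python
# Monte Carlo estimate of I(X; Y) for a mixture of two bivariate normals over
# (X, Y), with X and Y one-dimensional.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = [np.array([0.0, 0.0]), np.array([2.0, -2.0])]
covs = [np.array([[1.0, 0.8], [0.8, 1.0]]),
        np.array([[1.0, -0.5], [-0.5, 1.0]])]

def sample_joint(n):
    """Sample (x, y) pairs: pick a mixture component, then sample from it."""
    ks = rng.choice(len(weights), size=n, p=weights)
    out = np.empty((n, 2))
    for k in range(len(weights)):
        idx = ks == k
        out[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.sum())
    return out

def log_pdf(points, dims):
    """Log-density of the mixture restricted to the coordinates in `dims`
    (marginals of a Gaussian mixture are again Gaussian mixtures)."""
    comp_logs = np.stack(
        [multivariate_normal.logpdf(points[:, dims], m[dims], c[np.ix_(dims, dims)])
         for m, c in zip(means, covs)],
        axis=-1,
    )
    return logsumexp(comp_logs + np.log(weights), axis=-1)

xy = sample_joint(100_000)
# Pointwise mutual information: i(x, y) = log p_XY(x, y) - log p_X(x) - log p_Y(y).
pmi = log_pdf(xy, [0, 1]) - log_pdf(xy, [0]) - log_pdf(xy, [1])
print("Monte Carlo estimate of I(X; Y) in nats:", pmi.mean())
# A histogram of `pmi` approximates the pointwise mutual information profile.
```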

Hence, although we do not have an exact formula for \(I(X; Y)\) when \(P_{XY}\) is a mixture of multivariate normal or Student distributions, we have its Monte Carlo approximation. Now we can apply diffeomorphisms to such a distribution to obtain even more expressive distributions: e.g., a normalizing flow applied to a Gaussian mixture, a mixture of normalizing flows, or a mixture of all of these again…

Implementing a prototype in TensorFlow Probability took one day7, and after a month we had a manuscript ready.

I very much like this concept because:

  • This construction can model outliers as a mixture of different distributions.
  • We can construct a mixture distribution with a pointwise mutual information profile different from that of the multivariate normal distribution. Because Gaussian mixtures are very expressive, we can build a whole new family of distributions!
  • We also do not have to model \(P_{XY}\) as a multivariate normal distribution transformed by a diffeomorphism \(f\times g\) – we can use Gaussian (or Student) mixtures! We quickly built a prototype of model-based mutual information estimation with mixtures, which can provide uncertainty quantification for mutual information in the usual Bayesian manner.

Let's finish this post where we started: in Beyond normal we could apply diffeomorphisms to transform continuous \(X\) and \(Y\) variables. The Monte Carlo estimator from the follow-up manuscript applies in just the same manner when \(X\) is a discrete variable and \(Y\) is continuous – which is exactly the case we investigated all those years ago!

Footnotes

  1. More precisely, if all the spaces involved are standard Borel, then \(f\colon \mathcal X\to \mathcal X'\) and \(g\colon \mathcal Y\to \mathcal Y'\) can be continuous injective mappings and can, e.g., increase the dimensionality of the space. See Theorem 2.1 here – a well-known fact, but it still took us quite some time to prove it. M.S. Pinsker's Information and information stability of random variables and processes proved to be an invaluable resource.↩︎

  2. As a byproduct we learned Snakemake, which transformed my approach to data science entirely. But this is a different story.↩︎

  3. Also, Frederic figured out that the 3rd question (whether there exist estimators that are invariant to diffeomorphisms) has a trivial answer. If \(\mathcal M\) is a connected smooth manifold of dimension at least 2, then for any two sets of distinct points \(\{a_1, \dotsc, a_n\}\) and \(\{b_1, \dotsc, b_n\}\) there exists a diffeomorphism \(u\colon \mathcal M\to \mathcal M\) such that \(b_i = u(a_i)\) (a proof of this fact can be found, e.g., in P.W. Michor's and Cornelia Vizman's \(n\)-transitivity of certain diffeomorphism groups). Hence, if \(\mathcal X\) and \(\mathcal Y\) fulfil the assumptions above, we can move a finite data set wherever we wish, and the only invariant estimator has to return the same answer for any set of input data points.↩︎

  4. Let me state two obvious facts. First: “beyond” refers to the fact that we can transform multivariate normal distributions to obtain more expressive distributions. Second: although I intended to name the paper after a wonderful musical, Next to Normal, we worried that the title was copyrighted.↩︎

  5. Important detail: “well-approximated” using one statistical distance may be “badly approximated” with respect to another one.↩︎

  6. It’s tempting to say that pointwise mutual information transforms as a scalar under the transformations from the group \(\mathrm{Diff}(\mathcal X)\times \mathrm{Diff}(\mathcal Y)\).↩︎

  7. Actually, it took three weeks of me complaining that we couldn't use multivariate Student distributions in TensorFlow Probability on JAX. Once I learned that I was wrong, a few hours passed before we had the prototype, and a couple more days before it was refactored into a stable solution.↩︎