Invariance of the mutual information and the profile
In our recent manuscript, we study the pointwise mutual information profile, which generalizes the notion of mutual information between two random variables to higher moments. In the manuscript, we assume that all the spaces and functions involved satisfy suitable regularity conditions (smoothness is our typical assumption) and prove that the pointwise mutual information profile doesn’t change when either variable is reparameterized by a diffeomorphism.
These assumptions are convenient, allowing for very short proofs covering many interesting distributions, but they are also stronger than necessary. In this post, we’ll pay the price of doing some measure theory to make the results more general and to show what the real point is. Update: a summary of this post has been added to our manuscript as Appendix A.5.
KL divergence
Let \(P\) and \(Q\) be two probability distributions on a standard Borel space \(\mathcal X\). In fact, the most interesting case for us is when \(P=P_{XY}\) is a joint probability distribution of random variables \(X\) and \(Y\), and \(Q=P_X\otimes P_Y\) is the product of marginal distributions, but some extra generality won’t harm us.
We will assume that \(P\ll Q\), so that we have a well-defined Radon–Nikodym derivative \(f = \mathrm{d}P/\mathrm{d}Q \colon \mathcal X\to [0, \infty)\), which is a measurable function. Note that it is defined only up to a \(Q\)-null set (and, as \(P\ll Q\), every \(Q\)-null set is also a \(P\)-null set). By extending the logarithm with the convention \(\log 0 := -\infty\), one can define a measurable function \(\log f\colon \mathcal{X}\to [-\infty, \infty) = \mathbb R \cup \{-\infty\}\), which appears in the well-known definition of the Kullback–Leibler divergence: \[ \mathrm{KL}(P\parallel Q) = \int f\, \log f\, \mathrm{d}Q = \int \log f\, \mathrm{d}P. \]
It is easy to show that this definition does not depend on the version of \(f\) used.
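As a quick numerical illustration (just a sketch, not part of the manuscript; it assumes NumPy/SciPy and a toy pair of one-dimensional Gaussians), both integral forms can be approximated by Monte Carlo and compared with the closed-form Gaussian KL divergence:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy example: P = N(0, 1) and Q = N(1, 2^2) on the real line.
p = stats.norm(loc=0.0, scale=1.0)
q = stats.norm(loc=1.0, scale=2.0)

def log_f(x):
    """Pointwise log of the Radon-Nikodym derivative dP/dQ."""
    return p.logpdf(x) - q.logpdf(x)

n = 1_000_000
x_p = p.rvs(size=n, random_state=rng)  # samples from P
x_q = q.rvs(size=n, random_state=rng)  # samples from Q

kl_from_p = np.mean(log_f(x_p))                       # approximates ∫ log f dP
kl_from_q = np.mean(np.exp(log_f(x_q)) * log_f(x_q))  # approximates ∫ f log f dQ

# Closed form: KL(N(m1, s1^2) || N(m2, s2^2)) = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2.
kl_exact = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

print(kl_from_p, kl_from_q, kl_exact)  # all three should be close to 0.443
```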
PMI profile
However, we will be more interested in the “histogram of \(\log f\) values”. By the pointwise mutual information profile (perhaps we should call it the pointwise log-density-ratio profile, but let’s use the former name) we will understand the pushforward distribution \[ \mathrm{Prof}_{P\parallel Q} := (\log f)_\sharp P, \] which (a) seems to live on \([-\infty, \infty) = \mathbb R \cup \{-\infty\}\) rather than on \(\mathbb R\), and (b) may seem ill-defined, as \(\log f\) is specified only up to a \(Q\)-null set!
In fact, neither of these issues is serious. To see this, let’s write \(\mathrm{Prof}\) for the profile and note that \[ \begin{align*} \mathrm{Prof}(\{-\infty\}) &= P( \{x\in \mathcal X\mid \log f(x) = -\infty\}) \\ &= P(\{x\in \mathcal X\mid f(x) = 0\}) = 0, \end{align*} \] because we can write \(P(E) = \int_E f\, \mathrm{d}Q\).
The proof that the profile doesn’t really depend on which version of \(f\) we use is also easy: if \(g = f\) up to a \(Q\)-null set, then for any Borel subset \(B\subseteq [-\infty, \infty)\) we have \(\mathbf{1}_B(\log g) = \mathbf{1}_B(\log f)\) up to a \(P\)-null set (remember that \(P \ll Q\)), so the measure assigned to \(B\), namely \(\mathrm{Prof}(B) = \int \mathbf{1}_B( \log f ) \, \mathrm{d}P\), is the same for both versions.
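Now that we know the profile is well-defined, here is a small illustrative sketch (again only a sketch, assuming NumPy/SciPy and a correlated bivariate Gaussian, for which the pointwise mutual information has a closed form): it evaluates the pointwise mutual information at samples from the joint distribution and histograms the values to obtain an empirical profile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# P = P_XY is a bivariate Gaussian with correlation rho; Q = P_X ⊗ P_Y has standard normal factors.
rho = 0.8
joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

n = 100_000
xy = joint.rvs(size=n, random_state=rng)

# Pointwise mutual information log f = log dP_XY / d(P_X ⊗ P_Y), evaluated at samples from P_XY.
pmi = joint.logpdf(xy) - stats.norm.logpdf(xy[:, 0]) - stats.norm.logpdf(xy[:, 1])

# The empirical profile is just the histogram of these values; its mean estimates the MI,
# which for this distribution equals -0.5 * log(1 - rho^2) ≈ 0.511.
hist, edges = np.histogram(pmi, bins=50, density=True)
print("MI estimate:", pmi.mean(), "exact:", -0.5 * np.log(1 - rho**2))
```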
Invariance of the profile
Let’s now prove the invariance of the profile: we want to show that, under reasonable assumptions on \(i\colon \mathcal X\to \mathcal X'\), the profile of the push-forward distributions, \(\mathrm{Prof}_{i_\sharp P\parallel i_\sharp Q}\), is well-defined and that, in fact, it coincides with the original profile \(\mathrm{Prof}_{P\parallel Q}\).
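Here is one way to see the first claim (a sketch; as discussed below, we assume that \(i\) is measurable and admits a measurable left inverse \(\ell\colon \mathcal X'\to \mathcal X\), i.e. \(\ell\circ i = \mathrm{id}_{\mathcal X}\)). First, \(i_\sharp P\ll i_\sharp Q\): if \(i_\sharp Q(E) = Q(i^{-1}(E)) = 0\) for a Borel set \(E\subseteq \mathcal X'\), then \(P(i^{-1}(E)) = 0\) because \(P\ll Q\), i.e. \(i_\sharp P(E) = 0\). Second, \(f\circ \ell\) is a version of the Radon–Nikodym derivative \(\mathrm{d}(i_\sharp P)/\mathrm{d}(i_\sharp Q)\), since for every Borel set \(E\subseteq \mathcal X'\) \[ \int_E f\circ \ell\; \mathrm{d}(i_\sharp Q) = \int_{i^{-1}(E)} f\circ \ell\circ i\; \mathrm{d}Q = \int_{i^{-1}(E)} f\; \mathrm{d}Q = P(i^{-1}(E)) = (i_\sharp P)(E). \]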
Great! Now we know that the profile \(\mathrm{Prof}_{i_\sharp P \parallel i_\sharp Q }\) is indeed well-defined. The proof of the invariance is then very simple:
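In a nutshell (again a sketch using the same left inverse \(\ell\)): \[ \mathrm{Prof}_{i_\sharp P\parallel i_\sharp Q} = \big(\log (f\circ \ell)\big)_\sharp (i_\sharp P) = \big(\log f\circ \ell\circ i\big)_\sharp P = (\log f)_\sharp P = \mathrm{Prof}_{P\parallel Q}, \] where the second equality uses \(h_\sharp(i_\sharp P) = (h\circ i)_\sharp P\) for any measurable \(h\), and the third uses \(\ell\circ i = \mathrm{id}_{\mathcal X}\).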
How often do measurable left inverses exist?
Above we assumed the existence of a measurable left inverse, while it is more common to assume that the mapping is continuous and injective (a very convenient criterion, as it is easy to verify). Fortunately, as we work with standard Borel spaces, we can use the following result: if \(\mathcal X\) and \(\mathcal X'\) are standard Borel spaces and \(i\colon \mathcal X\to \mathcal X'\) is Borel-measurable (in particular, continuous) and injective, then the image \(i(\mathcal X)\) is a Borel subset of \(\mathcal X'\) and \(i\) is a Borel isomorphism onto its image; this is the Lusin–Suslin theorem (see, e.g., Kechris, Classical Descriptive Set Theory). Extending the inverse from \(i(\mathcal X)\) to the whole of \(\mathcal X'\) by sending the complement of the image to an arbitrary fixed point then gives a measurable left inverse \(\ell\).
What are the profiles good for?
In our manuscript we studied the PMI profiles for the following reasons:
- As they are invariant to continuous injective mappings, it turns out that our Beyond normal paper had only a few “really different” distributions.
- The PMI profiles seem to be related to the estimation of mutual information using variational losses.
- When we were trying to understand the mutual information in the Student distribution, we decided to use a Monte Carlo estimator of mutual information, essentially constructing the PMI profiles as a byproduct. This idea then turned out to be an interesting way of building distributions for which an analytical expression for the ground-truth MI is not available, but for which it can be obtained via Monte Carlo approximation.
- Judging by the Monte Carlo standard error, we felt that if the variance of the PMI profile is large, the mutual information (being its mean) may be hard to estimate; see the sketch after this list. However, I don’t have a good intuition for whether this is actually true.
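As a tiny illustration of the last point (the same toy bivariate Gaussian as in the sketch above, so an illustration rather than a general argument): for a fixed sample size, a wider profile translates into a larger Monte Carlo standard error for the mutual information estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

for rho in (0.5, 0.99):
    joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    xy = joint.rvs(size=n, random_state=rng)
    pmi = joint.logpdf(xy) - stats.norm.logpdf(xy[:, 0]) - stats.norm.logpdf(xy[:, 1])
    # Mean of the profile = MI estimate; its Monte Carlo standard error grows with the profile's spread.
    print(f"rho={rho}: MI estimate {pmi.mean():.3f}, profile std {pmi.std():.3f}, "
          f"standard error {pmi.std() / np.sqrt(n):.4f}")
```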
Overall, are they a useful concept? I’m not so sure: let’s give it some time and see what the community decides! I would like to see the PMI profiles appear in more contexts, but perhaps looking at just the first moment (the mutual information) is enough for most purposes.
I also think that it may be possible to generalize the PMI profiles to the \(f\)-divergence setting. There exist variational lower bounds related to \(f\)-GANs, but I can’t say yet whether introducing an “\(f\)-divergence profile” would yield any practical benefits. (Or even how to define it: in the end, many pushforwards could be defined, but that does not necessarily mean they would be good objects to study!)