Sometimes, if you misunderstand something, you end up with two interesting distributions rather than only one.
Author
Paweł Czyż
Published
August 31, 2023
Frederic, Alex, and I have been discussing some experiments related to our work on mutual information estimators, and Frederic suggested looking at a particular distribution. I misunderstood which one he meant, but this mistake turned out to yield quite an interesting object.
So let’s take a look at two distributions defined over the triangle \[T = \{ (x, y)\in (0, 1)\times (0, 1) \mid y < x \}\] and calculate the mutual information \(I(X; Y)\) in each case.
Uniform joint
Consider a probability distribution whose joint probability density function (PDF) is constant on the triangle: \[p_{XY}(x, y) = 2 \cdot \mathbf{1}[y<x].\]
We have \[p_X(x) = \int\limits_0^x p_{XY}(x, y)\, \mathrm{d}y = 2x\] and \[ p_Y(y) = \int\limits_0^1 p_{XY}(x, y) \mathbf{1}[y < x] \, \mathrm{d}x = \int\limits_y^1 p_{XY}(x, y) \, \mathrm{d}x = 2(1-y).\]
Hence, the pointwise mutual information is given by \[ i(x, y) = \log \frac{ p_{XY}(x, y) }{p_X(x) \, p_Y(y) } = \log \frac{1}{2x(1-y)}\] and the mutual information is \[ I(X; Y) = \int\limits_T p_{XY}(x, y)\, i(x, y)\, \mathrm{d}x\, \mathrm{d}y = 2\int\limits_0^1 \mathrm{d}x \int\limits_0^x \log \frac{1}{2x(1-y)}\, \mathrm{d}y = 1 - \log 2 \approx 0.307.\]
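We can sanity-check this value with a quick Monte Carlo sketch (assuming NumPy is available): sorting a pair of independent uniform variables gives a sample from the uniform distribution on \(T\), and averaging the pointwise mutual information over such samples should recover \(I(X; Y)\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sorting two independent Uniform(0, 1) draws gives a point distributed
# uniformly on the triangle T = {(x, y): 0 < y < x < 1}.
u = rng.uniform(size=(n, 2))
x, y = u.max(axis=1), u.min(axis=1)

# Pointwise mutual information for the uniform joint: i(x, y) = -log(2 x (1 - y)).
pmi = -np.log(2 * x * (1 - y))

print(pmi.mean())     # Monte Carlo estimate of I(X; Y)
print(1 - np.log(2))  # exact value, approximately 0.307 nats
```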
Uniform conditional
The above distribution is interesting, but when I heard about the distribution over the triangle, I actually had the following generative model in mind: \[\begin{align*}
X &\sim \mathrm{Uniform}(0, 1),\\
Y \mid X=x &\sim \mathrm{Uniform}(0, x).
\end{align*}\]
We have \(p_X(x) = 1\) and therefore \[p_{XY}(x, y) = p_{Y\mid X}(y\mid x) = \frac{1}{x}\,\mathbf{1}[y < x].\]
Again, this distribution is defined on the triangle \(T\), although now the joint is not uniform.
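Sampling from this generative model is straightforward; here is a minimal sketch (again assuming NumPy), which also checks that the samples land inside \(T\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X ~ Uniform(0, 1), then Y | X = x ~ Uniform(0, x).
x = rng.uniform(size=n)
y = rng.uniform(size=n) * x

assert np.all(y <= x)  # every sample lies in (the closure of) the triangle T
```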
We have \[ p_Y(y) = \int\limits_y^1 \frac{1}{x} \, \mathrm{d}x = -\log y\] and \[i(x, y) = \log \frac{1}{-x \log y} = -\log \big(x\cdot (-\log y)\big )
= - \left(\log x + \log(-\log y) \right) = -\log x - \log(-\log y).\] This expression suggests that if \(Y\) were uniform on \((0, 1)\) (but it is not), the pointwise mutual information \(i(x, Y)\) would follow a Gumbel distribution shifted by \(-\log x\).
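To make this concrete: if \(U \sim \mathrm{Uniform}(0, 1)\), then \(-\log U\) is exponentially distributed and \(-\log(-\log U)\) follows the standard Gumbel distribution, so \(i(x, U)\) would be a standard Gumbel shifted by \(-\log x\). A quick numerical check (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

# If Y were Uniform(0, 1), the random part of i(x, Y) would be -log(-log Y).
z = -np.log(-np.log(u))

# Compare against the standard Gumbel distribution with a Kolmogorov-Smirnov test.
print(stats.kstest(z, stats.gumbel_r.cdf))

# The mean of a standard Gumbel variable is the Euler-Mascheroni constant.
print(z.mean(), np.euler_gamma)
```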
The mutual information \[
I(X; Y) = -\int\limits_0^1 \mathrm{d}y \int\limits_y^1 \frac{ \log x + \log(-\log y)}{x} \, \mathrm{d}x = \frac{1}{2} \int\limits_0^1 \log y \cdot \log \left(y \log ^2 y\right) \, \mathrm{d}y = \gamma \approx 0.577
\] is in this case the Euler–Mascheroni constant. I don’t know how to do this integral, but both Mathematica and Wolfram Alpha seem to be quite confident in it.
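A Monte Carlo estimate obtained by sampling from the generative model and averaging the pointwise mutual information agrees with this value; a small sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample from the model: X ~ Uniform(0, 1), Y | X = x ~ Uniform(0, x).
x = rng.uniform(size=n)
y = rng.uniform(size=n) * x

# Pointwise mutual information i(x, y) = -log x - log(-log y).
pmi = -np.log(x) - np.log(-np.log(y))

print(pmi.mean())      # Monte Carlo estimate of I(X; Y)
print(np.euler_gamma)  # Euler-Mascheroni constant, approximately 0.577
```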
Perhaps it shouldn’t be too surprising, as \(\gamma\) appears in expressions involving the Gumbel distribution: in fact, the mean of the standard Gumbel distribution is exactly \(\gamma\). However, I’d like to understand this connection better.
Perhaps another time; let’s finish this post with another visualisation:
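A minimal sketch of such a visualisation (assuming NumPy and matplotlib), drawing samples from both distributions side by side:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Uniform joint on the triangle: sorted pair of independent uniforms.
u = rng.uniform(size=(n, 2))
x1, y1 = u.max(axis=1), u.min(axis=1)

# Uniform conditional: X ~ Uniform(0, 1), Y | X = x ~ Uniform(0, x).
x2 = rng.uniform(size=n)
y2 = rng.uniform(size=n) * x2

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
for ax, x, y, title in [(axes[0], x1, y1, "Uniform joint"),
                        (axes[1], x2, y2, "Uniform conditional")]:
    ax.scatter(x, y, s=1, alpha=0.2)
    ax.set_title(title)
    ax.set_xlabel("$x$")
axes[0].set_ylabel("$y$")
fig.tight_layout()
plt.show()
```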