The Dirichlet process

An informal introduction to several constructions of the Dirichlet process.
Author

Paweł Czyż

Published

July 11, 2023

In this post we will quickly review different constructions of the Dirichlet process, following Teh et al. (2006) and Gelman et al. (2013, chap. 23).

Finite-dimensional Dirichlet prior

Consider the simplest Gaussian mixture model: there are several normal distributions with unit variance, \(\mathcal N(\mu_k, 1)\) for \(k\in \{1, \dotsc, K\}\), and a vector of mixture proportions \(\pi = (\pi_1, \dotsc, \pi_K)\) with \(\pi_k\ge 0\) and \(\sum_k \pi_k=1\).

A convenient prior for \(\pi\) is the Dirichlet distribution. We put some \(F_0\) prior on the parameters \(\{\mu_k\}\) of the model, so the generative process looks like: \[\begin{align*} \pi \mid \alpha &\sim \mathrm{Dirichlet}(\alpha_1, \dotsc, \alpha_K)\\ \mu_k \mid F_0 &\sim F_0, & k=1, \dotsc, K\\ Z_n \mid \pi &\sim \mathrm{Categorical}(\pi_1, \dotsc, \pi_K), & n=1, \dotsc, N\\ X_n\mid Z_n=z_n, \{\mu_k\} &\sim \mathcal N(\mu_{z_n}, 1),\quad & n=1, \dotsc, N. \end{align*}\]
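To make this concrete, here is a minimal NumPy sketch of the generative process above. The choice \(F_0 = \mathcal N(0, 1)\) and the particular values of \(K\), \(N\), and the Dirichlet weights are illustrative assumptions, not part of the model itself.

```python
import numpy as np

rng = np.random.default_rng(42)
K, N, alpha = 5, 100, 1.0

pi = rng.dirichlet(alpha * np.ones(K))  # pi ~ Dirichlet(alpha, ..., alpha)
mu = rng.normal(size=K)                 # mu_k ~ F0 = N(0, 1) (assumed here)
z = rng.choice(K, size=N, p=pi)         # Z_n ~ Categorical(pi)
x = rng.normal(loc=mu[z], scale=1.0)    # X_n ~ N(mu_{Z_n}, 1)
```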

Another point of view

Rather than using individual random variables \(Z_n\) and a shared set of parameters \(\{\mu_k\}\), we could reparametrize the model to use individual means \(\tilde \mu_n = \mu_{Z_n}\). In other words, we could consider a probability measure with atoms \(\{\mu_k\}\) given by \[F = \sum_{k=1}^K \pi_k \delta_{\mu_k}.\]

If we only know the Dirichlet weights vector \((\alpha_1, \dotsc, \alpha_K)\) and the base measure \(F_0\) we can think of \(F\) as of a random probability measure generated according to \[\begin{align*} \pi \mid \alpha &\sim \mathrm{Dirichlet}(\alpha_1, \dotsc, \alpha_K)\\ \mu_k &\sim F_0, \quad k = 1, \dotsc, K\\ F &:= \sum_{k=1}^K \pi_k \delta_{\mu_k}. \end{align*}\]

Then sampling individual data points amounts to the following model with \(n=1, \dotsc, N\): \[\begin{align*} F\mid \alpha, F_0 &\sim \text{the procedure above}\\ \theta_n \mid F &\sim F, \\ X_n\mid \theta_n &\sim \mathcal N(\theta_n, 1). \end{align*}\]
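The same sketch as before, rewritten through the random-measure lens: we first materialize the atomic measure \(F\) and then draw each \(\theta_n\) directly from it (again with the illustrative assumption \(F_0 = \mathcal N(0, 1)\)).

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, alpha = 5, 100, 1.0

pi = rng.dirichlet(alpha * np.ones(K))  # weights of the atoms
mu = rng.normal(size=K)                 # atom locations mu_k ~ F0
theta = rng.choice(mu, size=N, p=pi)    # theta_n ~ F = sum_k pi_k delta_{mu_k}
x = rng.normal(loc=theta, scale=1.0)    # X_n ~ N(theta_n, 1)
```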

Note that the values of \(\theta_n\) come from the set \(\{\mu_1, \dotsc, \mu_K\}\) as \(F\) is atomic.

Dirichlet process prior

Stick-breaking construction

With the preceding example in mind, we now pass to a general distribution \(F_0\) defined over some infinite space \(\mathcal M\) (which can be \(\mathbb R\) as above) and a single positive parameter \(\alpha > 0\).

We will generate a random measure \(F\) from \(F_0\) using the stick-breaking construction, which yields a draw from the Dirichlet process.

Sample for \(k=1, 2, \dotsc\) \[\begin{align*} v_k \mid \alpha &\sim \mathrm{Beta}(1, \alpha)\\ \mu_k \mid F_0 &\sim F_0 \end{align*}\] and define \[\begin{align*} p_1 &= v_1\\ p_k &= v_k \prod_{i=1}^{k-1} (1-v_i) \quad \text{for } k\ge 2,\\ F &= \sum_{k=1}^\infty p_k \delta_{\mu_k} \end{align*}\]

With probability 1 it holds that \[\sum_{k=1}^\infty p_k = 1,\] i.e., \((p_k)\) is a valid proportions vector.

We say that the distribution \(F\) was drawn from the Dirichlet process: \[F \sim \mathrm{DP}(\alpha, F_0).\]
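In practice one cannot sample infinitely many sticks, so implementations truncate the sequence. A minimal sketch, assuming \(F_0 = \mathcal N(0, 1)\) and a generous truncation level (the leftover mass decays geometrically, so the truncation error is negligible for levels much larger than \(\alpha\)):

```python
import numpy as np

def sample_dp_stick_breaking(alpha: float, truncation: int, rng):
    """Truncated draw F ~ DP(alpha, F0): returns weights p_k and atoms mu_k."""
    v = rng.beta(1.0, alpha, size=truncation)  # v_k ~ Beta(1, alpha)
    # p_k = v_k * prod_{i<k} (1 - v_i), via a cumulative product of leftovers
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    p = v * leftover
    mu = rng.normal(size=truncation)           # mu_k ~ F0 (assumed N(0, 1))
    return p, mu

rng = np.random.default_rng(1)
p, mu = sample_dp_stick_breaking(alpha=2.0, truncation=1000, rng=rng)
print(p.sum())  # close to 1 for a generous truncation level
```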

Infinite limit

The atomic distributions generated with the finite-dimensional proportions \((\pi_k)_{k=1, \dotsc, K}\) and with the infinite sequence of weights \((p_k)_{k=1, 2, \dotsc}\) look superficially similar, and indeed there is a close relation between these two generative processes.

Consider a random measure \(F^K\) defined using a symmetric Dirichlet distribution: \[\begin{align*} \pi^K \mid \alpha &\sim \mathrm{Dirichlet}(\alpha/K, \cdots, \alpha/K)\\ \mu^K_k \mid F_0 &\sim F_0\\ F^K &= \sum_{k=1}^K \pi^K_k\delta_{\mu^K_k} \end{align*}\]

Now if \(F^{\infty} \sim \mathrm{DP}(\alpha, F_0)\) and \(u\) is any measurable function integrable with respect to \(F_0\), then the sequence of random variables \[ \int_{\mathcal M} u\, \mathrm{d} F^{K} \] converges in distribution (that is, weakly) to \[ \int_{\mathcal M} u\, \mathrm{d} F^{\infty}.\]
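We can probe this claim numerically. The sketch below (with the assumed choices \(u = \sin\), \(F_0 = \mathcal N(0, 1)\), and a truncated stick-breaking draw standing in for \(F^\infty\)) compares the Monte Carlo mean and standard deviation of \(\int u\, \mathrm{d} F^{K}\) across draws of \(F^K\) with those of \(\int u\, \mathrm{d} F^{\infty}\):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n_draws = 2.0, 2000
u = np.sin  # an assumed test function

def integral_finite(K):
    """One draw of the integral of u w.r.t. F^K, with symmetric Dirichlet weights."""
    pi = rng.dirichlet(alpha / K * np.ones(K))
    return np.sum(pi * u(rng.normal(size=K)))

def integral_dp(truncation=1000):
    """One draw of the integral of u w.r.t. a truncated F ~ DP(alpha, F0)."""
    v = rng.beta(1.0, alpha, size=truncation)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return np.sum(p * u(rng.normal(size=truncation)))

for K in (5, 50, 500):
    s = np.array([integral_finite(K) for _ in range(n_draws)])
    print(f"K={K}:  mean={s.mean():+.3f}  std={s.std():.3f}")

s = np.array([integral_dp() for _ in range(n_draws)])
print(f"DP:    mean={s.mean():+.3f}  std={s.std():.3f}")
```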

Where the difference really is

We see that \((p_k)\) looks deceptively similar to \((\pi_k^K)\) for large \(K\). There are some differences, though. First of all, \((p_k)\) is infinite and the number of atoms appearing in the analysis of a particular data set is implicitly controlled by the number of data points: if \(F_0\) is non-atomic, one can expect \(O(\alpha\log N)\) distinct atoms in a data set with \(N\) points. In the finite-dimensional case more than \(K\) clusters are impossible.

For \(K\gg N\) it’s natural to expect that several entries of \((\pi^K_k)\) should match several entries of \((p_k)\). However, the intuition that \(p_1 = \pi_1^K\), \(p_2 = \pi_2^K\), … is wrong. In the stick-breaking construction of the Dirichlet process we expect the first few entries to carry most of the mass, while in the finite-dimensional case the Dirichlet prior is symmetric: we don’t know in advance which weights \(\pi_k^K\) will have vanishing mass.

Although it seems obvious in hindsight, I spent quite some time trying to understand why the stick-breaking sampling procedure for the Dirichlet distribution gives different results!
The stick-breaking sampling procedure for the \(\mathrm{Dirichlet}(\alpha/K, \dotsc, \alpha/K)\) distribution works as follows: \[\begin{align*} u_k &\sim \mathrm{Beta}\big( \alpha/K, \alpha\cdot (1-k/K) \big), \quad k = 1, \dotsc, K-1\\ \pi_1 &= u_1\\ \pi_k &= u_k \prod_{j < k} (1-u_j), \quad k = 2, \dotsc, K-1\\ \pi_K &= 1 - (\pi_1 + \dotsc + \pi_{K-1}) \end{align*}\]

which for \(k \ll K\) corresponds to sampling from (approximately) \(\mathrm{Beta}(\alpha/K, \alpha)\), rather than \(\mathrm{Beta}(1, \alpha)\).
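As a sanity check, one can verify this representation numerically by comparing marginal moments against direct Dirichlet sampling; a sketch with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, K, n_draws = 2.0, 10, 20_000

def dirichlet_stick_breaking():
    """One draw from Dirichlet(alpha/K, ..., alpha/K) via stick-breaking."""
    pi = np.empty(K)
    stick = 1.0
    for k in range(1, K):  # k = 1, ..., K-1
        u = rng.beta(alpha / K, alpha * (1.0 - k / K))
        pi[k - 1] = u * stick
        stick *= 1.0 - u
    pi[K - 1] = stick      # the remaining mass
    return pi

sb = np.array([dirichlet_stick_breaking() for _ in range(n_draws)])
direct = rng.dirichlet(alpha / K * np.ones(K), size=n_draws)
# Marginal means and standard deviations should agree up to Monte Carlo error:
print(sb.mean(axis=0), direct.mean(axis=0))
print(sb.std(axis=0), direct.std(axis=0))
```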

Pitman (1996) describes size-biased permutations, which perhaps can be used to establish a link between \((\pi^K_k)\) for large \(K\) and \((p_k)\), but I haven’t understood it yet.

Defining property

We have seen in what sense the Dirichlet process prior can be thought as of an infinite-dimensional generalization of the Dirichlet prior. However, there is another link.

Recall that the defining property of a Gaussian process is that it is a continuous-time stochastic process \(\{X_t\}_{t\in [0, 1]}\) such that for every finite set of indices \(t_1, t_2, \dotsc, t_m\) the random vector \((X_{t_1}, \dotsc, X_{t_m})\) has a multivariate normal distribution. (In particular, every \(X_t\) is a normal random variable.) While this defining property is not sufficient without a proof of existence (e.g., an explicit construction), it is useful in many calculations involving Gaussian processes.

We will now give the defining property of the Dirichlet process. Take a probability measure \(F_0\) over \(\mathcal M\) and a concentration parameter \(\alpha > 0\). We say that \(\mathrm{DP}(\alpha, F_0)\) is a Dirichlet process if every sample \(F\sim \mathrm{DP}(\alpha, F_0)\) is a probability measure over \(\mathcal M\) such that for every finite measurable partition \(A_1, \dotsc, A_m\) of \(\mathcal M\) the following holds: \[ \left( F(A_1), \dotsc, F(A_m) \right) \sim \mathrm{Dirichlet}\big(\alpha F_0(A_1), \dotsc, \alpha F_0(A_m) \big) \]

In particular, if \(A\subseteq \mathcal M\) is any measurable subset, then we can use the partition \(\{A, \mathcal M\setminus A\}\) to get \[ F(A) \sim \mathrm{Beta}\big( \alpha F_0(A), \alpha(1-F_0(A)) \big),\] so that \[\mathbb E[ F(A) ] = F_0(A)\] and \[\mathrm{Var}[F(A)] = \frac{ F_0(A)\cdot (1-F_0(A)) }{1+\alpha}.\]

Hence, each draw \(F\) is centered around \(F_0\) and the variance is small for large values of the concentration parameter \(\alpha\).
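This property is easy to check empirically with the stick-breaking sampler. A sketch, assuming \(F_0 = \mathcal N(0, 1)\) and \(A = (-\infty, 0]\), so that \(F_0(A) = 1/2\):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, truncation, n_draws = 2.0, 1000, 2000

def mass_of_A():
    """F(A) for A = (-inf, 0] under one truncated draw F ~ DP(alpha, N(0, 1))."""
    v = rng.beta(1.0, alpha, size=truncation)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    mu = rng.normal(size=truncation)
    return p[mu <= 0.0].sum()

fa = np.array([mass_of_A() for _ in range(n_draws)])
print(fa.mean())  # should be close to F0(A) = 0.5
print(fa.var())   # should be close to 0.25 / (1 + alpha)
```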

Pólya urn scheme

Finally, we give an interpretation in terms of the Pólya urn scheme.

Above we considered the sampling process from the finite-dimensional Dirichlet distribution: \[\begin{align*} F\mid \alpha, F_0 &\sim \text{construct atomic measure},\\ \theta_n \mid F &\sim F, \end{align*}\] where each \(\theta_n\) was actually one of the atoms \(\mu_k\) of the distribution.

This interpretation is also easy to understand when the atomic measure \(F\) is drawn from the Dirichlet process using the stick-breaking construction.

Consider now a sampling procedure for \(\theta_n\) where we do not have direct access to \(F\), but only to the distribution \(F_0\), the concentration parameter \(\alpha > 0\), and the previous draws \(\theta_1, \dotsc, \theta_{n-1}\). It holds that \[\theta_n \mid \alpha, F_0, \theta_1, \dotsc, \theta_{n-1} \sim \frac{\alpha}{ (n-1) + \alpha }F_0 + \sum_{i=1}^{n-1} \frac{1}{(n-1)+\alpha}\delta_{ \theta_i }.\]

If \(\alpha\) is a positive integer, we can interpret this sampling procedure as follows: we want to draw the \(n\)th ball and we have an urn with \(\alpha\) transparent balls and \(n-1\) colored balls. We draw a random ball. If it is transparent, we sample a new color from \(F_0\), note it down, and put a ball of this color into the urn (returning the transparent ball as well). If the drawn ball is colored, we note down its color and return it to the urn together with a new ball of the same color.

This also suggests a clustering property: if there is a color \(\mu_k\) such that there are already \(m_k\) balls inside the urn (i.e., \(m_k\) is the number of indices \(1\le i\le n-1\) such that \(\theta_i = \mu_k\)), then we have a larger chance to draw a ball of this color: \[\theta_n \mid \alpha, F_0, \theta_1, \dotsc, \theta_{n-1} \sim \frac{\alpha}{(n-1) + \alpha}F_0 + \sum_{k} \frac{m_k}{ (n-1) + \alpha } \delta_{ \mu_k }.\]

We also see that for the concentration parameter \(\alpha \gg n\) this sampling procedure approximates independent sampling from \(F_0\).
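The urn view yields a sampler for \(\theta_1, \dotsc, \theta_N\) that never materializes \(F\) itself; a minimal sketch (with \(F_0 = \mathcal N(0, 1)\) assumed as before):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, N = 2.0, 100

thetas = []
for n in range(1, N + 1):
    if rng.uniform() < alpha / (n - 1 + alpha):
        thetas.append(rng.normal())        # "transparent ball": a new atom from F0
    else:
        thetas.append(rng.choice(thetas))  # "colored ball": reuse a previous draw
print(len(set(thetas)), "distinct atoms among", N, "draws")
```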

Asymptotic number of clusters

The above formulation can be used to argue why the number of clusters grows as \(O(\alpha\log n)\) if \(F_0\) is non-atomic. Define \(D_1 = 1\) and for \(n\ge 2\) \[D_n = \begin{cases} 1 &\text{ if } \theta_n \notin \{\theta_1, \dotsc, \theta_{n-1}\},\\ 0 &\text{otherwise.}\end{cases}\] From the above construction we know the probability of drawing a new atom, so \[\mathbb E[D_n] = \frac{\alpha}{\alpha + n-1}.\] The expected number of distinct atoms among \(\theta_1, \dotsc, \theta_n\) is then \[\mathbb E[C_n] = \mathbb E[D_1 + \dotsc + D_n] = \alpha \sum_{i=1}^n \frac{1}{\alpha + i - 1}.\] We recognise that this sum behaves like the harmonic series and (as can be proven formally) grows as \(O(\log n)\), so that \(\mathbb E[C_n] = O(\alpha\log n)\). To provide a more precise result: \[\lim_{n\to\infty}\frac{ \mathbb E[C_n] }{\alpha \log n} = 1.\] For this and related results consult these notes.
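We can also check this growth rate empirically: since with a non-atomic \(F_0\) the cluster count depends only on the new-vs-old decisions, it suffices to simulate the sequence of Bernoulli draws (a sketch with assumed parameter values):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, N, n_runs = 2.0, 10_000, 50

def n_clusters():
    """C_N: the number of distinct atoms after N urn draws."""
    clusters = 0
    for n in range(1, N + 1):
        if rng.uniform() < alpha / (n - 1 + alpha):
            clusters += 1  # a new atom is created
    return clusters

c = np.mean([n_clusters() for _ in range(n_runs)])
print(c, alpha * np.log(N))  # the two should be comparable for large N
```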

Chinese restaurant process

The procedure above is also closely related to the Chinese restaurant process, where the metaphor is that there are \(K\) occupied tables (with dish \(\mu_k\) served at the \(k\)th one) and \(m_k\) people sitting around the \(k\)th table. When a new customer enters the restaurant, they can either join an existing table (with probability proportional to \(m_k\)) or start a new, \((K+1)\)st, table with probability proportional to \(\alpha\), where a new dish \(\mu_{K+1}\sim F_0\) will be served.

Afterword

The Dirichlet process is a useful construction, which can be used as a nonparametric prior in clustering problems — instead of specifying a fixed number of clusters one can specify the growth rate via \(\alpha\).

In practice the results (including the number of inferred clusters in a particular data set) need to be treated with caution: the clusters found in the data set need not have a “real world” meaning (or perhaps “clusters” are the wrong abstraction altogether, with the heterogeneity attributable, e.g., to some continuous covariates which could be measured). Careful validation is often needed, especially as these models may be non-robust to misspecification (see this paper on coarsening) or the inferences may be hard to do (see this overview of their intrinsic non-identifiability).

Anyway, although difficult, clustering can provide useful information about a given problem, so we often need to do it. For example, take a look at this application of the Chinese restaurant process to the clustering of single-cell DNA profiles.

References

Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2013. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis. https://books.google.co.uk/books?id=ZXL6AQAAQBAJ.
Pitman, Jim. 1996. “Random Discrete Distributions Invariant Under Size-Biased Permutation.” Advances in Applied Probability 28 (2): 525–39. https://doi.org/10.2307/1428070.
Teh, Yee Whye, Michael I Jordan, Matthew J Beal, and David M Blei. 2006. “Hierarchical Dirichlet Processes.” Journal of the American Statistical Association 101 (476): 1566–81. https://doi.org/10.1198/016214506000000302.