Bayesian quantification with black-box estimators
A. Ziegler, P. Czyż
Transactions on Machine Learning Research (2024)
Premise
We have an unlabeled data set and an imperfect classifier. Can we provide Bayesian estimates of the class prevalence vector?
Abstract
Understanding how different classes are distributed in an unlabeled data set is an important challenge for the calibration of probabilistic classifiers and uncertainty quantification. Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier trained on a different (shifted) data set to estimate the class distribution and yield asymptotic guarantees under weak assumptions. We demonstrate that all these algorithms are closely related to inference in a particular Bayesian model, approximating the assumed ground-truth generative process. Then, we discuss an efficient Markov chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the state of the art.
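To give a concrete flavour of the idea, here is a minimal sketch (not the paper's implementation; all names and the plain-NumPy Gibbs sampler below are illustrative simplifications). Assume the black-box classifier's misclassification rates \(P(\text{prediction} \mid \text{true class})\) have been estimated on a labeled holdout set. The true labels of the unlabeled examples are latent, so the sampler alternates between imputing them and resampling the prevalence vector from its Dirichlet full conditional:

```python
import numpy as np

def gibbs_prevalence(pred_counts, confusion, alpha=1.0, n_steps=2000, seed=0):
    """Illustrative Gibbs sampler for the class prevalence vector.

    pred_counts: (C,) counts of black-box predictions on the unlabeled set.
    confusion:   (K, C) matrix, confusion[y, c] = P(prediction c | true class y),
                 e.g. estimated on a labeled holdout set.
    alpha:       symmetric Dirichlet prior concentration for the prevalences.
    """
    rng = np.random.default_rng(seed)
    K, C = confusion.shape
    pi = np.full(K, 1.0 / K)  # initial prevalence vector
    samples = []
    for _ in range(n_steps):
        # Impute latent true labels: for each predicted class c, split its
        # examples across true classes with P(y | c, pi) ∝ pi[y] * confusion[y, c].
        true_counts = np.zeros(K)
        for c in range(C):
            post = pi * confusion[:, c]
            true_counts += rng.multinomial(pred_counts[c], post / post.sum())
        # Resample prevalences from the Dirichlet full conditional.
        pi = rng.dirichlet(alpha + true_counts)
        samples.append(pi)
    return np.asarray(samples)

# Example: 3 classes, a noisy classifier, 1000 unlabeled predictions.
confusion = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.1, 0.2, 0.7]])
pred_counts = np.array([500, 300, 200])
samples = gibbs_prevalence(pred_counts, confusion)
print(samples[500:].mean(axis=0))  # posterior mean after burn-in
```

Point estimators such as adjusted classify and count instead solve the linear system \(p(c) = \sum_y P(c \mid y)\, \pi_y\) for a single \(\pi\); the posterior samples above additionally quantify the uncertainty of the estimate.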
Errata
- Lemma A.3 should read \(\int_{\mathcal X} \left| \sum_{y\in \mathcal{Y} } \lambda_y k_y(x) \right|\, \mathrm{d}\mu(x)\), rather than \(\int_{\mathcal X} \left| \sum_{y\in \mathcal{Y} } k_y(x) \right| \, \mathrm{d}\mu(x)\).
- In November 2024, I found an article in JASA by Fiksel et al. (2021), which in turn links to the Biostatistics paper of Datta et al. (2018). I think these articles are closely related and important, but they are missing from our literature review in Section 2.
Links
- We have already discussed quantification in the post on Gibbs sampling and the EM algorithm. It is also related to the strict linear independence of measures.
- Bayesian quantification originated in the form of unsupervised recalibration. Albert described the background story (and related problems) on the Humans of AI podcast (18:15–25:45).
- Bayesian quantification is available in the QuaPy package! Many thanks to Alejandro Moreo Fernández for his kindness and contributions in pull requests 28 and 29.
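A rough usage sketch follows. I am assuming the method is exposed as `BayesianCC` in `quapy.method.aggregative` and follows QuaPy's usual fit/quantify interface; please check the QuaPy documentation for the exact class name and constructor arguments.

```python
import quapy as qp
from sklearn.linear_model import LogisticRegression
# Assumed import path and class name -- verify against the QuaPy docs.
from quapy.method.aggregative import BayesianCC

# Standard QuaPy workflow: fit a quantifier on labeled data, then
# estimate the class prevalence vector of an unlabeled sample.
data = qp.datasets.fetch_reviews('imdb', tfidf=True, min_df=5)
quantifier = BayesianCC(classifier=LogisticRegression())
quantifier.fit(data.training)
prevalence = quantifier.quantify(data.test.instances)
print(prevalence)
```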
Citation
@article{bayesian-quantification,
  title={Bayesian Quantification with Black-Box Estimators},
  author={Albert Ziegler and Pawe{\l} Czy{\.z}},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=Ft4kHrOawZ}
}
Acknowledgments
I had a great time working on this! This is mostly due to Albert Ziegler and Ian Wright, who were wonderful mentors to work with. I am also grateful to Semmle/GitHub UK and the ETH AI Center for funding this work.
Behind the scenes
- I’m grateful for this experience, especially as this was my entry point to probabilistic machine learning and statistical inference. There were quite a few versions of this paper, and it may be tricky to see how the project changed: it started with asymptotic consistency guarantees in the large-data limit, then moved to a maximum likelihood approach (and later a penalised maximum likelihood, to regularise the predictions), and finally we formalised the problem in terms of probabilistic graphical models and employed Bayesian inference.
- The project started in July 2018 and was accepted for publication in May 2024, which (as of June 2024) makes it my personal record for the time needed to publish a project. When I don’t want to feel too bad about this, I tell myself that (a) I needed to learn many new things to reach the final version of the paper; (b) for most of this time it was a side project; and (c) hey, as the great Leslie Lamport explains on his website, he first submitted the paper introducing the Paxos algorithm in 1990 and finally published it in 1998. Of course, Paxos is in an entirely different league than our paper, but it’s good to know (especially for an early-career researcher) that even the greatest scientists with the best ideas sometimes have to spend years trying to publish.