As scikit-learn maintainers, we would love to use PyPI download stats and other similar metrics to help inform some of our decisions.

In this talk we will highlight a number of caveats we discovered while trying to understand the complex reality behind these seemingly simple metrics.

We all love to tell stories with data and we all love to listen to them.

As package maintainers or package users, we resort to proxy metrics (GitHub stars, PyPI download stats, website analytics, …) to help answer inherently hard questions like these:

which package should I install to solve my problem, this one or that one?
which package are the cool kids using these days?
how much is our package actually being used?
did our latest features have any impact on adoption?

In the context of scikit-learn, we will present the kind of surprises and caveats we discovered when trying to make sense of the PyPI download stats.

Highlights include:

the most downloaded scikit-learn release is from 5 years ago: maybe people don’t actually care about our latest developments?
how on earth can a package that errors on install be downloaded 50_000 times a day?
is there any hope of telling “real users” apart from “automation users” (e.g. Continuous Integration)?

We will then zoom out a bit and talk about other metrics we looked at, for example scikit-learn.org website analytics, GitHub stars and “Used by” stats. After presenting the inherent biases of these data sources, we will summarize the kind of insights we gained by combining them.

During the presentation, we will also highlight a few tools and websites we used along the way to look at PyPI download stats in more detail.

We will conclude with some thoughts on how to combine these kinds of metrics to inform some of our decisions, without falling too much in love with the stories we tell with them.

PyPI in the face: running jokes that PyPI download stats can play on you

Loïc Estève