Talk

Writing a scikit-learn compatible estimator in the modern age

Language: English
Audience level: Advanced
Elevator pitch

Let’s live code and develop some scikit-learn compatible {meta, simple}-estimators together. This talk covers recent API developments in scikit-learn, and gives you the tools to handle scikit-learn API requirements. We also show how to be compatible across scikit-learn versions.

Abstract

In many data science tasks, use-case-specific requirements force us to slightly modify the behavior of some of the estimators in scikit-learn. Logging, enriching data through a database connection, and adding domain knowledge to the data are all examples of such situations.

Developing a scikit-learn compatible estimator might at first glance seem very straightforward if one limits the API to the fit and predict methods. However, an estimator needs to implement quite a bit more to work well with scikit-learn tools and meta-estimators. Enabling features such as metadata routing or the pipeline’s get_feature_names_out also requires extra attention.
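To make the point about "quite a bit more" concrete, here is a minimal sketch of a transformer that goes slightly beyond fit and transform. The class name ColumnCenterer and its with_mean parameter are hypothetical and chosen only for illustration; the sketch is not the code developed in the talk, and a real estimator would likely need further work to pass all of scikit-learn's checks.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.utils.validation import check_array, check_is_fitted


class ColumnCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts each column's training mean."""

    def __init__(self, with_mean=True):
        # __init__ only stores hyperparameters, unmodified, so that
        # get_params, set_params and clone keep working.
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X = check_array(X)
        # Learned state uses a trailing underscore by convention.
        self.n_features_in_ = X.shape[1]
        self.mean_ = X.mean(axis=0) if self.with_mean else np.zeros(X.shape[1])
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X - self.mean_

    def get_feature_names_out(self, input_features=None):
        # One output column per input column, so names pass through.
        check_is_fitted(self)
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray(input_features, dtype=object)


X = np.array([[1.0, 10.0], [3.0, 30.0]])
centerer = ColumnCenterer().fit(X)
X_centered = centerer.transform(X)
names = centerer.get_feature_names_out()
fresh = clone(centerer)  # unfitted copy with the same hyperparameters
```

Inheriting from BaseEstimator supplies get_params and set_params, which is what lets clone and the meta-estimators copy and reconfigure the object.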

Some of the tips and requirements in this regard are not necessarily well documented by the library, and it can be cumbersome to find those details. In this hands-on session you will learn how to write your own scikit-learn estimator (transformer, regressor, classifier, or meta-estimator) that can be used in a sklearn.pipeline.Pipeline and works seamlessly with the other meta-estimators of the library, such as GridSearchCV. The session also covers how such estimators can be conveniently tested with the extensive set of tests included in scikit-learn.
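As a rough illustration of what "works seamlessly" can look like, the sketch below places a custom transformer in a Pipeline and tunes its hyperparameter with GridSearchCV. The Clipper class, its limit parameter, and the synthetic dataset are assumptions made for this example only. scikit-learn's shared test suite can be run on an estimator with sklearn.utils.estimator_checks.check_estimator; it is not called here because a toy estimator like this one is not guaranteed to pass every check.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_array


class Clipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that clips values to [-limit, limit]."""

    def __init__(self, limit=3.0):
        self.limit = limit

    def fit(self, X, y=None):
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = check_array(X)
        return np.clip(X, -self.limit, self.limit)


# Small synthetic dataset, used only to exercise the pipeline.
X, y = make_classification(
    n_samples=60, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

pipe = Pipeline([("clip", Clipper()), ("clf", LogisticRegression())])

# "clip__limit" addresses the custom transformer's hyperparameter, which
# works because BaseEstimator provides get_params and set_params.
search = GridSearchCV(pipe, param_grid={"clip__limit": [1.0, 3.0]}, cv=3)
search.fit(X, y)
predictions = search.predict(X)
```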

There have also been recent developments related to the general API of the estimators which require slight modifications by third-party developers. Many previously private utilities are now public and loosely categorised under a “developer API” umbrella, and we’ll use them in our estimators in the session.
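One example of a utility that moved from private to public is input validation: recent scikit-learn releases (1.6 and later, to the best of my knowledge) expose sklearn.utils.validation.validate_data, while earlier releases offered the private BaseEstimator._validate_data method. The sketch below shows one possible way to support both; the fallback shim and the MaxAbsScalerSketch class are assumptions made for illustration and not necessarily the approach taken in the session.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

try:
    # Public developer API in newer scikit-learn releases.
    from sklearn.utils.validation import validate_data
except ImportError:
    # Assumed fallback for older releases: delegate to the private method.
    # This shim only handles X, which is all the sketch below needs.
    def validate_data(estimator, X, **kwargs):
        return estimator._validate_data(X, **kwargs)


class MaxAbsScalerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical transformer scaling each column by its max absolute value."""

    def fit(self, X, y=None):
        # reset=True records n_features_in_ on the estimator.
        X = validate_data(self, X=X, reset=True)
        # Guard against division by zero for all-zero columns.
        self.scale_ = np.maximum(np.abs(X).max(axis=0), 1e-12)
        return self

    def transform(self, X):
        # reset=False checks X against what was seen during fit.
        X = validate_data(self, X=X, reset=False)
        return X / self.scale_


X_train = np.array([[1.0, -2.0], [-4.0, 1.0]])
scaler = MaxAbsScalerSketch().fit(X_train)
X_scaled = scaler.transform(X_train)

# Passing a different number of features than seen in fit should be rejected.
try:
    scaler.transform(np.ones((1, 3)))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Routing validation through a single helper keeps the version-specific branch in one place, so the estimator code itself stays the same across releases.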

Tags: Machine-Learning, APIs
Participant

Adrin Jalali

Adrin is a maintainer of a few open source projects, including scikit-learn, fairlearn, and skops. He has a PhD in computational biology and has been working in the open source space for the past 6 years.