OpenML

Installation

The OpenML package is available for many programming languages and integrates deeply with many machine learning libraries.

Python/sklearn repository
pip install openml

PyTorch repository
pip install openml-pytorch

TensorFlow repository
pip install openml-tensorflow

R repository
install.packages("mlr3oml")

Julia repository
using Pkg; Pkg.add("OpenML")

Rust repository
Install from source

.NET repository
Install-Package openMl

You can find detailed guides for the different libraries in the top menu.

Authentication

OpenML is entirely open: you do not need an account to access data, although rate limits apply. Signing up on the OpenML website is free and easy, and an account is required to upload new resources to OpenML and to manage them online.

API authentication happens via an API key, which you can find in your profile after logging in to openml.org.

openml.config.apikey = "YOUR KEY"
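
Hard-coding the key in a script makes it easy to leak through version control. A minimal sketch of one alternative, assuming the key has been exported in an environment variable. The variable name `OPENML_API_KEY` and the helpers `read_api_key` and `configure_openml` are illustrative and not part of the OpenML package:

```python
import os


def read_api_key(env_var="OPENML_API_KEY"):
    """Return the API key stored in env_var, or None if it is unset or blank.

    The variable name is a hypothetical choice made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    return key or None


def configure_openml(env_var="OPENML_API_KEY"):
    """Set openml.config.apikey from the environment; return True if a key was found."""
    import openml  # imported here so read_api_key works without the package

    key = read_api_key(env_var)
    if key is None:
        return False  # no key: downloading data still works, uploading does not
    openml.config.apikey = key
    return True
```

Calling `configure_openml()` once at the start of a script keeps the key out of the source file.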

Minimal Example

Use the following code to load the credit-g dataset directly into a pandas dataframe. OpenML can load any dataset automatically, separate the features X from the labels y, and return useful dataset metadata, such as the feature names and which features are categorical.

import openml

dataset = openml.datasets.get_dataset("credit-g")  # or by ID: get_dataset(31)
X, y, categorical_indicator, attribute_names = dataset.get_data(target="class")
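
The metadata can be put to work directly, for example to decide which columns need encoding. A minimal sketch, assuming `categorical_indicator` is a list of booleans and `attribute_names` a list of strings of the same length, one entry per feature column. The helper `split_feature_names` and the three sample values are illustrative stand-ins, not output fetched from OpenML:

```python
def split_feature_names(attribute_names, categorical_indicator):
    """Return (categorical_names, numeric_names), preserving column order."""
    if len(attribute_names) != len(categorical_indicator):
        raise ValueError("expected one categorical flag per feature name")
    categorical, numeric = [], []
    for name, is_categorical in zip(attribute_names, categorical_indicator):
        (categorical if is_categorical else numeric).append(name)
    return categorical, numeric


# Stand-in values shaped like the two lists returned by dataset.get_data.
names = ["checking_status", "duration", "credit_amount"]
flags = [True, False, False]
categorical, numeric = split_feature_names(names, flags)
print("categorical:", categorical)
print("numeric:", numeric)
```

With real data, pass the `attribute_names` and `categorical_indicator` values from the call above instead of the stand-in lists.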

Get a task for supervised classification on credit-g. A task specifies how a dataset should be used, including, for example, the train and test splits.

task = openml.tasks.get_task(31)
dataset = task.get_dataset()
X, y, categorical_indicator, attribute_names = dataset.get_data(target=task.target_name)
# get splits for the first fold of 10-fold cross-validation
train_indices, test_indices = task.get_train_test_split_indices(fold=0)
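
The split indices still have to be applied to the data. A minimal sketch, assuming the indices are row positions (so position-based selection with `.iloc` applies) and that X and y are the pandas objects returned by `get_data`. The helpers `check_split` and `apply_split` are illustrative:

```python
def check_split(train_indices, test_indices):
    """Raise ValueError if a row position appears in both index collections."""
    overlap = set(map(int, train_indices)) & set(map(int, test_indices))
    if overlap:
        raise ValueError(f"{len(overlap)} rows are in both train and test")


def apply_split(X, y, train_indices, test_indices):
    """Return (X_train, y_train, X_test, y_test) using position-based selection.

    Assumes X is a pandas DataFrame and y a pandas Series.
    """
    check_split(train_indices, test_indices)
    return (
        X.iloc[train_indices],
        y.iloc[train_indices],
        X.iloc[test_indices],
        y.iloc[test_indices],
    )
```

For the fold above, `X_train, y_train, X_test, y_test = apply_split(X, y, train_indices, test_indices)` yields the data for one round of cross-validation.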

Use an OpenML benchmarking suite to get a curated list of machine-learning tasks:

suite = openml.study.get_suite("amlb-classification-all")  # Get a curated list of tasks for classification
for task_id in suite.tasks:
    task = openml.tasks.get_task(task_id)

You can now benchmark your models easily across many datasets at once. Training and evaluating a model on a task is called a run:

from sklearn import neighbors

task = openml.tasks.get_task(403)
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
run = openml.runs.run_model_on_task(clf, task)
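
When looping over a whole suite, a single failing task (a download error, an unsupported data type) would otherwise abort the benchmark. A minimal sketch of a loop that records failures and carries on. The helpers `run_on_tasks` and `knn_run` are illustrative, and `knn_run` reuses only the calls shown above:

```python
def run_on_tasks(task_ids, run_one):
    """Call run_one(task_id) for every task; collect failures instead of stopping.

    Returns (results, failures), both dictionaries keyed by task ID.
    """
    results, failures = {}, {}
    for task_id in task_ids:
        try:
            results[task_id] = run_one(task_id)
        except Exception as exc:  # keep going so one bad task does not end the loop
            failures[task_id] = repr(exc)
    return results, failures


def knn_run(task_id):
    """Run a 5-nearest-neighbours classifier on one OpenML task (needs network access)."""
    import openml
    from sklearn import neighbors

    task = openml.tasks.get_task(task_id)
    clf = neighbors.KNeighborsClassifier(n_neighbors=5)
    return openml.runs.run_model_on_task(clf, task)
```

With the suite loaded earlier, `results, failures = run_on_tasks(suite.tasks, knn_run)` would attempt one run per task and report which task IDs failed.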

You can now publish your experiment on OpenML so that others can build on it:

myrun = run.publish()
print(f"kNN on {task.get_dataset().name}: {myrun.openml_url}")

Learning more about OpenML

Next, check out the 10-minute tutorial and the short description of OpenML concepts.