OpenML

💻 Installation

The OpenML package is available in many languages and integrates deeply with many machine learning libraries.

You can find detailed guides for the different libraries in the top menu.

🔑 Authentication

OpenML is entirely open, and you do not need an account to access data (rate limits apply). Signing up via the OpenML website is free, and an account is required to upload new resources to OpenML and to manage them online.

API authentication happens via an API key, which you can find in your profile after logging in to openml.org.

openml.config.apikey = "YOUR KEY"
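
Hard-coding the key in a script makes it easy to leak by accident. A minimal sketch of one alternative is to read it from an environment variable. The variable name `OPENML_API_KEY` and the helper `read_api_key` are hypothetical choices for this illustration, not part of the OpenML package.

```python
import os


def read_api_key(env_var="OPENML_API_KEY"):
    """Return the API key stored in `env_var`, or None if it is unset or blank.

    Both this helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    return key or None


key = read_api_key()
if key is None:
    print("No API key found; downloading data still works without one.")
else:
    # With the openml package imported, the key would then be assigned
    # exactly as in the line above: openml.config.apikey = key
    print("API key loaded from the environment.")
```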

🕹 Minimal Example

Use the following code to load the credit-g dataset directly into a pandas dataframe. Note that OpenML can automatically load all datasets, separate data X and labels y, and give you useful dataset metadata (e.g. feature names and which ones have categorical data).

import openml

dataset = openml.datasets.get_dataset("credit-g") # or by ID get_dataset(31)
X, y, categorical_indicator, attribute_names = dataset.get_data(target="class")
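
The last two return values are aligned lists: one boolean per feature and one name per feature. As a rough sketch of how they might be combined, the snippet below partitions feature names into categorical and numeric groups. It uses made-up stand-in values so it runs without downloading anything; the feature names and the helper `split_feature_names` are assumptions for illustration.

```python
# Stand-in values shaped like the last two return values of dataset.get_data();
# the feature names below are made up for illustration.
attribute_names = ["duration", "purpose", "age", "housing"]
categorical_indicator = [False, True, False, True]


def split_feature_names(names, is_categorical):
    """Partition feature names into (categorical, numeric) lists."""
    categorical = [n for n, flag in zip(names, is_categorical) if flag]
    numeric = [n for n, flag in zip(names, is_categorical) if not flag]
    return categorical, numeric


categorical_cols, numeric_cols = split_feature_names(
    attribute_names, categorical_indicator
)
print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

A split like this is a common first step before choosing an encoding for the categorical columns and a scaling for the numeric ones.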

🏆 Get a task for supervised classification on credit-g. A task specifies how a dataset should be used, including, for example, its train and test splits.

task = openml.tasks.get_task(31)
dataset = task.get_dataset()
X, y, categorical_indicator, attribute_names = dataset.get_data(target=task.target_name)
# get splits for the first fold of 10-fold cross-validation
train_indices, test_indices = task.get_train_test_split_indices(fold=0)
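
The split indices are row positions, so they select the training and test portions of the data. A minimal sketch with toy stand-in data follows; the values and the helper `take` are assumptions for illustration, and plain lists replace the real data so the snippet runs on its own.

```python
# Toy stand-ins: six rows of features and labels, plus index lists shaped like
# the pair returned by task.get_train_test_split_indices(fold=0).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = ["good", "bad", "good", "good", "bad", "good"]
train_indices = [0, 1, 3, 4]
test_indices = [2, 5]


def take(rows, indices):
    """Select rows by position, in the order given by `indices`."""
    return [rows[i] for i in indices]


X_train, y_train = take(X, train_indices), take(y, train_indices)
X_test, y_test = take(X, test_indices), take(y, test_indices)

# Train and test positions should never overlap within a fold.
assert set(train_indices).isdisjoint(test_indices)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

With the pandas objects returned by `get_data`, the same positional selection is typically written as `X.iloc[train_indices]` and `y.iloc[train_indices]`.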

📊 Use an OpenML benchmarking suite to get a curated list of machine-learning tasks:

suite = openml.study.get_suite("amlb-classification-all")  # Get a curated list of tasks for classification
for task_id in suite.tasks:
    task = openml.tasks.get_task(task_id)

🌟 You can now benchmark your models easily across many datasets at once. Each application of a model to a task is called a run:

from sklearn import neighbors

task = openml.tasks.get_task(403)
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
run = openml.runs.run_model_on_task(clf, task)

🙌 You can now publish your experiment on OpenML so that others can build on it:

myrun = run.publish()
dataset = task.get_dataset()  # the dataset behind the task that was just run
print(f"kNN on {dataset.name}: {myrun.openml_url}")

Learning more about OpenML

Next, check out the 🚀 10 minute tutorial and the 🎓 short description of OpenML concepts.