Feurer et al. (2015)

A tutorial on how to obtain the datasets used in the paper that introduced Auto-sklearn, by Feurer et al.

Auto-sklearn website: https://automl.github.io/auto-sklearn/

Publication

Efficient and Robust Automated Machine Learning
Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter
In Advances in Neural Information Processing Systems 28, 2015
# License: BSD 3-Clause

import pandas as pd

import openml

List of dataset IDs given in the supplementary material of Feurer et al.: https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning-supplemental.zip

# fmt: off
dataset_ids = [
    3, 6, 12, 14, 16, 18, 21, 22, 23, 24, 26, 28, 30, 31, 32, 36, 38, 44, 46,
    57, 60, 179, 180, 181, 182, 184, 185, 273, 293, 300, 351, 354, 357, 389,
    390, 391, 392, 393, 395, 396, 398, 399, 401, 554, 679, 715, 718, 720, 722,
    723, 727, 728, 734, 735, 737, 740, 741, 743, 751, 752, 761, 772, 797, 799,
    803, 806, 807, 813, 816, 819, 821, 822, 823, 833, 837, 843, 845, 846, 847,
    849, 866, 871, 881, 897, 901, 903, 904, 910, 912, 913, 914, 917, 923, 930,
    934, 953, 958, 959, 962, 966, 971, 976, 977, 978, 979, 980, 991, 993, 995,
    1000, 1002, 1018, 1019, 1020, 1021, 1036, 1040, 1041, 1049, 1050, 1053,
    1056, 1067, 1068, 1069, 1111, 1112, 1114, 1116, 1119, 1120, 1128, 1130,
    1134, 1138, 1139, 1142, 1146, 1161, 1166,
]
# fmt: on

The dataset IDs could be used directly to load the datasets and split the data into a training set and a test set. However, for the sake of reproducibility, we first obtain the respective tasks from OpenML, which define both the target feature and the train/test split.

Note

Working directly on datasets and reporting only dataset IDs in a paper is discouraged, because the train/test split is then unspecified and the results cannot be reproduced. Please base a paper on the respective tasks rather than on datasets, and publish the task IDs. This example only showcases the use of OpenML-Python for a published paper and serves as a warning on how not to do it. Please check the OpenML documentation of tasks if you want to learn more about them.
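
As a rough illustration of what a task adds over a bare dataset ID, the sketch below fetches one task and reads its target name and its predefined train/test split. It assumes network access to the OpenML server when called; `describe_task_split` is a hypothetical helper name and not part of OpenML-Python.

```python
def describe_task_split(task_id):
    """Return the target name and train/test sizes that a task defines."""
    # Imported lazily so that merely defining this helper needs no network access.
    import openml

    task = openml.tasks.get_task(task_id)
    # A holdout task defines a single split, addressed as repeat 0, fold 0.
    train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
    return {
        "task_id": task_id,
        "dataset_id": task.dataset_id,
        "target": task.target_name,
        "n_train": len(train_idx),
        "n_test": len(test_idx),
    }
```

Calling the helper with a supervised classification task ID downloads the task description and its split file on first use, so it can be slow.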

This lists both active and inactive tasks (because of status='all'). Unfortunately, this is necessary because some of the datasets were found to contain issues after publication and were deactivated, which also deactivated the tasks defined on them. More information on active and inactive datasets can be found in the online docs.

tasks = openml.tasks.list_tasks(
    task_type=openml.tasks.TaskType.SUPERVISED_CLASSIFICATION,
    status="all",
    output_format="dataframe",
)

# Query only those with holdout as the resampling strategy.
tasks = tasks.query('estimation_procedure == "33% Holdout set"')

task_ids = []
for did in dataset_ids:
    tasks_ = list(tasks.query(f"did == {did}").tid)
    if len(tasks_) >= 1:  # if there are multiple tasks, take the one with the lowest ID (oldest).
        task_id = min(tasks_)
    else:
        raise ValueError(did)

    # Optional - Check that the task has the same target attribute as the
    # dataset default target attribute
    # (disabled for this example as it needs to run fast to be rendered online)
    # task = openml.tasks.get_task(task_id)
    # dataset = task.get_dataset()
    # if task.target_name != dataset.default_target_attribute:
    #     raise ValueError(
    #         (task.target_name, dataset.default_target_attribute)
    #     )

    task_ids.append(task_id)

assert len(task_ids) == 140
task_ids.sort()

# These are the tasks to work with:
print(task_ids)
[233, 236, 242, 244, 246, 248, 251, 252, 253, 254, 256, 258, 260, 261, 262, 266, 273, 275, 288, 2117, 2118, 2119, 2120, 2122, 2123, 2350, 3043, 3044, 75090, 75092, 75093, 75098, 75099, 75100, 75103, 75104, 75105, 75106, 75107, 75108, 75111, 75112, 75113, 75114, 75115, 75116, 75117, 75119, 75120, 75121, 75122, 75125, 75126, 75129, 75131, 75133, 75136, 75137, 75138, 75139, 75140, 75142, 75143, 75146, 75147, 75148, 75149, 75150, 75151, 75152, 75153, 75155, 75157, 75159, 75160, 75161, 75162, 75163, 75164, 75165, 75166, 75168, 75169, 75170, 75171, 75172, 75173, 75174, 75175, 75176, 75179, 75180, 75182, 75183, 75184, 75185, 75186, 75188, 75189, 75190, 75191, 75192, 75194, 75195, 75196, 75197, 75198, 75199, 75200, 75201, 75202, 75203, 75204, 75205, 75206, 75207, 75208, 75209, 75210, 75212, 75213, 75216, 75218, 75220, 75222, 75224, 75226, 75228, 75229, 75233, 75238, 75240, 75244, 75245, 75246, 75247, 75248, 75249, 75251, 190400]
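
The selection rule used above (keep only holdout tasks, then take the lowest task ID per dataset) can be sketched without any network access. The records below are made-up stand-ins for rows of the task listing, and `oldest_task_per_dataset` is a hypothetical helper name; neither comes from OpenML.

```python
from collections import defaultdict

HOLDOUT = "33% Holdout set"

# Made-up records; the IDs are illustrative and do not refer to real OpenML tasks.
toy_tasks = [
    {"tid": 11, "did": 1, "estimation_procedure": HOLDOUT},
    {"tid": 10, "did": 1, "estimation_procedure": HOLDOUT},
    {"tid": 12, "did": 1, "estimation_procedure": "10-fold Crossvalidation"},
    {"tid": 20, "did": 2, "estimation_procedure": HOLDOUT},
    {"tid": 5, "did": 2, "estimation_procedure": "10-fold Crossvalidation"},
]


def oldest_task_per_dataset(records, dataset_ids, procedure=HOLDOUT):
    """Map each dataset ID to its lowest-numbered task that uses `procedure`."""
    candidates = defaultdict(list)
    for record in records:
        if record["estimation_procedure"] == procedure:
            candidates[record["did"]].append(record["tid"])

    selected = {}
    for did in dataset_ids:
        if did not in candidates:
            # Mirrors the tutorial: a dataset without a matching task is an error.
            raise ValueError(f"no {procedure!r} task found for dataset {did}")
        selected[did] = min(candidates[did])
    return selected


selected = oldest_task_per_dataset(toy_tasks, dataset_ids=[1, 2])
print(sorted(selected.values()))  # prints [10, 20]
```

Grouping the records once and then looking up each dataset avoids re-scanning the full listing for every dataset ID, which is what the repeated `query` call in the loop above does.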
