Feurer et al. (2015)
A tutorial on how to get the datasets used in the paper introducing Auto-sklearn by Feurer et al.
Auto-sklearn website: https://automl.github.io/auto-sklearn/
Publication
# License: BSD 3-Clause
import pandas as pd
import openml
List of dataset IDs given in the supplementary material of Feurer et al.: https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning-supplemental.zip
# fmt: off
dataset_ids = [
3, 6, 12, 14, 16, 18, 21, 22, 23, 24, 26, 28, 30, 31, 32, 36, 38, 44, 46,
57, 60, 179, 180, 181, 182, 184, 185, 273, 293, 300, 351, 354, 357, 389,
390, 391, 392, 393, 395, 396, 398, 399, 401, 554, 679, 715, 718, 720, 722,
723, 727, 728, 734, 735, 737, 740, 741, 743, 751, 752, 761, 772, 797, 799,
803, 806, 807, 813, 816, 819, 821, 822, 823, 833, 837, 843, 845, 846, 847,
849, 866, 871, 881, 897, 901, 903, 904, 910, 912, 913, 914, 917, 923, 930,
934, 953, 958, 959, 962, 966, 971, 976, 977, 978, 979, 980, 991, 993, 995,
1000, 1002, 1018, 1019, 1020, 1021, 1036, 1040, 1041, 1049, 1050, 1053,
1056, 1067, 1068, 1069, 1111, 1112, 1114, 1116, 1119, 1120, 1128, 1130,
1134, 1138, 1139, 1142, 1146, 1161, 1166,
]
# fmt: on
The dataset IDs could be used directly to load the dataset and split the data into a training set and a test set. However, to be reproducible, we will first obtain the respective tasks from OpenML, which define both the target feature and the train/test split.
Note
Working directly on datasets and providing only dataset IDs in a paper is discouraged, because it does not allow reproducibility (the splitting is unclear). Please use the respective tasks rather than the datasets as the basis for a paper, and publish the task IDs. This example is only given to showcase the use of OpenML-Python for a published paper and as a warning on how not to do it. Please check the OpenML documentation of tasks if you want to learn more about them.
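To make the point about tasks concrete, the following is a minimal sketch of how a task exposes both the target feature and the predefined split. The helper name `describe_task` is hypothetical and not part of OpenML-Python; the sketch assumes that `openml.tasks.get_task`, `task.target_name`, and `task.get_train_test_split_indices` behave as in the OpenML-Python version used for this example. Calling the helper requires network access to the OpenML server.

```python
def describe_task(task_id):
    """Return the target name and the holdout split indices of an OpenML task.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so that merely defining the helper
    # does not need the package or a network connection.
    import openml

    task = openml.tasks.get_task(task_id)
    # For a holdout task there is a single repeat, fold and sample.
    train_idx, test_idx = task.get_train_test_split_indices(
        repeat=0, fold=0, sample=0
    )
    return task.target_name, train_idx, test_idx
```

A call such as `describe_task(233)` would then give the target attribute together with the row indices of the training and test sets, so anyone using the same task ID works on the same split.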
This lists both active and inactive tasks (because of status='all'). Unfortunately, this is necessary: some of the datasets were found to contain issues after the publication and were deactivated, which also deactivated the tasks based on them. More information on active or inactive datasets can be found in the online docs.
tasks = openml.tasks.list_tasks(
task_type=openml.tasks.TaskType.SUPERVISED_CLASSIFICATION,
status="all",
output_format="dataframe",
)
# Query only those with holdout as the resampling strategy.
tasks = tasks.query('estimation_procedure == "33% Holdout set"')
task_ids = []
for did in dataset_ids:
    tasks_ = list(tasks.query("did == {}".format(did)).tid)
    if len(tasks_) >= 1:  # if there are multiple tasks, take the one with the lowest ID (oldest).
        task_id = min(tasks_)
    else:
        raise ValueError(did)
    # Optional - Check that the task has the same target attribute as the
    # dataset default target attribute
    # (disabled for this example as it needs to run fast to be rendered online)
    # task = openml.tasks.get_task(task_id)
    # dataset = task.get_dataset()
    # if task.target_name != dataset.default_target_attribute:
    #     raise ValueError(
    #         (task.target_name, dataset.default_target_attribute)
    #     )
    task_ids.append(task_id)
assert len(task_ids) == 140
task_ids.sort()
# These are the tasks to work with:
print(task_ids)
[233, 236, 242, 244, 246, 248, 251, 252, 253, 254, 256, 258, 260, 261, 262, 266, 273, 275, 288, 2117, 2118, 2119, 2120, 2122, 2123, 2350, 3043, 3044, 75090, 75092, 75093, 75098, 75099, 75100, 75103, 75104, 75105, 75106, 75107, 75108, 75111, 75112, 75113, 75114, 75115, 75116, 75117, 75119, 75120, 75121, 75122, 75125, 75126, 75129, 75131, 75133, 75136, 75137, 75138, 75139, 75140, 75142, 75143, 75146, 75147, 75148, 75149, 75150, 75151, 75152, 75153, 75155, 75157, 75159, 75160, 75161, 75162, 75163, 75164, 75165, 75166, 75168, 75169, 75170, 75171, 75172, 75173, 75174, 75175, 75176, 75179, 75180, 75182, 75183, 75184, 75185, 75186, 75188, 75189, 75190, 75191, 75192, 75194, 75195, 75196, 75197, 75198, 75199, 75200, 75201, 75202, 75203, 75204, 75205, 75206, 75207, 75208, 75209, 75210, 75212, 75213, 75216, 75218, 75220, 75222, 75224, 75226, 75228, 75229, 75233, 75238, 75240, 75244, 75245, 75246, 75247, 75248, 75249, 75251, 190400]
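The per-dataset loop above can also be written as a single pandas group-by. The sketch below illustrates this on a small, made-up task table; `toy_tasks` and `toy_dataset_ids` are hypothetical stand-ins for the real listing and ID list, with column names mirroring those used above. It is an alternative formulation, not the code used to produce the output shown.

```python
import pandas as pd

# Hypothetical toy task table; only the columns used in this example.
toy_tasks = pd.DataFrame(
    {
        "tid": [11, 12, 13, 14, 15],
        "did": [3, 3, 6, 12, 12],
        "estimation_procedure": [
            "33% Holdout set",
            "10-fold Crossvalidation",
            "33% Holdout set",
            "33% Holdout set",
            "33% Holdout set",
        ],
    }
)
toy_dataset_ids = [3, 6, 12]

# Keep only holdout tasks, as in the query above.
holdout = toy_tasks.query('estimation_procedure == "33% Holdout set"')

# For every requested dataset, take the task with the lowest ID (oldest).
oldest = holdout[holdout["did"].isin(toy_dataset_ids)].groupby("did")["tid"].min()

# Fail loudly if a dataset has no matching task, like the ValueError above.
missing = sorted(set(toy_dataset_ids) - set(oldest.index))
if missing:
    raise ValueError(f"No holdout task found for dataset IDs: {missing}")

toy_task_ids = sorted(oldest.tolist())
print(toy_task_ids)
```

Compared with the loop, this reports all datasets without a matching task at once instead of stopping at the first one.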