Tasks
A tutorial on how to list and download tasks.
import openml
import pandas as pd
from pprint import pprint
Tasks are identified by IDs and can be accessed in two different ways:
1. In a list providing basic information on all tasks available on OpenML. The listing function does not download the actual tasks; it downloads only the meta data, which can be used to filter the tasks and retrieve a set of IDs. The list can be filtered, for example to tasks carrying a particular tag or to tasks of a specific type such as supervised classification.
2. A single task by its ID. It contains all meta information, the target metric, the splits, and an iterator for accessing the splits.
Listing tasks
We will start by simply listing only supervised classification tasks:
tasks = openml.tasks.list_tasks(task_type_id=1)
openml.tasks.list_tasks() returns a dictionary of dictionaries. We convert it into a pandas dataframe for better visualization and easier access:
tasks = pd.DataFrame.from_dict(tasks, orient='index')
print(tasks.columns)
print("First 5 of %s tasks:" % len(tasks))
pprint(tasks.head())
# The same result can be obtained with fewer lines of code
tasks_df = openml.tasks.list_tasks(task_type_id=1, output_format='dataframe')
pprint(tasks_df.head())
Out:
Index(['tid', 'ttid', 'did', 'name', 'task_type', 'status',
'estimation_procedure', 'evaluation_measures', 'source_data',
'target_feature', 'MajorityClassSize', 'MaxNominalAttDistinctValues',
'MinorityClassSize', 'NumberOfClasses', 'NumberOfFeatures',
'NumberOfInstances', 'NumberOfInstancesWithMissingValues',
'NumberOfMissingValues', 'NumberOfNumericFeatures',
'NumberOfSymbolicFeatures', 'cost_matrix'],
dtype='object')
First 5 of 3278 tasks:
tid ttid ... NumberOfSymbolicFeatures cost_matrix
2 2 1 ... 33.0 NaN
3 3 1 ... 37.0 NaN
4 4 1 ... 9.0 NaN
5 5 1 ... 74.0 NaN
6 6 1 ... 1.0 NaN
[5 rows x 21 columns]
tid ttid ... NumberOfSymbolicFeatures cost_matrix
2 2 1 ... 33.0 NaN
3 3 1 ... 37.0 NaN
4 4 1 ... 9.0 NaN
5 5 1 ... 74.0 NaN
6 6 1 ... 1.0 NaN
[5 rows x 21 columns]
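To make the conversion above more concrete, here is a small offline sketch. The dictionary is hand-written toy data that only imitates the dictionary-of-dictionaries shape described above; the values are illustrative and only a few fields are included. It shows that each outer key becomes a row label and each inner key becomes a column, which is the layout that orient='index' produces.

```python
# Toy stand-in for the dictionary of dictionaries returned by the listing
# call; the values are illustrative only (assumption, not real output).
toy_tasks = {
    2: {"tid": 2, "ttid": 1, "did": 2, "NumberOfInstances": 898},
    3: {"tid": 3, "ttid": 1, "did": 3, "NumberOfInstances": 3196},
}

# Outer keys are task IDs: they become the row labels.
row_labels = list(toy_tasks)

# Inner keys are the meta-data fields: they become the columns.
columns = list(next(iter(toy_tasks.values())))

# One row per task, one value per column.
rows = [[toy_tasks[tid][col] for col in columns] for tid in row_labels]

print(row_labels)
print(columns)
print(rows)
```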
We can filter the list to keep only tasks whose datasets have more than 500 but fewer than 1000 samples:
filtered_tasks = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
print(list(filtered_tasks.index))
Out:
[2, 11, 15, 29, 37, 41, 49, 53, 232, 241, 245, 259, 267, 271, 279, 283, 1766, 1775, 1779, 1793, 1801, 1805, 1813, 1817, 1882, 1891, 1895, 1909, 1917, 1921, 1929, 1933, 1945, 1952, 1956, 1967, 1973, 1977, 1983, 1987, 2079, 2125, 2944, 3022, 3034, 3047, 3049, 3053, 3054, 3055, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 4189, 4191, 4197, 4198, 4199, 4217, 4223, 4225, 4226, 4234, 4240, 4254, 4265, 4266, 4288, 4328, 4341, 4345, 4365, 4395, 4396, 4397, 4409, 4411, 4423, 4499, 4508, 4515, 4517, 4518, 4519, 4522, 4538, 4557, 4558, 4562, 4565, 4572, 4582, 4584, 4591, 4618, 4676, 4684, 4697, 4704, 7286, 7307, 7543, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 12738, 12739, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146574, 146576, 146577, 146578, 146583, 146587, 146588, 146593, 146596, 146597, 146600, 146818, 146819, 166859, 166875, 166882, 166884, 166893, 166905, 166906, 166907, 166913, 166915, 166919, 166947, 166953, 166956, 166957, 166958, 166959, 166960, 166967, 166976, 166977, 166978, 166980, 166983, 166988, 166989, 166992, 167016, 167020, 167031, 167037, 167062, 167067, 167068, 167095, 167096, 167100, 167104, 167106, 167151, 167154, 167160, 167163, 167167, 167168, 167171, 167173, 167174, 167175, 167180, 167184, 167187, 167194, 167198, 168300, 168783, 168819, 168820, 168821, 168822, 168823, 168824, 168825, 168907, 189786, 189859, 189899, 189900, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146]
# Number of tasks
print(len(filtered_tasks))
Out:
279
Then, we can further restrict the tasks so that they all share the same resampling strategy:
filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered_tasks.index))
Out:
[2, 11, 15, 29, 37, 41, 49, 53, 2079, 3022, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 7286, 7307, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146818, 146819, 168300, 168907, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146]
# Number of tasks
print(len(filtered_tasks))
Out:
113
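The two filters above can also be applied without pandas, directly on a dictionary of dictionaries. The sketch below is one possible way to do it, not the library's own method: the records are hand-written toy data (IDs and values are invented for illustration), and filter_tasks is a hypothetical helper. Both bounds are strict, matching the query strings used above.

```python
# Toy records imitating the meta data of four tasks (invented values).
toy_tasks = {
    101: {"NumberOfInstances": 690, "estimation_procedure": "10-fold Crossvalidation"},
    102: {"NumberOfInstances": 768, "estimation_procedure": "33% Holdout set"},
    103: {"NumberOfInstances": 1500, "estimation_procedure": "10-fold Crossvalidation"},
    104: {"NumberOfInstances": 500, "estimation_procedure": "10-fold Crossvalidation"},
}


def filter_tasks(tasks, min_instances, max_instances, procedure):
    """Hypothetical helper: IDs of tasks with min < instances < max
    that use the given estimation procedure."""
    return [
        tid
        for tid, meta in tasks.items()
        if min_instances < meta["NumberOfInstances"] < max_instances
        and meta["estimation_procedure"] == procedure
    ]


selected = filter_tasks(toy_tasks, 500, 1000, "10-fold Crossvalidation")
print(selected)
```

Task 104 is excluded because 500 is not strictly greater than 500, and task 102 is excluded because it uses a different estimation procedure.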
Resampling strategies can be found on the OpenML Website.
Similar to listing tasks by task type, we can list tasks by tags:
tasks = openml.tasks.list_tasks(tag='OpenML100')
tasks = pd.DataFrame.from_dict(tasks, orient='index')
print("First 5 of %s tasks:" % len(tasks))
pprint(tasks.head())
Out:
First 5 of 91 tasks:
tid ttid ... NumberOfNumericFeatures NumberOfSymbolicFeatures
3 3 1 ... 0 37
6 6 1 ... 16 1
11 11 1 ... 4 1
12 12 1 ... 216 1
14 14 1 ... 76 1
[5 rows x 19 columns]
Furthermore, we can list tasks based on the dataset id:
tasks = openml.tasks.list_tasks(data_id=1471)
tasks = pd.DataFrame.from_dict(tasks, orient='index')
print("First 5 of %s tasks:" % len(tasks))
pprint(tasks.head())
Out:
First 5 of 16 tasks:
tid ttid did ... target_value evaluation_measures number_samples
9983 9983 1 1471 ... NaN NaN NaN
14951 14951 1 1471 ... NaN NaN NaN
56483 56483 8 1471 ... 1 NaN NaN
56484 56484 8 1471 ... 1 NaN NaN
56485 56485 8 1471 ... 1 NaN NaN
[5 rows x 23 columns]
In addition, a size limit and an offset can be applied both separately and simultaneously:
tasks = openml.tasks.list_tasks(size=10, offset=50)
tasks = pd.DataFrame.from_dict(tasks, orient='index')
pprint(tasks)
Out:
tid ttid ... NumberOfSymbolicFeatures number_samples
59 59 1 ... 1 NaN
60 60 1 ... 17 NaN
62 62 3 ... 33 9
63 63 3 ... 37 12
64 64 3 ... 9 1
65 65 3 ... 74 7
66 66 3 ... 70 5
67 67 3 ... 1 6
68 68 3 ... 11 5
69 69 3 ... 16 4
[10 rows x 21 columns]
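Combining size and offset makes it possible to walk through a long listing in chunks. The sketch below shows the pattern offline under stated assumptions: fake_list_tasks is a hypothetical stand-in that mimics the size and offset arguments over 25 local entries, and iter_pages is an illustrative helper that is not part of openml.

```python
def fake_list_tasks(size, offset):
    """Hypothetical stand-in for a listing call: returns up to `size`
    entries, skipping the first `offset` entries."""
    all_ids = list(range(1, 26))  # 25 pretend task IDs
    chunk = all_ids[offset:offset + size]
    return {tid: {"tid": tid} for tid in chunk}


def iter_pages(fetch, page_size):
    """Yield successive pages from `fetch` until an empty page is returned."""
    offset = 0
    while True:
        page = fetch(size=page_size, offset=offset)
        if not page:
            break
        yield page
        offset += page_size


pages = list(iter_pages(fake_list_tasks, page_size=10))
page_lengths = [len(page) for page in pages]
print(page_lengths)
```

In real use, fetch could wrap openml.tasks.list_tasks, which accepts the same two keyword arguments as shown above.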
OpenML 100 is a curated list of 100 tasks for getting started with OpenML. They are all supervised classification tasks with more than 500 and fewer than 50000 instances per task. To make things easier, the tasks do not contain highly unbalanced or sparse data. However, the tasks do include missing values and categorical features. You can find out more about the OpenML 100 on the OpenML benchmarking page.
Finally, it is also possible to list all tasks on OpenML with:
tasks = openml.tasks.list_tasks()
tasks = pd.DataFrame.from_dict(tasks, orient='index')
print(len(tasks))
Out:
19397
Downloading tasks
We provide two functions to download tasks: one downloads a single task by its ID, and the other takes a list of IDs and downloads all of these tasks. First, a single task:
task_id = 31
task = openml.tasks.get_task(task_id)
Properties of the task are stored as member variables:
pprint(vars(task))
Out:
{'class_labels': ['good', 'bad'],
'cost_matrix': None,
'dataset_id': 31,
'estimation_procedure': {'data_splits_url': 'https://www.openml.org/api_splits/get/31/Task_31_splits.arff',
'parameters': {'number_folds': '10',
'number_repeats': '1',
'percentage': '',
'stratified_sampling': 'true'},
'type': 'crossvalidation'},
'estimation_procedure_id': 1,
'evaluation_measure': None,
'split': None,
'target_name': 'class',
'task_id': 31,
'task_type': 'Supervised Classification',
'task_type_id': 1}
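Note that the estimation procedure parameters in the output above are stored as strings. A minimal sketch of converting them to typed values follows. The input is copied from the printed output; parse_parameters is a hypothetical helper and not part of openml, and treating the empty string as None is an assumption made for this illustration.

```python
# Parameters copied from the printed output above.
parameters = {
    "number_folds": "10",
    "number_repeats": "1",
    "percentage": "",
    "stratified_sampling": "true",
}


def parse_parameters(params):
    """Hypothetical helper: convert string values to int, bool, or None."""
    parsed = {}
    for key, value in params.items():
        if value == "":
            parsed[key] = None  # assumption: empty string means "not set"
        elif value in ("true", "false"):
            parsed[key] = value == "true"
        elif value.isdigit():
            parsed[key] = int(value)
        else:
            parsed[key] = value  # leave anything unrecognized untouched
    return parsed


typed = parse_parameters(parameters)
print(typed)
```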
To download several tasks at once, pass a list of IDs:
ids = [2, 1891, 31, 9983]
tasks = openml.tasks.get_tasks(ids)
pprint(tasks[0])
Out:
<openml.tasks.task.OpenMLClassificationTask object at 0x7f15aa144c18>
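The default representation printed above says little about the task itself. One possible way to get a readable one-line summary is to format a few of the attributes listed earlier (task_id, task_type, dataset_id, target_name). To keep the sketch runnable offline, a SimpleNamespace stands in for the downloaded task object; summarize is a hypothetical helper, not part of openml.

```python
from types import SimpleNamespace

# Stand-in for a downloaded task, using the attribute values printed
# earlier for task 31 (assumption: a real task exposes the same names).
stand_in = SimpleNamespace(
    task_id=31,
    task_type="Supervised Classification",
    dataset_id=31,
    target_name="class",
)


def summarize(task):
    """Hypothetical helper: one-line description of a task-like object."""
    return "Task {} ({}): dataset {}, target '{}'".format(
        task.task_id, task.task_type, task.dataset_id, task.target_name
    )


summary = summarize(stand_in)
print(summary)
```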