Tasks

A tutorial on how to list and download tasks.

# License: BSD 3-Clause

import openml
import pandas as pd

Tasks are identified by IDs and can be accessed in two different ways:

  1. In a list providing basic information on all tasks available on OpenML. Listing does not download the actual tasks; it only downloads metadata that can be used to filter the tasks and retrieve a set of IDs. The list can be filtered, for example by tag or by task type, such as supervised classification.

  2. A single task by its ID. It contains all meta information, the target metric, the splits, and an iterator for looping over the splits.

Listing tasks

We will start by simply listing only supervised classification tasks:

tasks = openml.tasks.list_tasks(task_type_id=1)

openml.tasks.list_tasks() returns a dictionary of dictionaries by default, which we convert into a pandas dataframe to have better visualization capabilities and easier access:

tasks = pd.DataFrame.from_dict(tasks, orient='index')
print(tasks.columns)
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

# As conversion to a pandas dataframe is a common task, we have added this functionality to the
# OpenML-Python library which can be used by passing ``output_format='dataframe'``:
tasks_df = openml.tasks.list_tasks(task_type_id=1, output_format='dataframe')
print(tasks_df.head())

Out:

Index(['tid', 'ttid', 'did', 'name', 'task_type', 'status',
       'estimation_procedure', 'evaluation_measures', 'source_data',
       'target_feature', 'MajorityClassSize', 'MaxNominalAttDistinctValues',
       'MinorityClassSize', 'NumberOfClasses', 'NumberOfFeatures',
       'NumberOfInstances', 'NumberOfInstancesWithMissingValues',
       'NumberOfMissingValues', 'NumberOfNumericFeatures',
       'NumberOfSymbolicFeatures', 'cost_matrix'],
      dtype='object')
First 5 of 3307 tasks:
   tid  ttid  ...  NumberOfSymbolicFeatures cost_matrix
2    2     1  ...                      33.0         NaN
3    3     1  ...                      37.0         NaN
4    4     1  ...                       9.0         NaN
5    5     1  ...                      74.0         NaN
6    6     1  ...                       1.0         NaN

[5 rows x 21 columns]
   tid  ttid  ...  NumberOfSymbolicFeatures cost_matrix
2    2     1  ...                      33.0         NaN
3    3     1  ...                      37.0         NaN
4    4     1  ...                       9.0         NaN
5    5     1  ...                      74.0         NaN
6    6     1  ...                       1.0         NaN

[5 rows x 21 columns]
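To see what the conversion does without contacting the server, the following minimal sketch applies the same from_dict call to a hand-written dictionary. The two entries and all their values are invented for illustration and only mimic the shape of the real listing.

```python
import pandas as pd

# Hypothetical stand-in for the dict of dicts returned by list_tasks():
# outer keys are task IDs, inner dicts hold the per-task metadata.
toy_listing = {
    101: {"tid": 101, "ttid": 1, "name": "toy-a", "NumberOfInstances": 898},
    102: {"tid": 102, "ttid": 1, "name": "toy-b", "NumberOfInstances": 3196},
}

# orient="index" turns each outer key into a row label and each inner key
# into a column.
toy_df = pd.DataFrame.from_dict(toy_listing, orient="index")
print(toy_df)
```

Each task becomes one row, which is why the task ID shows up both as the row label and in the tid column.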

We can filter the list down to tasks whose dataset has more than 500 but fewer than 1000 samples:

filtered_tasks = tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
print(list(filtered_tasks.index))

Out:

[2, 11, 15, 29, 37, 41, 49, 53, 232, 241, 245, 259, 267, 271, 279, 283, 1766, 1775, 1779, 1793, 1801, 1805, 1813, 1817, 1882, 1891, 1895, 1909, 1917, 1921, 1929, 1933, 1945, 1952, 1956, 1967, 1973, 1977, 1983, 1987, 2079, 2125, 2944, 3022, 3034, 3047, 3049, 3053, 3054, 3055, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 4189, 4191, 4197, 4198, 4199, 4217, 4223, 4225, 4226, 4234, 4240, 4254, 4265, 4266, 4288, 4328, 4341, 4345, 4365, 4395, 4396, 4397, 4409, 4411, 4423, 4499, 4508, 4515, 4517, 4518, 4519, 4522, 4538, 4557, 4558, 4562, 4565, 4572, 4582, 4584, 4591, 4618, 4676, 4684, 4697, 4704, 7286, 7307, 7543, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 12738, 12739, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146574, 146576, 146577, 146578, 146583, 146587, 146588, 146593, 146596, 146597, 146600, 146818, 146819, 166859, 166875, 166882, 166884, 166893, 166905, 166906, 166907, 166913, 166915, 166919, 166947, 166953, 166956, 166957, 166958, 166959, 166960, 166967, 166976, 166977, 166978, 166980, 166983, 166988, 166989, 166992, 167016, 167020, 167031, 167037, 167062, 167067, 167068, 167095, 167096, 167100, 167104, 167106, 167151, 167154, 167160, 167163, 167167, 167168, 167171, 167173, 167174, 167175, 167180, 167184, 167187, 167194, 167198, 168300, 168783, 168819, 168820, 168821, 168822, 168823, 168824, 168825, 168907, 189786, 189859, 189899, 189900, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146]
# Number of tasks
print(len(filtered_tasks))

Out:

279
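The query string is shorthand for an ordinary boolean mask. The sketch below shows both forms side by side on a small invented table, so it runs without the server; only the column name is taken from the real listing.

```python
import pandas as pd

# Invented metadata for four tasks.
toy_tasks = pd.DataFrame(
    {"NumberOfInstances": [150.0, 625.0, 898.0, 5000.0]},
    index=[11, 12, 13, 14],
)

# String expression, as used above.
by_query = toy_tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")

# Equivalent boolean mask; both bounds are exclusive.
mask = (toy_tasks["NumberOfInstances"] > 500) & (toy_tasks["NumberOfInstances"] < 1000)
by_mask = toy_tasks[mask]

print(list(by_query.index))  # [12, 13]
```

The mask form can be handy when the bounds come from variables rather than literals.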

Then, we can further restrict the tasks to all have the same resampling strategy:

filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered_tasks.index))

Out:

[2, 11, 15, 29, 37, 41, 49, 53, 2079, 3022, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 7286, 7307, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146818, 146819, 168300, 168907, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146]
# Number of tasks
print(len(filtered_tasks))

Out:

113
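The filter above only matches the exact string "10-fold Crossvalidation". If you are unsure how a resampling strategy is spelled in the listing, counting the distinct values first can help. A sketch on invented rows:

```python
import pandas as pd

# Invented listing; only the column name and label style follow the real one.
toy_tasks = pd.DataFrame(
    {
        "estimation_procedure": [
            "10-fold Crossvalidation",
            "10-fold Crossvalidation",
            "33% Holdout set",
            "10-fold Learning Curve",
        ]
    },
    index=[21, 22, 23, 24],
)

# Count how often each resampling strategy occurs; the labels printed here
# are the strings to use in a query.
counts = toy_tasks["estimation_procedure"].value_counts()
print(counts)

same_split = toy_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(same_split.index))  # [21, 22]
```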

The available resampling strategies are listed on the OpenML website.

Similar to listing tasks by task type, we can list tasks by tags:

tasks = openml.tasks.list_tasks(tag='OpenML100', output_format='dataframe')
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

Out:

First 5 of 91 tasks:
    tid  ttid  ...  NumberOfNumericFeatures NumberOfSymbolicFeatures
3     3     1  ...                        0                       37
6     6     1  ...                       16                        1
11   11     1  ...                        4                        1
12   12     1  ...                      216                        1
14   14     1  ...                       76                        1

[5 rows x 19 columns]

Furthermore, we can list tasks based on the dataset id:

tasks = openml.tasks.list_tasks(data_id=1471, output_format='dataframe')
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

Out:

First 5 of 17 tasks:
         tid  ttid   did  ... target_value evaluation_measures number_samples
9983    9983     1  1471  ...          NaN                 NaN            NaN
14951  14951     1  1471  ...          NaN                 NaN            NaN
56483  56483     8  1471  ...            1                 NaN            NaN
56484  56484     8  1471  ...            1                 NaN            NaN
56485  56485     8  1471  ...            1                 NaN            NaN

[5 rows x 23 columns]

In addition, a size limit and an offset can be applied both separately and simultaneously:

tasks = openml.tasks.list_tasks(size=10, offset=50, output_format='dataframe')
print(tasks)

Out:

    tid  ttid  ...  NumberOfSymbolicFeatures number_samples
59   59     1  ...                         1            NaN
60   60     1  ...                        16            NaN
62   62     3  ...                        33              9
63   63     3  ...                        37             12
64   64     3  ...                         9              1
65   65     3  ...                        74              7
66   66     3  ...                        70              5
67   67     3  ...                         0              6
68   68     3  ...                        11              5
69   69     3  ...                        16              4

[10 rows x 21 columns]
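Size and offset together allow a listing to be read page by page. The sketch below is one possible way to do that. The helper iter_pages and its fetch_page argument are hypothetical names, not part of OpenML-Python, and the server is replaced by a fake function so the example runs offline.

```python
import pandas as pd


def iter_pages(fetch_page, page_size, max_pages):
    """Yield successive pages until one comes back short or empty.

    fetch_page is a hypothetical callable that accepts size and offset
    keyword arguments and returns a DataFrame.
    """
    for page in range(max_pages):
        frame = fetch_page(size=page_size, offset=page * page_size)
        if len(frame) == 0:
            break
        yield frame
        if len(frame) < page_size:
            break


# Offline stand-in for the server: 25 fake task IDs.
_fake_ids = list(range(1, 26))


def fake_fetch(size, offset):
    ids = _fake_ids[offset:offset + size]
    return pd.DataFrame({"tid": ids}, index=ids)


pages = list(iter_pages(fake_fetch, page_size=10, max_pages=5))
print([len(p) for p in pages])  # [10, 10, 5]
```

Against the real server, a small wrapper that forwards size and offset to openml.tasks.list_tasks with output_format='dataframe' could presumably be passed as fetch_page.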

OpenML 100, the tag used in the listing by tag above, is a curated list of 100 tasks for getting started with OpenML. They are all supervised classification tasks with more than 500 and fewer than 50000 instances per task. To keep things simple, the tasks do not contain highly unbalanced or sparse data. However, they do include missing values and categorical features. You can find out more about the OpenML 100 on the OpenML benchmarking page.

Finally, it is also possible to list all tasks on OpenML with:

tasks = openml.tasks.list_tasks(output_format='dataframe')
print(len(tasks))

Out:

22861

Exercise

Search for the tasks on the ‘eeg-eye-state’ dataset.

tasks.query('name=="eeg-eye-state"')

Out:

       MajorityClassSize  MaxNominalAttDistinctValues  MinorityClassSize  NumberOfClasses  NumberOfFeatures  NumberOfInstances  \
3509              8257.0                          2.0             6723.0              2.0              15.0            14980.0
4690              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8030              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8031              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8032              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8033              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8034              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8035              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8036              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8037              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8579              8257.0                          2.0             6723.0              2.0              15.0            14980.0
8661              8257.0                          2.0             6723.0              2.0              15.0            14980.0
9858              8257.0                          2.0             6723.0              2.0              15.0            14980.0
11910             8257.0                          2.0             6723.0              2.0              15.0            14980.0
13191             8257.0                          2.0             6723.0              2.0              15.0            14980.0
17285             8257.0                          2.0             6723.0              2.0              15.0            14980.0
20718             8257.0                          2.0             6723.0              2.0              15.0            14980.0

       NumberOfInstancesWithMissingValues  NumberOfMissingValues  NumberOfNumericFeatures  NumberOfSymbolicFeatures  cost_matrix  \
3509                                  0.0                    0.0                     14.0                       1.0          NaN
4690                                  0.0                    0.0                     14.0                       1.0          NaN
8030                                  0.0                    0.0                     14.0                       1.0          NaN
8031                                  0.0                    0.0                     14.0                       1.0          NaN
8032                                  0.0                    0.0                     14.0                       1.0          NaN
8033                                  0.0                    0.0                     14.0                       1.0          NaN
8034                                  0.0                    0.0                     14.0                       1.0          NaN
8035                                  0.0                    0.0                     14.0                       1.0          NaN
8036                                  0.0                    0.0                     14.0                       1.0          NaN
8037                                  0.0                    0.0                     14.0                       1.0          NaN
8579                                  0.0                    0.0                     14.0                       1.0          NaN
8661                                  0.0                    0.0                     14.0                       1.0          NaN
9858                                  0.0                    0.0                     14.0                       1.0          NaN
11910                                 0.0                    0.0                     14.0                       1.0          NaN
13191                                 0.0                    0.0                     14.0                       1.0          NaN
17285                                 0.0                    0.0                     14.0                       1.0          NaN
20718                                 0.0                    0.0                     14.0                       1.0          NaN

        did     estimation_procedure  evaluation_measures           name  number_samples   quality_measure  source_data  \
3509   1471  10-fold Crossvalidation                  NaN  eeg-eye-state             NaN               NaN         1471
4690   1471  10-fold Crossvalidation                  NaN  eeg-eye-state             NaN               NaN         1471
8030   1471                      NaN                  NaN  eeg-eye-state             NaN   Cortana Quality         1471
8031   1471                      NaN                  NaN  eeg-eye-state             NaN  Information gain         1471
8032   1471                      NaN                  NaN  eeg-eye-state             NaN     Binomial test         1471
8033   1471                      NaN                  NaN  eeg-eye-state             NaN           Jaccard         1471
8034   1471                      NaN                  NaN  eeg-eye-state             NaN   Cortana Quality         1471
8035   1471                      NaN                  NaN  eeg-eye-state             NaN  Information gain         1471
8036   1471                      NaN                  NaN  eeg-eye-state             NaN     Binomial test         1471
8037   1471                      NaN                  NaN  eeg-eye-state             NaN           Jaccard         1471
8579   1471          33% Holdout set  predictive_accuracy  eeg-eye-state             NaN               NaN         1471
8661   1471   10-fold Learning Curve                  NaN  eeg-eye-state              17               NaN         1471
9858   1471      50 times Clustering                  NaN  eeg-eye-state             NaN               NaN         1471
11910  1471                      NaN                  NaN  eeg-eye-state             NaN               NaN         1471
13191  1471                      NaN                  NaN  eeg-eye-state             NaN               NaN         1471
17285  1471      50 times Clustering                  NaN  eeg-eye-state             NaN               NaN         1471
20718  1471      50 times Clustering                  NaN  eeg-eye-state             NaN               NaN         1471

       source_data_labeled  status  target_feature  target_feature_event  target_feature_left  target_feature_right  target_value                  task_type     tid  ttid
3509                   NaN  active           Class                   NaN                  NaN                   NaN           NaN  Supervised Classification    9983     1
4690                   NaN  active           Class                   NaN                  NaN                   NaN           NaN  Supervised Classification   14951     1
8030                   NaN  active           Class                   NaN                  NaN                   NaN             1         Subgroup Discovery   56483     8
8031                   NaN  active           Class                   NaN                  NaN                   NaN             1         Subgroup Discovery   56484     8
8032                   NaN  active           Class                   NaN                  NaN                   NaN             1         Subgroup Discovery   56485     8
8033                   NaN  active           Class                   NaN                  NaN                   NaN             1         Subgroup Discovery   56486     8
8034                   NaN  active           Class                   NaN                  NaN                   NaN             2         Subgroup Discovery   56487     8
8035                   NaN  active           Class                   NaN                  NaN                   NaN             2         Subgroup Discovery   56488     8
8036                   NaN  active           Class                   NaN                  NaN                   NaN             2         Subgroup Discovery   56489     8
8037                   NaN  active           Class                   NaN                  NaN                   NaN             2         Subgroup Discovery   56490     8
8579                   NaN  active           Class                   NaN                  NaN                   NaN           NaN  Supervised Classification   75219     1
8661                   NaN  active           Class                   NaN                  NaN                   NaN           NaN             Learning Curve  125901     3
9858                   NaN  active             NaN                   NaN                  NaN                   NaN           NaN                 Clustering  127251     5
11910                  NaN  active           Class                   NaN                  NaN                   NaN           NaN                 Clustering  146761     5
13191                  NaN  active             NaN                   NaN                  NaN                   NaN           NaN                 Clustering  148117     5
17285                  NaN  active             NaN                   NaN                  NaN                   NaN           NaN                 Clustering  170139     5
20718                  NaN  active             NaN                   NaN                  NaN                   NaN           NaN                 Clustering  191646     5
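
The full result is wide because every row repeats the dataset statistics. Selecting a few columns after the query makes the tasks easier to tell apart. A minimal sketch on invented rows (the names and IDs below are made up):

```python
import pandas as pd

# Invented rows that mimic a few columns of the full listing.
toy_tasks = pd.DataFrame(
    {
        "tid": [9001, 9002, 9003],
        "name": ["toy-eeg", "toy-eeg", "toy-other"],
        "task_type": [
            "Supervised Classification",
            "Clustering",
            "Supervised Classification",
        ],
        "estimation_procedure": ["10-fold Crossvalidation", None, "33% Holdout set"],
    }
)

matches = toy_tasks.query('name == "toy-eeg"')

# Keep only the columns needed to tell the matching tasks apart.
summary = matches[["tid", "task_type", "estimation_procedure"]]
print(summary)
```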


Downloading tasks

We provide two functions to download tasks: one downloads a single task by its ID, and the other takes a list of IDs and downloads all of those tasks. First, a single task:

task_id = 31
task = openml.tasks.get_task(task_id)

Properties of the task are stored as member variables:

print(task)

Out:

OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/1
Task ID..............: 31
Task URL.............: https://www.openml.org/t/31
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
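
A downloaded task also carries its train/test splits. The sketch below loops over all of them. It assumes the task object offers get_split_dimensions() and get_train_test_split_indices(), as supervised tasks in OpenML-Python do in the versions we are aware of; check your installed version. The FakeTask class is a hypothetical stand-in so the example runs offline.

```python
import numpy as np


def iter_splits(task):
    """Yield (repeat, fold, sample, train_idx, test_idx) for every split."""
    n_repeats, n_folds, n_samples = task.get_split_dimensions()
    for repeat in range(n_repeats):
        for fold in range(n_folds):
            for sample in range(n_samples):
                train_idx, test_idx = task.get_train_test_split_indices(
                    repeat=repeat, fold=fold, sample=sample
                )
                yield repeat, fold, sample, train_idx, test_idx


class FakeTask:
    """Hypothetical offline stand-in: 1 repeat, 2 folds, 1 sample, 6 rows."""

    def get_split_dimensions(self):
        return 1, 2, 1

    def get_train_test_split_indices(self, fold=0, repeat=0, sample=0):
        rows = np.arange(6)
        test_idx = rows[rows % 2 == fold]   # even rows for fold 0, odd for fold 1
        train_idx = rows[rows % 2 != fold]  # the remaining rows
        return train_idx, test_idx


for repeat, fold, sample, train_idx, test_idx in iter_splits(FakeTask()):
    print(repeat, fold, sample, len(train_idx), len(test_idx))
```

With a real task, the object returned by openml.tasks.get_task would take the place of FakeTask().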

To download several tasks at once, pass a list of IDs:

ids = [2, 1891, 31, 9983]
tasks = openml.tasks.get_tasks(ids)
print(tasks[0])

Out:

OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/1
Task ID..............: 2
Task URL.............: https://www.openml.org/t/2
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 6
Cost Matrix..........: Available

Creating tasks

You can also create new tasks. Take the following into account:

  • You can only create tasks on active datasets.

  • For now, only the following tasks are supported: classification, regression, clustering, and learning curve analysis.

  • For now, tasks can only be created on a single dataset.

  • The exact same task must not already exist.

Creating a task requires the following input:

  • task_type_id: The task type ID (see below). Required.

  • dataset_id: The dataset ID. Required.

  • target_name: The name of the attribute you aim to predict. Optional.

  • estimation_procedure_id: The ID of the estimation procedure used to create train-test splits. Optional.

  • evaluation_measure: The name of the evaluation measure. Optional.

  • Any additional inputs for specific tasks
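
The split between required and optional inputs can be sketched as a small helper that collects the arguments and drops the optional ones that were left unset. The name build_task_kwargs is hypothetical and not part of OpenML-Python; the parameter names are the ones listed above. Whether the installed version of openml.tasks.create_task accepts every optional argument being omitted is an assumption to verify.

```python
def build_task_kwargs(task_type_id, dataset_id, target_name=None,
                      estimation_procedure_id=None, evaluation_measure=None,
                      **extra):
    """Collect task inputs, dropping optional ones that were left unset."""
    if task_type_id is None or dataset_id is None:
        raise ValueError("task_type_id and dataset_id are required")
    optional = {
        "target_name": target_name,
        "estimation_procedure_id": estimation_procedure_id,
        "evaluation_measure": evaluation_measure,
    }
    kwargs = {"task_type_id": task_type_id, "dataset_id": dataset_id}
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    kwargs.update(extra)  # additional inputs for specific task types
    return kwargs


# The evaluation measure is left open, as recommended for most tasks.
kwargs = build_task_kwargs(task_type_id=1, dataset_id=128, target_name="class")
print(kwargs)
```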

It is best to leave the evaluation measure open if there is no strong prerequisite for a specific measure. OpenML will always compute all appropriate measures, and you can filter or sort results on your favourite measure afterwards. Only add an evaluation measure if necessary (e.g. when other measures make no sense), since doing so creates a new task, which scatters results across tasks.

Example

Let’s create a classification task on a dataset. In this example we use the Iris dataset (ID=128 on the test server), 10-fold cross-validation (ID=1), and predictive accuracy as the predefined measure (this can also be left open). If a task with these parameters already exists, the server responds with an exception (error code 614). Otherwise, a task will be created and the corresponding task_id will be returned.

# using test server for example uploads
openml.config.start_using_configuration_for_example()

try:
    tasktypes = openml.tasks.TaskTypeEnum
    my_task = openml.tasks.create_task(
        task_type_id=tasktypes.SUPERVISED_CLASSIFICATION,
        dataset_id=128,
        target_name="class",
        evaluation_measure="predictive_accuracy",
        estimation_procedure_id=1)
    my_task.publish()
except openml.exceptions.OpenMLServerException as e:
    # Error code for 'task already exists'
    if e.code == 614:
        # Lookup task
        tasks = openml.tasks.list_tasks(data_id=128, output_format='dataframe')
        tasks = tasks.query('task_type == "Supervised Classification" '
                            'and estimation_procedure == "10-fold Crossvalidation" '
                            'and evaluation_measures == "predictive_accuracy"')
        task_id = tasks.loc[:, "tid"].values[0]
        print("Task already exists. Task ID is", task_id)
    else:
        # Any other server error is unexpected; re-raise instead of hiding it
        raise

# reverting to prod server
openml.config.stop_using_configuration_for_example()

Out:

Task already exists. Task ID is 1408
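
The lookup in the except branch takes the first element of the query result, which fails with an IndexError when nothing matches. The sketch below shows one way to make that step explicit. The helper find_task_id is a hypothetical name, and the listing is invented so the example runs offline.

```python
import pandas as pd


def find_task_id(tasks, task_type, estimation_procedure, evaluation_measure):
    """Return the first matching task ID, or None when nothing matches."""
    match = tasks[
        (tasks["task_type"] == task_type)
        & (tasks["estimation_procedure"] == estimation_procedure)
        & (tasks["evaluation_measures"] == evaluation_measure)
    ]
    if match.empty:
        return None
    return int(match["tid"].iloc[0])


# Invented listing for one dataset.
toy_tasks = pd.DataFrame(
    {
        "tid": [7001, 7002],
        "task_type": ["Supervised Classification", "Supervised Classification"],
        "estimation_procedure": ["10-fold Crossvalidation", "33% Holdout set"],
        "evaluation_measures": ["predictive_accuracy", "predictive_accuracy"],
    }
)

found = find_task_id(
    toy_tasks, "Supervised Classification", "10-fold Crossvalidation", "predictive_accuracy"
)
missing = find_task_id(
    toy_tasks, "Clustering", "10-fold Crossvalidation", "predictive_accuracy"
)
print(found)    # 7001
print(missing)  # None
```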
