Tasks

A tutorial on how to list and download tasks.

# License: BSD 3-Clause

import openml
from openml.tasks import TaskType
import pandas as pd

Tasks are identified by IDs and can be accessed in two different ways:

As a list that provides basic information on all tasks available on OpenML. Listing does not download the tasks themselves; it downloads only metadata, which can be used to filter the tasks and retrieve a set of IDs. For example, we can list only the tasks that carry a particular tag, or only the tasks of a particular type such as supervised classification.
As a single task retrieved by its ID. It contains all meta information, the target metric, the splits, and an iterator for accessing the splits.

Listing tasks

We will start by simply listing only supervised classification tasks. openml.tasks.list_tasks() returns a dictionary of dictionaries by default, but we request a pandas dataframe instead to have better visualization capabilities and easier access:

tasks = openml.tasks.list_tasks(
    task_type=TaskType.SUPERVISED_CLASSIFICATION, output_format="dataframe"
)
print(tasks.columns)
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

Index(['tid', 'ttid', 'did', 'name', 'task_type', 'status',
       'estimation_procedure', 'evaluation_measures', 'source_data',
       'target_feature', 'MajorityClassSize', 'MaxNominalAttDistinctValues',
       'MinorityClassSize', 'NumberOfClasses', 'NumberOfFeatures',
       'NumberOfInstances', 'NumberOfInstancesWithMissingValues',
       'NumberOfMissingValues', 'NumberOfNumericFeatures',
       'NumberOfSymbolicFeatures', 'cost_matrix'],
      dtype='object')
First 5 of 4415 tasks:
   tid  ... cost_matrix
2    2  ...         NaN
3    3  ...         NaN
4    4  ...         NaN
5    5  ...         NaN
6    6  ...         NaN

[5 rows x 21 columns]
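
For code that still receives the dictionary-of-dictionaries form, the conversion to a dataframe can be sketched with plain pandas. The two records below are made-up stand-ins rather than real OpenML listings, and the subset of fields is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a dictionary of dictionaries keyed by task ID.
# The values are invented for illustration and are not real OpenML records.
tasks_dict = {
    2: {"tid": 2, "did": 2, "name": "dataset-a", "NumberOfInstances": 898},
    3: {"tid": 3, "did": 3, "name": "dataset-b", "NumberOfInstances": 3196},
}

# orient="index" turns each outer key into a row label and each inner
# dictionary into one row.
tasks_df = pd.DataFrame.from_dict(tasks_dict, orient="index")

print(tasks_df.columns.tolist())
print(tasks_df.loc[3, "name"])
```

Requesting the dataframe directly, as done above, avoids this extra step.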

We can filter the list to keep only tasks whose datasets have more than 500 but fewer than 1000 samples:

filtered_tasks = tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")
print(list(filtered_tasks.index))

[2, 11, 15, 29, 37, 41, 49, 53, 232, 241, 245, 259, 267, 271, 279, 283, 1766, 1775, 1779, 1793, 1801, 1805, 1813, 1817, 1882, 1891, 1895, 1909, 1917, 1921, 1929, 1933, 1945, 1952, 1956, 1967, 1973, 1977, 1983, 1987, 2079, 2125, 2944, 3022, 3034, 3047, 3049, 3053, 3054, 3055, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 4189, 4191, 4197, 4198, 4199, 4217, 4223, 4225, 4226, 4234, 4240, 4254, 4265, 4266, 4288, 4328, 4341, 4345, 4365, 4395, 4396, 4397, 4409, 4411, 4423, 4499, 4508, 4515, 4517, 4518, 4519, 4522, 4538, 4557, 4558, 4562, 4565, 4572, 4582, 4584, 4591, 4618, 4676, 4684, 4697, 4704, 7286, 7307, 7543, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 12738, 12739, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146574, 146576, 146577, 146578, 146583, 146587, 146588, 146593, 146596, 146597, 146600, 146818, 146819, 166859, 166875, 166882, 166884, 166893, 166905, 166906, 166907, 166913, 166915, 166919, 166947, 166953, 166956, 166957, 166958, 166959, 166960, 166967, 166976, 166977, 166978, 166980, 166983, 166988, 166989, 166992, 167016, 167020, 167031, 167037, 167062, 167067, 167068, 167095, 167096, 167100, 167104, 167106, 167151, 167154, 167160, 167163, 167167, 167168, 167171, 167173, 167174, 167175, 167180, 167184, 167187, 167194, 167198, 168300, 168783, 168819, 168820, 168821, 168822, 168823, 168824, 168825, 168907, 189786, 189859, 189899, 189900, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146, 233090, 233094, 233109, 233115, 233171, 233206, 359953, 359954, 359955, 360857, 360865, 360868, 360869, 
360951, 360953, 360964, 361107, 361109, 361146, 361147, 361148, 361149, 361150, 361151, 361152, 361153, 361154, 361155, 361156, 361157, 361158, 361159, 361160, 361161, 361163, 361164, 361165, 361166, 361167, 361168, 361169, 361170, 361171, 361172, 361173, 361174, 361175, 361176, 361183, 361185, 361190, 361305, 361338, 361340, 361412, 361415, 361424, 361426, 361432, 361433, 361434, 361436, 361437, 361440, 361442, 361443, 361457, 361463, 361483, 361486, 361492, 361495, 361498, 361499, 361502, 361504, 361506, 361507, 361512, 361517, 361522, 361529, 361531, 361540, 361557, 361572, 361624, 361650, 361660, 361983, 361984, 361985, 361986, 361987, 361988, 361989, 362067, 362071, 362120]

# Number of tasks
print(len(filtered_tasks))
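
The query string used above is equivalent to a boolean mask, which may be easier to build programmatically. The sketch below checks this on a toy dataframe; the values are invented, and only the column name NumberOfInstances follows the real listing.

```python
import pandas as pd

# Toy listing with invented values; only the column names follow the
# real task listing shown above.
toy_tasks = pd.DataFrame(
    {
        "tid": [1, 2, 3, 4],
        "NumberOfInstances": [150, 898, 1000, 501],
    }
).set_index("tid")

# Filter with a query string, as in the tutorial.
by_query = toy_tasks.query("NumberOfInstances > 500 and NumberOfInstances < 1000")

# The same filter written as a boolean mask.
n_instances = toy_tasks["NumberOfInstances"]
by_mask = toy_tasks[(n_instances > 500) & (n_instances < 1000)]

print(list(by_query.index))
print(by_query.equals(by_mask))
```

Both bounds are strict, so a dataset with exactly 1000 samples is excluded.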

Then, we can further restrict the tasks to all have the same resampling strategy:

filtered_tasks = filtered_tasks.query('estimation_procedure == "10-fold Crossvalidation"')
print(list(filtered_tasks.index))

[2, 11, 15, 29, 37, 41, 49, 53, 2079, 3022, 3484, 3486, 3492, 3493, 3494, 3512, 3518, 3520, 3521, 3529, 3535, 3549, 3560, 3561, 3583, 3623, 3636, 3640, 3660, 3690, 3691, 3692, 3704, 3706, 3718, 3794, 3803, 3810, 3812, 3813, 3814, 3817, 3833, 3852, 3853, 3857, 3860, 3867, 3877, 3879, 3886, 3913, 3971, 3979, 3992, 3999, 7286, 7307, 7548, 7558, 9904, 9905, 9946, 9950, 9971, 9980, 9989, 9990, 10097, 10098, 10101, 14954, 14968, 145682, 145800, 145804, 145805, 145825, 145836, 145839, 145848, 145878, 145882, 145914, 145917, 145952, 145959, 145970, 145976, 145978, 146062, 146064, 146065, 146066, 146069, 146092, 146156, 146216, 146219, 146231, 146818, 146819, 168300, 168907, 189932, 189937, 189941, 190136, 190138, 190139, 190140, 190143, 190146, 233171, 359953, 359954, 359955, 360857, 360865, 360868, 360869, 360951, 360953, 360964, 361107, 361109, 361146, 361147, 361148, 361149, 361150, 361151, 361152, 361153, 361154, 361155, 361156, 361157, 361158, 361159, 361160, 361161, 361163, 361164, 361165, 361166, 361167, 361168, 361169, 361170, 361171, 361172, 361173, 361174, 361175, 361176, 361183, 361185, 361190, 361305, 361338, 361340, 361983, 361984, 361985, 361987, 361988, 362120]

# Number of tasks
print(len(filtered_tasks))
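
Before fixing a resampling strategy, it can help to see which strategies occur in the listing and how often. A minimal sketch on a toy dataframe follows; the task IDs are invented, while the strategy names are ones that appear in the listings in this tutorial.

```python
import pandas as pd

# Toy listing with invented task IDs.
toy_tasks = pd.DataFrame(
    {
        "tid": [10, 11, 12, 13],
        "estimation_procedure": [
            "10-fold Crossvalidation",
            "33% Holdout set",
            "10-fold Crossvalidation",
            "10-fold Learning Curve",
        ],
    }
).set_index("tid")

# Count how many tasks use each resampling strategy.
counts = toy_tasks["estimation_procedure"].value_counts()
print(counts)

# Keep only the tasks that use the chosen strategy.
cv_tasks = toy_tasks[toy_tasks["estimation_procedure"] == "10-fold Crossvalidation"]
print(list(cv_tasks.index))
```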

Resampling strategies can be found on the OpenML Website.

Similar to listing tasks by task type, we can list tasks by tags:

tasks = openml.tasks.list_tasks(tag="OpenML100", output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

First 5 of 91 tasks:
    tid  ... NumberOfSymbolicFeatures
3     3  ...                       37
6     6  ...                        1
11   11  ...                        1
12   12  ...                        1
14   14  ...                        1

[5 rows x 19 columns]

Furthermore, we can list tasks based on the dataset id:

tasks = openml.tasks.list_tasks(data_id=1471, output_format="dataframe")
print(f"First 5 of {len(tasks)} tasks:")
print(tasks.head())

First 5 of 25 tasks:
         tid  ... number_samples
9983    9983  ...            NaN
14951  14951  ...            NaN
56483  56483  ...            NaN
56484  56484  ...            NaN
56485  56485  ...            NaN

[5 rows x 23 columns]

In addition, a size limit and an offset can be applied both separately and simultaneously:

tasks = openml.tasks.list_tasks(size=10, offset=50, output_format="dataframe")
print(tasks)

    tid  ... number_samples
 59  ...            NaN
 60  ...            NaN
 62  ...              9
 63  ...             12
 64  ...              1
 65  ...              7
 66  ...              5
 67  ...              6
 68  ...              5
 69  ...              4

[10 rows x 21 columns]
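
Combining size and offset allows paging through a long listing in fixed-size chunks. The sketch below shows the idea with a hypothetical helper, iter_pages, and an offline stand-in for the listing call; neither is part of OpenML-Python.

```python
from typing import Callable, Iterator, Sequence


def iter_pages(
    fetch_page: Callable[[int, int], Sequence],
    page_size: int,
    start: int = 0,
) -> Iterator[Sequence]:
    """Yield consecutive pages until a short or empty page signals the end."""
    offset = start
    while True:
        page = fetch_page(page_size, offset)
        if len(page) == 0:
            return
        yield page
        if len(page) < page_size:
            return
        offset += page_size


# Offline stand-in for a listing call: 25 fake task IDs served in slices.
fake_ids = list(range(100, 125))


def fake_fetch(size: int, offset: int) -> list:
    return fake_ids[offset : offset + size]


pages = list(iter_pages(fake_fetch, page_size=10))
print([len(p) for p in pages])
```

With OpenML, fetch_page could wrap openml.tasks.list_tasks(size=size, offset=offset, output_format="dataframe"). This assumes that an offset past the end of the listing yields an empty result rather than an error, which should be verified before relying on it.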

OpenML 100 is a curated list of 100 tasks for getting started with OpenML. They are all supervised classification tasks with more than 500 and fewer than 50000 instances per task. To make things easier, the tasks contain neither highly unbalanced nor sparse data. However, they do include missing values and categorical features. You can find out more about the OpenML 100 on the OpenML benchmarking page.

Finally, it is also possible to list all tasks on OpenML with:

tasks = openml.tasks.list_tasks(output_format="dataframe")
print(len(tasks))

Exercise

Search for the tasks on the ‘eeg-eye-state’ dataset.

tasks.query('name=="eeg-eye-state"')

	tid	ttid	did	name	task_type	status	estimation_procedure	evaluation_measures	source_data	target_feature	MajorityClassSize	MaxNominalAttDistinctValues	MinorityClassSize	NumberOfClasses	NumberOfFeatures	NumberOfInstances	NumberOfNumericFeatures	NumberOfSymbolicFeatures	number_samples	cost_matrix	source_data_labeled	target_feature_event	target_feature_left	target_feature_right	quality_measure	target_value
3511	9983	TaskType.SUPERVISED_CLASSIFICATION	1471	eeg-eye-state	Supervised Classification	active	10-fold Crossvalidation	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4692	14951	TaskType.SUPERVISED_CLASSIFICATION	1471	eeg-eye-state	Supervised Classification	active	10-fold Crossvalidation	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8032	56483	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Cortana Quality	1
8033	56484	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Information gain	1
8034	56485	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Binomial test	1
8035	56486	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Jaccard	1
8036	56487	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Cortana Quality	2
8037	56488	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Information gain	2
8038	56489	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Binomial test	2
8039	56490	TaskType.SUBGROUP_DISCOVERY	1471	eeg-eye-state	Subgroup Discovery	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Jaccard	2
8581	75219	TaskType.SUPERVISED_CLASSIFICATION	1471	eeg-eye-state	Supervised Classification	active	33% Holdout set	predictive_accuracy	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8663	125901	TaskType.LEARNING_CURVE	1471	eeg-eye-state	Learning Curve	active	10-fold Learning Curve	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	17	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9864	127251	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11920	146761	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	NaN	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13204	148117	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	NaN	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17309	170139	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20760	191646	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24446	213320	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27655	234435	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31979	256739	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31989	256750	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
37166	297708	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
40370	318835	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
43587	339990	TaskType.CLUSTERING	1471	eeg-eye-state	Clustering	active	50 times Clustering	NaN	1471	NaN	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
46851	361459	TaskType.SUPERVISED_CLASSIFICATION	1471	eeg-eye-state	Supervised Classification	active	5 times 2-fold Crossvalidation	NaN	1471	Class	8257.0	2.0	6723.0	2.0	15.0	14980.0	14.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Downloading tasks

We provide two functions to download tasks: get_task(), which downloads a single task by its ID, and get_tasks(), which takes a list of IDs and downloads all of these tasks:

task_id = 31
task = openml.tasks.get_task(task_id)

Properties of the task are stored as member variables:

print(task)

OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 31
Task URL.............: https://www.openml.org/t/31
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 2
Cost Matrix..........: Available
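
Individual properties can also be read directly from the task object. The sketch below gathers a few of them into a dictionary. The helper summarize_task is hypothetical, and the attribute names task_id, dataset_id, target_name and estimation_procedure are assumptions about the task object that should be checked against the installed OpenML-Python version. A stand-in object is used so that the sketch runs offline.

```python
from types import SimpleNamespace


def summarize_task(task) -> dict:
    """Collect a few task properties into a plain dictionary.

    The attribute names are assumptions about the task object; missing
    ones are reported as None instead of raising an error.
    """
    names = ["task_id", "dataset_id", "target_name", "estimation_procedure"]
    return {name: getattr(task, name, None) for name in names}


# Stand-in object so the sketch runs offline; with OpenML you would pass
# the object returned by openml.tasks.get_task(task_id) instead.
stub = SimpleNamespace(task_id=31, target_name="class")
summary = summarize_task(stub)
print(summary)
```

Using getattr with a default keeps the helper usable across task types that do not define every attribute, such as clustering tasks without a target.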

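To download several tasks at once, pass a list of IDs to get_tasks(). The UserWarnings in the output below can be silenced by setting download_data and download_qualities explicitly: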

ids = [2, 1891, 31, 9983]
tasks = openml.tasks.get_tasks(ids)
print(tasks[0])

/home/runner/work/openml-python/openml-python/openml/tasks/functions.py:372: UserWarning: `download_data` will default to False starting in 0.16. Please set `download_data` explicitly to suppress this warning.
  warnings.warn(
/home/runner/work/openml-python/openml-python/openml/tasks/functions.py:380: UserWarning: `download_qualities` will default to False starting in 0.16. Please set `download_qualities` explicitly to suppress this warning.
  warnings.warn(

OpenML Classification Task
==========================
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 2
Task URL.............: https://www.openml.org/t/2
Estimation Procedure.: crossvalidation
Evaluation Measure...: predictive_accuracy
Target Feature.......: class
# of Classes.........: 6
Cost Matrix..........: Available

Creating tasks

You can also create new tasks. Take the following into account:

You can only create tasks on active datasets.
For now, only the following tasks are supported: classification, regression, clustering, and learning curve analysis.
For now, tasks can only be created on a single dataset.
The exact same task must not already exist.

Creating a task requires the following input:

task_type: The task type ID (see below). Required.
dataset_id: The dataset ID. Required.
target_name: The name of the attribute you aim to predict. Optional.
estimation_procedure_id: The ID of the estimation procedure used to create train-test splits. Optional.
evaluation_measure: The name of the evaluation measure. Optional.
Any additional inputs for specific tasks

It is best to leave the evaluation measure open unless there is a strong reason to require a specific one. OpenML always computes all appropriate measures, and you can filter or sort results by your favourite measure afterwards. Only add an evaluation measure if necessary (e.g. when other measures make no sense), since doing so creates a new task, which scatters results across tasks.

We’ll use the test server for the rest of this tutorial.

Warning

This example uploads data. For that reason, it connects to the test server at test.openml.org. This prevents the main server from being crowded with example datasets, tasks, runs, and so on. Using the test server can affect the behaviour and performance of the OpenML-Python API.

openml.config.start_using_configuration_for_example()

/home/runner/work/openml-python/openml-python/examples/30_extended/tasks_tutorial.py:171: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!
  openml.config.start_using_configuration_for_example()

Example

Let’s create a classification task on a dataset. In this example we use the Iris dataset (ID 128 on the test server). We’ll use 10-fold cross-validation (ID=1) and predictive accuracy as the predefined measure (this can also be left open). If a task with these parameters already exists, the server raises an exception, which we handle by looking up the existing task. If no such task exists, a task will be created and the corresponding task_id will be returned.

try:
    my_task = openml.tasks.create_task(
        task_type=TaskType.SUPERVISED_CLASSIFICATION,
        dataset_id=128,
        target_name="class",
        evaluation_measure="predictive_accuracy",
        estimation_procedure_id=1,
    )
    my_task.publish()
except openml.exceptions.OpenMLServerException as e:
    # Error code for 'task already exists'
    if e.code == 614:
        # Look up the existing task
        tasks = openml.tasks.list_tasks(data_id=128, output_format="dataframe")
        tasks = tasks.query(
            'task_type == "Supervised Classification" '
            'and estimation_procedure == "10-fold Crossvalidation" '
            'and evaluation_measures == "predictive_accuracy"'
        )
        task_id = tasks.loc[:, "tid"].values[0]
        print("Task already exists. Task ID is", task_id)
    else:
        # Any other server error is unexpected, so re-raise it
        raise

# reverting to prod server
openml.config.stop_using_configuration_for_example()

/home/runner/work/openml-python/openml-python/openml/tasks/functions.py:286: RuntimeWarning: Could not create task type id for 10 due to error 10 is not a valid TaskType
  procs = _get_estimation_procedure_list()
/home/runner/work/openml-python/openml-python/openml/tasks/functions.py:286: RuntimeWarning: Could not create task type id for 11 due to error 11 is not a valid TaskType
  procs = _get_estimation_procedure_list()
/home/runner/work/openml-python/openml-python/openml/tasks/functions.py:235: RuntimeWarning: Could not create task type id for 10 due to error 10 is not a valid TaskType
  return __list_tasks(api_call=api_call, output_format=output_format)
Task already exists. Task ID is 1308
