Metadata definition

OpenML is at its core a meta-database, from which datasets, pipelines (flows), experiments (runs), and other entities can be downloaded and uploaded, all described using a clearly defined metadata standard. This document describes that standard: how entities are uploaded to OpenML and what the resulting database state will be.

🥐 Croissant

OpenML has partnered with MLCommons, Google, Kaggle, HuggingFace, and a consortium of other partners to define a new metadata standard for machine learning datasets: 🥐 Croissant! You can already download all OpenML datasets in the Croissant format, and we are working on further supporting and extending Croissant.
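
As a minimal sketch, a dataset's Croissant description can be loaded with MLCommons' mlcroissant package. The URL below is a placeholder for the dataset's Croissant JSON-LD (obtainable from the dataset page), not an official endpoint.

```python
import mlcroissant as mlc

# Placeholder: point this at the Croissant JSON-LD of an OpenML dataset,
# e.g. via the "Croissant" link on the dataset page.
croissant_url = "https://example.org/openml/dataset_61_croissant.json"

dataset = mlc.Dataset(jsonld=croissant_url)   # parse the Croissant metadata
print(dataset.metadata.name)                  # dataset-level metadata
for record_set in dataset.metadata.record_sets:
    print(record_set.name)                    # record sets declared in the metadata
```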

Below is the OpenML metadata standard for version 1 of the API.

Data

Data is uploaded through the function post data. The following files are needed:

  • description: An XML adhering to the XSD schema.
  • dataset: An ARFF file containing the data (optional; if not set, the description should contain a URL pointing to this file).

Uploading any other files will result in an error.
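
As a sketch, assuming version 1 of the REST API, the upload could look like this with Python's requests library; the endpoint URL and the api_key parameter are assumptions, and "YOUR_API_KEY" is a placeholder.

```python
import requests

# Minimal sketch; endpoint URL and api_key parameter are assumed, not normative.
files = {
    "description": open("dataset_description.xml", "rb"),  # XML adhering to the XSD schema
    "dataset": open("dataset.arff", "rb"),                  # optional if the description contains a URL
}
response = requests.post(
    "https://www.openml.org/api/v1/xml/data",
    params={"api_key": "YOUR_API_KEY"},
    files=files,
)
print(response.status_code, response.text)
```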

Tasks

Tasks are uploaded through the function post task. The following files are needed:

  • description: An XML adhering to the XSD schema.

Uploading any other files will result in an error.

The task file should contain several input fields: name/value combinations for the fields that the task type definition marks as relevant. There are several task type definitions, for example supervised classification.

Note that the task types themselves are flexible content (ideally, users can contribute task types), and their documents are therefore not part of the OpenML definition. The task type defines which input fields should be set when creating a task.

Duplicate tasks (i.e., same value for task_type_id and all input fields equal) will be rejected.

When creating a task, the API checks each of the input fields for legitimacy. (Todo: describe the checks and what they depend on.)
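
As an illustration of the name/value input structure described above, a task upload could look roughly as follows. The element names, the task_type_id value, and the input names "source_data" and "estimation_procedure" are assumptions; the XSD schema and the chosen task type definition are authoritative.

```python
import requests

# Illustrative only: element names, task_type_id, and input names are assumptions.
task_description = """<oml:task_inputs xmlns:oml="http://openml.org/openml">
  <oml:task_type_id>1</oml:task_type_id>
  <oml:input name="source_data">61</oml:input>
  <oml:input name="estimation_procedure">1</oml:input>
</oml:task_inputs>"""

response = requests.post(
    "https://www.openml.org/api/v1/xml/task",              # assumed v1 endpoint
    params={"api_key": "YOUR_API_KEY"},                     # placeholder
    files={"description": ("description.xml", task_description)},
)
print(response.status_code, response.text)
```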

Flow

Flows are uploaded through the function post flow. The following file is needed:

  • description: An XML adhering to the XSD schema.

Uploading any other files will result in an error.

Duplicate flows (i.e., same values for name and external_version) will be rejected.
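
A hedged sketch of a flow upload is shown below; the element set is illustrative (the XSD schema is authoritative), and re-posting the same name/external_version pair would be rejected as a duplicate.

```python
import requests

# Illustrative flow description; only name and external_version matter for the
# duplicate check, the remaining elements follow the XSD schema.
flow_description = """<oml:flow xmlns:oml="http://openml.org/openml">
  <oml:name>weka.J48</oml:name>
  <oml:external_version>Weka_3.9.0_1</oml:external_version>
  <oml:description>An example flow description.</oml:description>
</oml:flow>"""

response = requests.post(
    "https://www.openml.org/api/v1/xml/flow",               # assumed v1 endpoint
    params={"api_key": "YOUR_API_KEY"},                      # placeholder
    files={"description": ("description.xml", flow_description)},
)
print(response.status_code, response.text)
```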

Runs

Runs are uploaded through the function post run. The following files are needed:

  • description: An XML adhering to the XSD schema.
  • predictions: An ARFF file containing the predictions (optional, depending on the task).
  • trace: An ARFF file containing the run trace (optional, depending on the flow).

Uploading any other files will result in an error.
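
A sketch under the same assumptions as the data upload above (assumed v1 endpoint and api_key parameter); predictions and trace are only attached when the task or flow calls for them.

```python
import requests

# Attach only the files that apply: predictions depend on the task,
# the trace depends on the flow.
files = {"description": open("run_description.xml", "rb")}
files["predictions"] = open("predictions.arff", "rb")   # optional, depending on the task
# files["trace"] = open("trace.arff", "rb")             # optional, depending on the flow

response = requests.post(
    "https://www.openml.org/api/v1/xml/run",              # assumed v1 endpoint
    params={"api_key": "YOUR_API_KEY"},                    # placeholder
    files=files,
)
print(response.status_code, response.text)
```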

Predictions

The contents of the predictions file depend on the task type.

Task type: Supervised classification

Example predictions file

  • repeat NUMERIC
  • fold NUMERIC
  • row_id NUMERIC
  • confidence.{$classname}: optional. One column per class, giving the confidence for that class. Per row, these values should sum to 1 (precision 1e-6).
  • (proposal) decision_function.{$classname}: optional. One column per class, giving the decision function value for that class.
  • prediction {$classname}

Runs that have a different set of columns will be rejected.
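
For a task with classes "yes" and "no", the column set above could translate into a file like the one sketched here; the class names, relation name, and data row are illustrative assumptions.

```python
# Illustrative predictions file for a two-class task; class names, relation name,
# and the data row are assumptions, the column set follows the list above.
predictions_arff = """@RELATION openml_task_predictions

@ATTRIBUTE repeat NUMERIC
@ATTRIBUTE fold NUMERIC
@ATTRIBUTE row_id NUMERIC
@ATTRIBUTE confidence.yes NUMERIC
@ATTRIBUTE confidence.no NUMERIC
@ATTRIBUTE prediction {yes,no}

@DATA
0,0,0,0.9,0.1,yes
"""

with open("predictions.arff", "w") as f:
    f.write(predictions_arff)
```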

Trace

Example trace file

  • repeat: cross-validation repeat
  • fold: cross-validation fold
  • iteration: the index order within this repeat/fold combination
  • evaluation (float): the evaluation score obtained on the validation set
  • selected {True, False}: whether this was the selected hyperparameter configuration for this repeat/fold combination (exactly one row per repeat/fold should be True)
  • one column per optimized parameter, named after the parameter and prefixed with "parameter_"
  • setup_string: accepted for legacy reasons, but ignored by the default evaluation engine

Traces that have a different set of columns will be rejected.
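
For a search over a single hypothetical parameter C, the column set above could translate into a trace like the one sketched here; the parameter name, relation name, and data rows are illustrative assumptions.

```python
# Illustrative trace file; the parameter name "C", the relation name, and the data
# rows are assumptions, the column set follows the list above.
trace_arff = """@RELATION openml_run_trace

@ATTRIBUTE repeat NUMERIC
@ATTRIBUTE fold NUMERIC
@ATTRIBUTE iteration NUMERIC
@ATTRIBUTE evaluation NUMERIC
@ATTRIBUTE selected {True,False}
@ATTRIBUTE parameter_C NUMERIC
@ATTRIBUTE setup_string STRING

@DATA
0,0,0,0.81,False,0.1,'C=0.1'
0,0,1,0.87,True,1.0,'C=1.0'
"""

with open("trace.arff", "w") as f:
    f.write(trace_arff)
```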