openml_expdb Database

The openml_expdb database contains all experiment-related data: datasets, tasks, flows (implementations), runs, evaluations, and studies. The "expdb" part stands for "experiment database", the name used in Joaquin Vanschoren's thesis.

Some remarks which apply generally:

  • datetime fields use the format YYYY-MM-DD hh:mm:ss.
  • some varchar columns in production only hold a very limited set of values. If a description says "one of ...", only those values are present in the production database.

A few tables were never used or are no longer in use. They will be removed and are therefore not documented here: data_quality_interval, feature_quality, output_data (only present on the test server), algorithm, and algorithm_quality. For more information, see server-api#165.


Datasets

dataset

Stores dataset metadata including source, format, licensing, and upload information.

Column Type Optional Default References Description Example
did int unsigned No auto_increment Primary key (dataset ID). 42
uploader mediumint unsigned Yes NULL openml.users.id User who uploaded the dataset. 2
source int unsigned Yes NULL ID of the original source dataset if derived. Used only rarely. 3
name varchar(128) No Dataset name. iris
version varchar(64) No Version string, displayed on the dataset page. 3
version_label varchar(128) Yes NULL Human-readable label. tabular-ml-iid-study-0.0.1
format varchar(64) No 'arff' File format, one of: ARFF, Sparse_ARFF, CSV, "CSV, MAT", txt, Rimage ARFF
creator text Yes NULL Original creator(s) of the dataset. "Jason", "G. Davies, A. Horne"
contributor text Yes NULL People who contributed to the current version (e.g., formatting). "A. Dent", "The Data Institute"
collection_date varchar(128) Yes NULL When the data was originally collected, in any format. "1980", "2024-10-30", "2022, 5 May"
upload_date datetime No When the dataset was uploaded to OpenML. 2014-04-06 23:19:20
language varchar(128) Yes NULL Language of the dataset content. "En", "Dutch"
licence varchar(64) No 'Public' License under which the dataset is shared. Public
citation text Yes NULL How to cite this dataset, sometimes a link to the policy. Bareiss, E. Ray, & [...] Proceedings of [...]
collection varchar(64) Yes NULL Collection the dataset belongs to. Not used. NULL
url mediumtext No URL to download the dataset file. Most often links to OpenML, but not always. https://openml.org/data/download/22126628/test.arff
parquet_url mediumtext Yes NULL NOT IN PRODUCTION URL to Parquet version of the dataset.
isOriginal enum('true','false') Yes NULL Whether this is an original dataset (not derived). true
file_id int Yes NULL openml.file.id Reference to the uploaded file. 60
default_target_attribute varchar(1024) Yes NULL Name of the default target column. Allows csv. class
row_id_attribute varchar(128) Yes NULL Name of the row identifier column. Allows csv. id
ignore_attribute varchar(128) Yes NULL Columns to ignore during modeling. Allows csv. "animal", "'url_hash', 'query_id'"
paper_url mediumtext Yes NULL URL to an associated publication. https://arxiv.org/html/2402.55618v3
visibility varchar(128) No 'public' Visibility level (public, private, or friends). Note that non-public visibility levels are currently not supported. public
original_data_id int Yes NULL ID of the original dataset this was derived from. 23
original_data_url mediumtext Yes NULL URL to the original data source. https://zenodo.org/record/322475/files/bike.arff
update_comment text Yes NULL Comment explaining the latest update. fixed features
last_update datetime Yes NULL Timestamp of the last update. 2017-10-28 23:42:18

dataset_description

Stores versioned descriptions for datasets.

Column Type Optional Default References Description Example
did int unsigned No dataset.did Dataset this description belongs to. 42
version int unsigned No Description version number. 2
description text No The dataset description text. This dataset describes...
uploader mediumint unsigned No openml.users.id User who wrote this description. 3

dataset_status

Tracks the current status of a dataset. The absence of an entry in this table for a dataset means the dataset is "in preparation". Rows are removed from this table to express transitions (e.g., deactivated -> activate)! It does not provide a historical record of the dataset's status.

Allowed transitions are (as defined by the PHP implementation):

  • in preparation -> activate
  • in preparation -> deactivated
  • deactivated -> activate
Column Type Optional Default References Description Example
did int unsigned No dataset.did Dataset ID. 42
status enum('active','deactivated') No Status value. active
status_date datetime No When the status was set. 2022-04-10 11:10:42
user_id mediumint unsigned No openml.users.id User who changed the status. 1
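
The allowed transitions listed above can be sketched as a small lookup. This is an illustrative Python rendering of the rules stated in this section (the real checks live in the PHP implementation); the function name and the use of the enum value 'active' for the "activate" transition are assumptions here.

```python
# Allowed dataset status transitions, as documented above.
# A missing row in dataset_status means "in preparation"; the table
# itself only ever stores 'active' or 'deactivated'.
ALLOWED_TRANSITIONS = {
    "in preparation": {"active", "deactivated"},
    "deactivated": {"active"},
    "active": set(),  # no transition away from 'active' is documented
}

def can_transition(current: str, new: str) -> bool:
    """Return True if the documented rules allow this status change."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```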

dataset_tag

User-assigned tags on datasets for categorization and search. Note that historically, collections used tags (e.g., study_14 indicates the dataset is linked to study 14).

Column Type Optional Default References Description Example
id int unsigned No dataset.did Dataset ID. 3
tag varchar(255) No Tag string. Medicine, study_14
uploader mediumint unsigned No openml.users.id User who added the tag. 4
date timestamp No CURRENT_TIMESTAMP When the tag was added. 2018-11-04 08:57:17

dataset_topic

Assigns topic labels to datasets. Topics are displayed as tags on the web page. This is the result of an experiment in 2021 to try to categorize datasets to better facilitate search. Topics have been added by automated analysis.

Column Type Optional Default References Description Example
id int unsigned No dataset.did Dataset ID. 5
topic varchar(255) No Topic label. Health
uploader mediumint unsigned No openml.users.id User who assigned the topic. 8111
date timestamp No CURRENT_TIMESTAMP When the topic was assigned. 2021-06-01 12:32:45

Data Features and Qualities

data_feature

Stores metadata and statistics for each feature (column) of a dataset, as computed by an evaluation engine.

What's going on?

We need more documentation for the computed columns NumberOf* and ClassDistribution. From just the database values, it's unclear what's going on. For example, 'nominal' features have their NumberOfIntegerValues column populated.

Column Type Optional Default References Description Example
did int unsigned No dataset.did Dataset ID. 2
index int unsigned No Feature index (column position), 0-indexed. 0
evaluation_engine_id int No evaluation_engine.id Engine that computed the feature metadata. 1
name varchar(64) No Feature name. petal-width
data_type varchar(64) Yes NULL Data type (numeric, nominal, string, date, unknown). numeric
is_target enum('true','false') No 'false' Whether this is the default target feature. true
is_row_identifier enum('true','false') No 'false' Whether this feature is a row ID. false
is_ignore enum('true','false') No 'false' Whether this feature should be ignored. false
NumberOfDistinctValues int Yes NULL Number of distinct values. 14972
NumberOfUniqueValues int Yes NULL Number of values that appear exactly once. 0
NumberOfMissingValues int No Count of missing values. 124
NumberOfIntegerValues int Yes NULL Count of integer values.
NumberOfRealValues int Yes NULL Count of real-valued entries.
NumberOfNominalValues varchar(512) No Number of nominal categories (or distribution).
NumberOfValues int No Total number of values.
MaximumValue int Yes NULL Maximum value (for numeric features).
MinimumValue int Yes NULL Minimum value (for numeric features).
MeanValue int Yes NULL Mean value (for numeric features).
StandardDeviation int Yes NULL Standard deviation (for numeric features).
ClassDistribution text Yes NULL Only for nominal features, otherwise '[]'. See below. [["?","T"],[[7, 99, 586, 0, 67, 2],[1, 0, 98, 0, 0, 38]]]

ClassDistribution is [] for all features except nominal ones, and only populated if the dataset has a target feature that is also nominal (note that a dataset can have multiple target features). For nominal features, it is a U x C matrix where U is the number of unique values in the feature and C is the number of classes of the target feature of the dataset. Each cell specifies how often the feature value occurs for the specific target value. The behavior with multiple target features that are nominal is unspecified.
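
The U x C matrix semantics described above can be sketched as follows. This is a minimal illustration of the counting logic, not the evaluation engine's actual code; the function name and the sorted ordering of values and classes are assumptions.

```python
from collections import Counter

def class_distribution(feature_values, target_values):
    """Build the U x C matrix described above: one row per unique
    feature value (U), one column per target class (C), each cell
    counting co-occurrences. Returns (unique_values, matrix), mirroring
    the stored format [["?","T"], [[...],[...]]]."""
    classes = sorted(set(target_values))
    uniques = sorted(set(feature_values))
    counts = Counter(zip(feature_values, target_values))
    matrix = [[counts[(u, c)] for c in classes] for u in uniques]
    return uniques, matrix

# Illustrative: a nominal feature with values {"?", "T"} against a
# 3-class nominal target.
vals, m = class_distribution(
    ["?", "T", "?", "T", "?"],
    ["a", "b", "a", "a", "c"],
)
```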

data_feature_description

User-provided descriptions and ontology annotations for individual data features. The table is empty in production (as of 2026-04-29). This feature was added recently (2025ish) and should be considered experimental.

Column Type Optional Default References Description Example
did int unsigned No data_feature(did, index) Dataset ID. 2
index int unsigned No data_feature(did, index) Feature index (column position, 0-indexed). 0
uploader mediumint unsigned No User who added the description. 1548
date timestamp No CURRENT_TIMESTAMP When the description was added. 2025-12-12 09:29:30
description_type enum('plain','ontology') No Type of description. 'ontology'
value varchar(256) No The description or ontology URI. http://xmlns.com/foaf/0.1/#name

data_feature_value

Enumerates the distinct values of a feature (only for nominal features).

Column Type Optional Default References Description Example
did int unsigned No data_feature(did, index) Dataset ID. 2
index int unsigned No data_feature(did, index) Feature index. 3
value varchar(256) No A distinct value of the feature. 'tulip'

data_quality

Stores computed quality measures (meta-features) for datasets, such as number of instances or entropy. Measures are computed by an evaluation engine.

Column Type Optional Default References Description Example
data int unsigned No 0 dataset.did Dataset ID. 2
quality varchar(128) No quality.name Name of the quality measure. 'AutoCorrelation'
evaluation_engine_id int No evaluation_engine.id Engine that computed the quality. 1
value varchar(128) Yes NULL Computed value. 0.5
description text Yes NULL Additional description or notes.

data_processed

Tracks whether a dataset has been processed by an evaluation engine (feature extraction, quality computation). Note that this table is updated in place: retrying a failed dataset increments num_tries rather than adding a new row.

Column Type Optional Default References Description Example
did int unsigned No dataset.did Dataset ID. 1
evaluation_engine_id int No evaluation_engine.id Engine that processed the dataset. 1
user_id int No User who triggered processing. 1
processing_date datetime No When processing completed. 2020-02-04 12:45:01
error text Yes NULL Error message if processing failed. "keyword @relation expected, read Token[Id], line 1"
warning text Yes NULL Warning messages from processing. Always NULL on prod. NULL
num_tries int No 1 Number of processing attempts. 3

quality

Defines the available dataset quality measures (meta-features) and their properties.

Column Type Optional Default References Description Example
name varchar(128) No Primary key. Name of the quality measure. AutoCorrelation
type varchar(128) No 'DataProperty' Category, one of DataQuality, FeatureQuality, AlgorithmQuality. FeatureQuality
formula text Yes NULL Formula or computation method. Always NULL except for 1 row. Calculated using...
description text Yes NULL Human-readable description. Average class difference between consecutive instances.
datatype varchar(128) No 'undefined' Expected data type of the value. One of 'double', 'integer', 'undefined'.
min float Yes NULL Minimum possible value. 0
max float Yes NULL Maximum possible value. 1.0
unit varchar(32) Yes NULL Unit of measurement. NULL
priority int No 9999 Display priority (lower is higher priority). 2
showonline enum('true','false') No Whether to show on the website. true
date timestamp No CURRENT_TIMESTAMP When the quality was defined. 2014-12-31 23:00:00

Tasks

To more easily understand the way tasks are structured, we break down tasks into three components:

  • the task type: concept of what the task is (e.g., Supervised Classification) and the task_type_inout specifies the required and optional inputs and outputs for such a task. For example, a Supervised Classification task must have (among others) a target feature specified.
  • the estimation procedure: the experimental setup to evaluate model quality for the task (e.g., 10-fold Cross Validation)
  • the task together with its task_inputs specify concrete values for the task (e.g., 10-fold Cross Validation on Iris predicting Class).

For example, task 59 (10-fold Cross Validation on Iris) has the task row in the database (task_id=59, ttid=1, ...):

  1. It is of task type 1, Supervised Classification (from resolving its ttid column against the task_type table).
  2. When the task was created, the creator had to specify the following input since they are required for the task type as dictated by task_type_inout:
    • source data: to which dataset is the task linked?
    • estimation procedure: what is the experimental setup for the runs?
    • target feature: what is the target to predict?
  3. It could have optionally also specified the following properties, as indicated by the task_type_inout table: cost matrix, custom testset, evaluation measures.
  4. The task_inputs table is to specify what the corresponding values are for this task. E.g., the row with input "source_data" has value "61" specifying that dataset 61 is used. It similarly defines the estimation procedure, target feature, and also the optional input of "evaluation measure". That is to say, task_inputs primarily references other tables.
  5. The estimation procedure required by the task type, and given a value in task_inputs, must correspond to an entry in the estimation_procedure table. E.g., if we find (estimation_procedure, 1) as a task input, we know it is 10-fold Cross Validation (the row with id=1 in estimation_procedure).
  6. task_type_inout further specifies the outputs that experiments for the task may contain: evaluation measures, model, predictions. This is not used directly for tasks, but rather for runs.

So there are constraints from the task_inputs table to multiple other tables (dataset, estimation_procedure, ...) that are not explicitly present in the database. The task_inputs table is also expected to be populated with an entry for each required input defined by the task_type_inout table for that task type. The estimation_procedure_type table provides general descriptions for procedures (such as Hold out); they are matched by name even though the relationship is not explicit in the database.
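
The implicit joins walked through above can be sketched with in-memory rows standing in for SQL. The row contents below are illustrative (mirroring task 59 on dataset 61, iris, with "class" as target); the function name is hypothetical.

```python
# Illustrative rows mirroring the task 59 example above.
task = {"task_id": 59, "ttid": 1}
task_inputs = [
    {"task_id": 59, "input": "source_data", "value": "61"},
    {"task_id": 59, "input": "estimation_procedure", "value": "1"},
    {"task_id": 59, "input": "target_feature", "value": "class"},
]
estimation_procedure = {1: {"name": "10-fold cross validation"}}
dataset = {61: {"name": "iris"}}

def resolve(task_id):
    """Collect the concrete input values for a task: the joins that the
    database does not enforce with foreign keys."""
    inputs = {r["input"]: r["value"]
              for r in task_inputs if r["task_id"] == task_id}
    return {
        "dataset": dataset[int(inputs["source_data"])]["name"],
        "procedure": estimation_procedure[
            int(inputs["estimation_procedure"])]["name"],
        "target": inputs["target_feature"],
    }
```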

task_type

Defines the types of machine learning tasks available (e.g., classification, regression).

Column Type Optional Default References Description Example
ttid int No auto_increment Primary key (task type ID). 1
name varchar(128) No Task type name. Supervised Classification
description text No Description of the task type. Predict the value of a nominal feature given the other features.
creator varchar(128) No Who defined this task type. "Alice Smith, John Hickey"
contributors text Yes NULL Additional contributors. "Piet Houten, Betty Boss"
creationDate datetime No When the task type was created. 2017-09-28 11:23:09

task_type_inout

Defines the input and output specifications for each task type.

Column Type Optional Default References Description Example
ttid int No task_type.ttid Task type ID. 1
name varchar(64) No Name of the input/output parameter. See note. source_data
type varchar(64) No Data type of the parameter, from task_io_types. String, "Estimation Procedure"
io enum('input','output') No Whether this is an input or output. input
requirement enum('required','optional','hidden') No Whether this parameter is required. required
description varchar(256) No Description of the parameter. "This input is required to foo the bar."
order int No Display order used by the frontend. 29
api_constraints text Yes NULL API-level constraints on this parameter as JSON. See below.
template_api text Yes NULL Template for API representation. See below.
template_search text Yes NULL Template for search representation. See below.

The api_constraints, template_api and template_search columns contain values that are used by PHP to help format responses.

The api_constraints contains JSON with optionally some special instructions:

{
  "data_type": "string",
  "select": "name",
  "from": "data_feature",
  "where": "did = '[INPUT:source_data]' AND data_type = 'nominal'"
}
Production uses the following directives:

  • [INPUT:source_data]: look up the value of task_inputs.value where input="source_data" and task_id matches the task.
  • [TASK:ttid]: look up the value of task.ttid for that task.

The template_api contains XML instead:

<oml:estimation_procedure>
  <oml:id>[INPUT:estimation_procedure]</oml:id>
  <oml:type>[LOOKUP:estimation_procedure.type]</oml:type>
  <oml:data_splits_url>[CONSTANT:base_url]api_splits/get/[TASK:id]/Task_[TASK:id]_splits.arff</oml:data_splits_url>
  <oml:parameter name="number_repeats">[LOOKUP:estimation_procedure.repeats]</oml:parameter>
  <oml:parameter name="number_folds">[LOOKUP:estimation_procedure.folds]</oml:parameter>
  <oml:parameter name="percentage">[LOOKUP:estimation_procedure.percentage]</oml:parameter>
  <oml:parameter name="stratified_sampling">[LOOKUP:estimation_procedure.stratified_sampling]</oml:parameter>
</oml:estimation_procedure>
It additionally supports the directives:

  • [LOOKUP:*]: looks up the TABLE.column value associated with the task.
  • [CONSTANT:*]: look up constants configured in PHP, not in the database.
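
A sketch of how these directives could be expanded. The real resolution happens in PHP against the database and its configuration; the function name and the dict-based lookup tables are assumptions for illustration.

```python
import re

def expand(template, task_row, task_inputs, lookups, constants):
    """Expand [INPUT:*], [TASK:*], [LOOKUP:*] and [CONSTANT:*]
    directives in an api_constraints / template_api string."""
    def sub(match):
        kind, arg = match.group(1), match.group(2)
        if kind == "INPUT":      # value from task_inputs for this task
            return task_inputs[arg]
        if kind == "TASK":       # column of the task row itself
            return str(task_row[arg])
        if kind == "LOOKUP":     # TABLE.column value tied to the task
            return str(lookups[arg])
        if kind == "CONSTANT":   # constants configured in PHP
            return constants[arg]
        raise ValueError(kind)
    return re.sub(r"\[(INPUT|TASK|LOOKUP|CONSTANT):([^\]]+)\]",
                  sub, template)
```

For example, expanding the where-clause from the api_constraints snippet above for task 59 substitutes the task's source_data input into the query.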

Finally, the template_search contains JSON again:

{
  "name": "Dataset(s)",
  "autocomplete": "commaSeparated",
  "datasource": "expdbDatasetVersion()",
  "placeholder": "(*) include all datasets"
}

The expdbDatasetVersion() function is no longer used.

task_io_types

Defines the valid types for task inputs and outputs. Used in the task_type_inout table.

Column Type Optional Default References Description Example
name varchar(64) No Primary key. Type name. String
description text No Description of this I/O type. A string, possibly contains csv values.

task

Represents a specific machine learning task.

Column Type Optional Default References Description Example
task_id int No auto_increment Primary key. 59
ttid int No task_type.ttid Task type. 1
creator mediumint unsigned Yes NULL openml.users.id User who created the task. 1
creation_date datetime Yes NULL When the task was created. 2014-11-02 03:12:15
embargo_end_date datetime Yes NULL End date of any data embargo. 2025-10-30 23:52:28

task_inputs

Stores the input parameter values for a specific task instance.

Column Type Optional Default References Description Example
task_id int No task.task_id Task ID. 59
input varchar(64) No Input parameter name. source_data
value text No Input parameter value. 61

Valid input values depend on the task_type_inout for the task_type that's specified in the task.

task_tag

User-assigned tags on tasks.

Column Type Optional Default References Description Example
id int No task.task_id Task ID. 59
tag varchar(255) No Tag string. uci
uploader mediumint unsigned No openml.users.id User who added the tag. 2
date timestamp No CURRENT_TIMESTAMP When the tag was added. 2022-10-09 17:18:19

estimation_procedure_type

Defines the types of estimation procedures (e.g., cross-validation, holdout). Variations are defined in the estimation_procedure table. E.g., this table describes "cross validation" and estimation_procedure describes 10-fold cross validation, 5-fold cross validation, and so on.

Column Type Optional Default References Description Example
name varchar(64) No Primary key. Procedure type name. crossvalidation
description text No Description of the procedure type. a process where the dataset is divided into folds, ...

estimation_procedure

Defines specific estimation procedure configurations used in tasks.

Column Type Optional Default References Description Example
id int No auto_increment Primary key. 1
ttid int No task_type.ttid Task type this procedure applies to. 1
name varchar(128) No Procedure name. 10-fold cross validation
type enum('crossvalidation','leaveoneout','holdout','holdout_ordered','bootstrapping','subsampling','testthentrain','holdoutunlabeled','customholdout','testontrainingdata') No Procedure type. crossvalidation
repeats int Yes NULL Number of repetitions. 1
folds int Yes NULL Number of folds. 10
samples enum('false','true') No 'false' Whether learning curve samples are used. false
percentage int Yes NULL Train/test split percentage. NULL
stratified_sampling enum('true','false') Yes NULL Whether stratified sampling is used. true
custom_testset enum('true','false') No 'false' Whether a custom test set is provided. false
date timestamp No CURRENT_TIMESTAMP When the procedure was created. 2020-02-20 20:02:20

Flows (Implementations)

The database uses the historical name 'implementation' for a flow.

implementation

Stores machine learning flows (algorithms/pipelines) that can be executed on tasks.

Column Type Optional Default References Description Example
id int No auto_increment Primary key (flow ID). 1
fullName varchar(1024) No Fully qualified name (name + version). sklearn.tree.DecisionTreeClassifier(1)
uploader mediumint unsigned Yes NULL openml.users.id User who uploaded the flow. 2
name varchar(1024) No Flow name. sklearn.tree.DecisionTreeClassifier
custom_name varchar(256) Yes NULL User-defined display name. Tree
class_name varchar(256) Yes NULL Class name in the source library. nl.liacs.subdisc.SubgroupDiscovery
version int No Internal version number. 1
external_version varchar(128) No Version string from the source library. "sklearn=0.12.3,numpy=0.22.1"
creator varchar(128) Yes NULL Original creator of the algorithm. Arlo Knoppen
contributor text Yes NULL Additional contributors. "Marcel Alder"
uploadDate datetime No When the flow was uploaded. 2015-12-21 18:28:38
licence varchar(64) Yes NULL License. public domain
language varchar(128) Yes NULL Language the description is written in. Sometimes used to specify programming language instead. English
description text Yes NULL Short description. Common Decision Tree algorithm
fullDescription text Yes NULL Full description. This algorithm partitions the data...
installationNotes text Yes NULL Installation instructions. Not really used. Runs on OpenML
dependencies text Yes NULL Required dependencies. mlr_2.3
implements varchar(128) Yes NULL Algorithm or standard this flow implements. Seems to be legacy build_cpu_time
binary_file_id int Yes NULL File ID of the compiled binary. 32
source_file_id int Yes NULL File ID of the source code. 33
visibility varchar(128) No 'public' Visibility level. One of 'public' or 'private' public
citation text Yes NULL How to cite this flow. Only used once. "A. Boo et al., Journal of ..."

implementation_component

Defines parent-child relationships between flows (e.g., a pipeline containing sub-flows).

Column Type Optional Default References Description Example
parent int No implementation.id Parent flow ID. 1
child int No implementation.id Child (component) flow ID. 2
identifier varchar(1024) Yes NULL Role or name of the component within the parent. PCA, scaler
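
A flow's full component tree can be rebuilt by recursing over these (parent, child, identifier) rows. The rows below are illustrative, and the function name is hypothetical.

```python
# Illustrative implementation_component rows: flow 1 is a pipeline with
# a scaler (flow 2) and a PCA step (flow 3), which itself nests flow 4.
rows = [
    {"parent": 1, "child": 2, "identifier": "scaler"},
    {"parent": 1, "child": 3, "identifier": "PCA"},
    {"parent": 3, "child": 4, "identifier": "solver"},
]

def component_tree(flow_id, rows):
    """Map each direct component's identifier to its flow id and its
    own (recursively resolved) components."""
    return {
        r["identifier"]: {"id": r["child"],
                          "components": component_tree(r["child"], rows)}
        for r in rows if r["parent"] == flow_id
    }
```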

implementation_tag

User-assigned tags on flows.

Column Type Optional Default References Description Example
id int No implementation.id Flow ID. 2
tag varchar(255) No Tag string. Verified_Supervised_Classification
uploader mediumint unsigned No openml.users.id User who added the tag. 2
date timestamp No CURRENT_TIMESTAMP When the tag was added. 2020-06-08 09:20:28

input

Defines the hyperparameters (input parameters) of a flow.

Column Type Optional Default References Description Example
id int No auto_increment Primary key. 2
implementation_id int No implementation.id Flow this parameter belongs to. 1
name varchar(512) Yes NULL Parameter name. C
description text Yes NULL Parameter description. Regularization parameter.
dataType varchar(255) Yes NULL Expected data type. float
defaultValue text Yes NULL Default value. 1.0
recommendedRange text Yes NULL Recommended value range. 10e-3 to 10e3

Setups

Setups are really part of a run: they describe the algorithm (flow) configuration used in the run. They are not a separate entity. Each row in the run table refers to a setup, and a setup may be shared by multiple runs.

algorithm_setup

Represents a specific configuration (hyperparameter setting) of a flow, including nested component setups.

Column Type Optional Default References Description Example
sid int unsigned No auto_increment Primary key (setup ID). 1
parent int unsigned No Parent setup ID (for nested components). NULL
implementation_id int No implementation.id Flow this setup configures. 3
algorithm varchar(128) Yes NULL Algorithm name. Always NULL in prod! NULL
role varchar(64) No 'Learner' Role of this component (e.g., Learner, Preprocessor). learner
isDefault enum('true','false') Yes 'false' Whether this is the default setup. true
algorithm_structure varchar(64) Yes NULL Not sure. Always NULL in prod. NULL
setup_string text Yes NULL Serialized setup string. Can aid reproducing the run. "weka.classifiers.trees.J48 -- -C 0.25 -M 2"

input_setting

Stores the actual hyperparameter values for a specific setup.

Column Type Optional Default References Description Example
setup int unsigned No 0 algorithm_setup.sid Setup ID. 1
input varchar(128) No Parameter name. Only old setups have this. 428_I
input_id int No input.id Reference to the parameter definition. 4124
value varchar(2048) No Parameter value. 1000, gini

setup_tag

User-assigned tags on setups. Setups cannot really be tagged by users, only by admins. Runs should be tagged by users instead.

These tags are only used in the filtering of /setups/list, they are not returned with a setup or run.

Column Type Optional Default References Description Example
id int unsigned No algorithm_setup.sid Setup ID. 42
tag varchar(255) No Tag string. test_setup
uploader mediumint unsigned No openml.users.id User who added the tag. 2
date timestamp No CURRENT_TIMESTAMP When the tag was added. 2024-12-01 23:04:57

setup_differences

Precomputed pairwise differences between setups on specific tasks. Not entirely sure what this means, but seems to be out of use (last setup included is from a run in 2019).

Column Type Optional Default References Description Example
sidA int No First setup ID. 2
sidB int No Second setup ID. 3
task_id int No Task ID. 1245
task_size int No Size of the task dataset. 100000
differences int No ??? 812987

Runs and Evaluations

run

Represents the execution of a flow setup on a task.

Column Type Optional Default References Description Example
rid int unsigned No auto_increment Primary key (run ID). 2
uploader mediumint unsigned Yes NULL openml.users.id User who uploaded the run. 16
setup int unsigned No 0 algorithm_setup.sid Setup (hyperparameter configuration) used. 2894
task_id int Yes NULL task.task_id Task the run was executed on. 284
start_time datetime Yes NULL When the run started. 2018-07-31 10:57:42
error_message text Yes NULL Error message if the run failed. "weka.classifiers.bayes.HNB: Cannot handle numeric attributes!"
run_details text Yes NULL Additional run details or logs. Not really used. "Custom:Hyperparameter"
visibility varchar(128) No 'public' Visibility level, 'public' or 'private'. public

run_evaluated

Tracks whether a run has been evaluated by an evaluation engine. Currently only one evaluation engine is evaluating runs.

Column Type Optional Default References Description Example
run_id int unsigned No run.rid Run ID. 24
evaluation_engine_id int No evaluation_engine.id Engine that evaluated the run. 1
user_id int No User who triggered evaluation. 381
evaluation_date datetime No When evaluation completed. 2024-02-04 08:09:27
error text Yes NULL Error message if evaluation failed. "Index 1 out of bounds for length 1"
warning text Yes NULL Warning messages from evaluation. "Inconsistent Evaluation score: ..."
num_tries int No 1 Number of evaluation attempts. 1

runfile

Stores metadata about files associated with a run (e.g., predictions, model files).

Column Type Optional Default References Description Example
source int unsigned No run.rid Run ID. 28
field varchar(128) No File field name (e.g., predictions, model serialized). predictions
name varchar(128) No Original filename. weka_generated_run5258986433356798974.xml
format varchar(128) No File format. xml
upload_time datetime Yes NULL When the file was uploaded. 2017-05-29 19:28:56
file_id int No openml.file.id Reference to the file record. 152

run_tag

User-assigned tags on runs.

Column Type Optional Default References Description Example
id int unsigned No run.rid Run ID. 1521
tag varchar(255) No Tag string. micro_benchmark
uploader mediumint unsigned No openml.users.id User who added the tag. 124
date timestamp No CURRENT_TIMESTAMP When the tag was added. 2024-10-24 22:58:18

input_data

Records the input datasets used by a run. Used by PHP, but arguably they could use the relationship run->task->task_inputs->"source_data" instead.

Column Type Optional Default References Description Example
run int No Run ID. 21
data int No Dataset ID. 42
name varchar(128) No 'inputdata' Always 'dataset' in practice, despite the default. dataset

evaluation_engine

Defines the evaluation engines that compute metrics on runs and datasets. Currently only has one row, as we only ever used one engine in production.

Column Type Optional Default References Description Example
id int No auto_increment Primary key. 1
name varchar(256) No Engine name. weka_engine
description text No Engine description. "Default OpenML evaluation engine"

math_function

Defines evaluation metrics (e.g., accuracy, AUC, RMSE) and their properties.

Column Type Optional Default References Description Example
id int No auto_increment Primary key. 1
name varchar(64) No Metric name (unique). EuclidianDistance
functionType varchar(128) No 'EvaluationFunction' Type of function. One of Metric, KernelFunction, or EvaluationFunction Metric
min varchar(64) No Minimum possible value. 0
max varchar(64) No Maximum possible value. ''
unit varchar(64) No Unit of measurement. seconds, bytes
higherIsBetter varchar(16) Yes NULL Whether higher values indicate better performance. true, Yes, 1, False
description text Yes NULL Description of the metric. "The area under the ROC..."
source_code text No Source code or formula. "public double truePositiveRate(..."
date timestamp No CURRENT_TIMESTAMP When the function was defined. 2024-04-19 23:51:52

evaluation

Stores aggregated evaluation results for a run (one value per metric).

Column Type Optional Default References Description Example
source int unsigned No run.rid Run ID. 1
function_id int No math_function.id Evaluation metric. 4
evaluation_engine_id int No evaluation_engine.id Engine that computed the evaluation. 1
value double Yes NULL Aggregated metric value. 0.839
stdev double Yes NULL Standard deviation across folds/repeats. 0.06
array_data text Yes NULL Per-class or detailed results as array. "[0.0,0.99113,0.898048,0.874862,0.791282,0.807343,0.820674]"

evaluation_fold

Stores per-fold evaluation results for a run.

Column Type Optional Default References Description Example
source int unsigned No run.rid Run ID. 12
function_id int No math_function.id Evaluation metric. 4
evaluation_engine_id int No evaluation_engine.id Engine that computed the evaluation. 1
fold int unsigned No 0 Fold number. 0
repeat int unsigned No 0 Repeat number. 0
value double Yes NULL Metric value for this fold/repeat. 0.5
array_data text Yes NULL Per-class or detailed results as array. "[0.4, 0.6]"
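
The relation between evaluation_fold and the aggregated evaluation table can be sketched as below: the aggregated row holds the mean of the per-fold values and their spread. Whether the engine uses the population or sample standard deviation is an assumption here, as is the function name.

```python
from statistics import mean, pstdev

def aggregate(fold_values):
    """Collapse per-fold metric values (evaluation_fold.value) into the
    aggregated (value, stdev) pair stored in the evaluation table.
    Population stdev is an assumption, not confirmed by this document."""
    return {"value": mean(fold_values), "stdev": pstdev(fold_values)}
```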

evaluation_sample

Stores per-sample evaluation results (for learning curves). Evaluation samples seem to be part of a run upload, but might not be able to be retrieved with the current PHP API.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| source | int unsigned | No | | run.rid | Run ID. | 12 |
| function_id | int | No | | math_function.id | Evaluation metric. | 4 |
| evaluation_engine_id | int | No | | evaluation_engine.id | Engine that computed the evaluation. | 1 |
| repeat | int unsigned | No | 0 | | Repeat number. | 0 |
| fold | int unsigned | No | 0 | | Fold number. | 0 |
| sample | int unsigned | No | 0 | | Sample index. | 0 |
| sample_size | int | No | | | Number of training instances in this sample. | 100 |
| value | double | Yes | NULL | | Metric value for this sample. | 0.5 |
| array_data | text | Yes | NULL | | Per-class or detailed results as array. | "[0.4,0.6]" |
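
The per-sample rows form a learning curve: for a fixed run, metric, repeat, and fold, the (sample_size, value) pairs trace performance against training-set size. A minimal sketch with SQLite and invented numbers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified `evaluation_sample` table (types and constraints trimmed).
cur.execute("""CREATE TABLE evaluation_sample (
    source INT, function_id INT, repeat INT, fold INT,
    sample INT, sample_size INT, value REAL)""")

# Hypothetical learning-curve points for run 12, metric 4, repeat 0, fold 0.
points = [(12, 4, 0, 0, 0, 100, 0.50),
          (12, 4, 0, 0, 1, 200, 0.65),
          (12, 4, 0, 0, 2, 400, 0.72)]
cur.executemany("INSERT INTO evaluation_sample VALUES (?, ?, ?, ?, ?, ?, ?)", points)

# The curve is the (sample_size, value) series ordered by sample index.
curve = cur.execute("""SELECT sample_size, value FROM evaluation_sample
                       WHERE source = 12 AND function_id = 4
                         AND repeat = 0 AND fold = 0
                       ORDER BY sample""").fetchall()
print(curve)
```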

trace

Stores optimization traces (e.g., hyperparameter search iterations) for a run.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| run_id | int unsigned | No | | run.rid | Run ID. | 2 |
| evaluation_engine_id | int | No | 1 | evaluation_engine.id | Engine that evaluated the trace. | 1 |
| repeat | int | No | | | Repeat number. | 0 |
| fold | int | No | | | Fold number. | 0 |
| iteration | int | No | | | Iteration number in the optimization. | 0 |
| setup_string | text | No | | | Hyperparameter configuration tried. | '{"parameter_minNumObj":"1","parameter_confidenceFactor":"0.1"}' |
| evaluation | varchar(265) | No | | | Evaluation result for this iteration. | 94.12 |
| selected | enum('true','false') | No | | | Whether this was the selected configuration. | true |
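
Because `selected` marks the winning configuration, the chosen hyperparameters can be recovered with a simple filter. A sketch with SQLite, simplified columns, and invented trace rows:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified `trace` table (types and constraints trimmed).
cur.execute("""CREATE TABLE trace (
    run_id INT, repeat INT, fold INT, iteration INT,
    setup_string TEXT, evaluation TEXT, selected TEXT)""")

# Hypothetical search over two confidenceFactor values in repeat 0, fold 0.
rows = [
    (2, 0, 0, 0, '{"parameter_confidenceFactor": "0.1"}', '94.12', 'false'),
    (2, 0, 0, 1, '{"parameter_confidenceFactor": "0.25"}', '95.30', 'true'),
]
cur.executemany("INSERT INTO trace VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# The configuration the optimizer kept is the row with selected = 'true'.
winner = cur.execute("""SELECT setup_string, evaluation FROM trace
                        WHERE run_id = 2 AND selected = 'true'""").fetchone()
print(json.loads(winner[0]), winner[1])
```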

Studies

Studies have historically been in flux, but they are generally collections of objects (e.g., tasks). One major change during OpenML's lifetime, around 2019, was in how those collections are defined. Nowadays there are dedicated join tables (e.g., task_study) that connect a study to its tasks. Historically, in what is now referred to as a "legacy study", this association was achieved through tags: for example, all tasks tagged study_14 were considered part of study 14. Some studies are still defined this way, and migration of these studies to new-style studies is planned. The Python-based REST API will not support legacy-style studies, so this migration is needed to make the legacy studies usable with the new API.

The current usage is that a "benchmark suite" refers to a collection of tasks and a "benchmark study" to a collection of runs. Both are stored in the study table. This distinction will likely be dropped in the future in favor of a more general "collection".
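
The two membership styles can be contrasted directly. In the sketch below (SQLite, made-up IDs), the `task_tag` table is an assumption standing in for the per-entity tag tables that legacy studies rely on:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Minimal sketches of the relevant tables (types and constraints trimmed).
cur.execute("CREATE TABLE study (id INT, alias TEXT, legacy TEXT)")
cur.execute("CREATE TABLE task_study (study_id INT, task_id INT)")
cur.execute("CREATE TABLE task_tag (id INT, tag TEXT)")  # assumed tag table

cur.execute("INSERT INTO study VALUES (14, 'old_study', 'y')")
cur.execute("INSERT INTO study VALUES (99, 'new_study', 'n')")
cur.execute("INSERT INTO task_tag VALUES (31, 'study_14')")  # legacy: membership via tag
cur.execute("INSERT INTO task_study VALUES (99, 31)")        # new style: membership via join table

# Legacy study 14: every task tagged 'study_14' belongs to it.
legacy_tasks = cur.execute("SELECT id FROM task_tag WHERE tag = 'study_14'").fetchall()
# New-style study 99: tasks linked through the join table.
new_tasks = cur.execute("SELECT task_id FROM task_study WHERE study_id = 99").fetchall()
print(legacy_tasks, new_tasks)
```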

study

Represents a collection of runs or tasks, used for benchmarking and reproducibility.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | | Primary key (study ID). | 2 |
| alias | varchar(32) | Yes | NULL | | Short unique alias for the study. | automl_benchmark |
| main_entity_type | enum('run','task') | No | 'run' | | Whether the study collects runs or tasks. | task |
| benchmark_suite | int | Yes | NULL | study.id | Reference to a task study used as benchmark suite. | 2 |
| name | varchar(256) | No | | | Study name. | "A friendly benchmarking suite for AutoML systems" |
| description | text | No | | | Study description. | "See also JMLR..." |
| visibility | varchar(64) | No | 'public' | | Visibility level. Only ever 'public'. | public |
| status | enum('in_preparation','active','deactivated') | No | 'in_preparation' | | Current status. | active |
| creation_date | datetime | No | | | When the study was created. | 2021-02-04 20:48:58 |
| creator | mediumint unsigned | No | | openml.users.id | User who created the study. | 3 |
| legacy | enum('y','n') | No | 'y' | | Whether this is a legacy study. | y |

study_tag

Tags applied to studies, with optional time windows and access control. It is unclear why this table has a different design from the other tag tables, but it likely relates to the "legacy" study setup.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | | study.id | Study ID. | 28 |
| tag | varchar(255) | No | | | Tag string. | published, jmlr |
| window_start | datetime | Yes | NULL | | Start of the tag's validity window. | 2020-01-01 12:23:42 |
| window_end | datetime | Yes | NULL | | End of the tag's validity window. | 2021-04-11 19:54:24 |
| write_access | enum('private','public') | No | 'private' | | Who can add this tag. | private |

run_study

A join table for the "new" style studies. Associates runs with studies (many-to-many).

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | | study.id | Study ID. | 2 |
| run_id | int unsigned | No | | run.rid | Run ID. | 102984 |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the run to the study. | 1815 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the run was added. | 2024-05-12 23:10:54 |

task_study

A join table for the "new" style studies. Associates tasks with studies (many-to-many).

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | | study.id | Study ID. | 5 |
| task_id | int | No | | task.task_id | Task ID. | 18257 |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the task to the study. | 12 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the task was added. | 2022-04-28 08:28:49 |

Community

The awarded_badges, likes, downvotes, and downvote_reasons tables are not currently in use. All but awarded_badges may be reintroduced with the new frontend.

likes

Records user likes on datasets, flows, runs, or other entities.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| lid | int | No | auto_increment | | Unique ID. | 1 |
| knowledge_type | varchar(1) | No | | | Entity type code (e.g., 'd' for dataset). | d, f, r, t |
| knowledge_id | int | No | | | Entity ID. | 20248 |
| user_id | mediumint | No | | | User who liked the entity. | 241 |
| time | timestamp | No | CURRENT_TIMESTAMP | | When the like was recorded. | 2019-09-08 17:47:57 |

downvotes

Records user downvotes on entities with a reason.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int | No | auto_increment | | Unique ID. | |
| knowledge_type | varchar(1) | No | | | Entity type code. | |
| knowledge_id | int | No | | | Entity ID. | |
| user_id | mediumint | No | | | User who downvoted. | |
| reason | int | No | | | Reason for the downvote (references downvote_reasons). | |
| time | timestamp | No | CURRENT_TIMESTAMP | | When the downvote was recorded. | |
| original | tinyint | No | 0 | | Whether this is the original downvote (vs. an update). | |

downvote_reasons

Defines the reasons for downvoting an entity.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| reason_id | int | No | auto_increment | | Primary key. | |
| description | varchar(256) | No | | | Description of the reason. | |

downloads

Tracks download counts per user and entity.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int | No | auto_increment | | Unique ID. | |
| knowledge_type | varchar(1) | No | | | Entity type code. | |
| knowledge_id | int | No | | | Entity ID. | |
| user_id | mediumint | No | | | User who downloaded. | |
| count | smallint | No | 1 | | Number of downloads. | |
| time | timestamp | No | CURRENT_TIMESTAMP | | Last download time. | |

Miscellaneous

The notebook and pdnresults tables are no longer in use.

schedule

Defines scheduled experiment jobs to be executed. This table appears to have been unused since 2017.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| sid | int | No | | | Setup ID. | |
| task_id | int | No | | | Task ID to run. | |
| experiment | varchar(128) | No | | | Experiment identifier. | |
| active | enum('true','false') | No | 'true' | | Whether the schedule is active. | |
| last_assigned | datetime | Yes | NULL | | When this job was last assigned to a worker. | |
| ttid | int | No | | | Task type ID. | |
| dependencies | varchar(128) | No | | | Dependency identifiers. | |
| setup_string | text | No | | | Serialized setup configuration. | |

kaggle

Maps OpenML datasets to Kaggle competitions or datasets. Defined by hand in collaboration with Kaggle. Only used by the frontend.

| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| dataset_id | int | Yes | NULL | | OpenML dataset ID. | 6 |
| kaggle_link | varchar(500) | Yes | NULL | | URL to the Kaggle page. | https://www.kaggle.com/datasets/nishan192/letterrecognition-using-svm |