openml_expdb Database
The openml_expdb database contains all experiment-related data: datasets, tasks, flows (implementations), runs, evaluations, and studies.
The "expdb" part stands for "experiment database", the name used in Joaquin Vanschoren's thesis.
Some remarks which apply generally:
- `datetime` fields are in the format `YYYY-MM-DD hh:mm:ss`.
- some `varchar` columns in production only have a very limited set of values. If the description says "one of ..", only those values are present in the production database.

There are a few tables which never were, or are no longer, in use.
They will be removed and so are not documented here. They are: data_quality_interval, feature_quality, output_data (only present on the test server), algorithm, and algorithm_quality.
For more information, see server-api#165.
Datasets
dataset
Stores dataset metadata including source, format, licensing, and upload information.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | auto_increment | | Primary key (dataset ID). | 42 |
| uploader | mediumint unsigned | Yes | NULL | openml.users.id | User who uploaded the dataset. | 2 |
| source | int unsigned | Yes | NULL | | ID of the original source dataset if derived. Used only rarely. | 3 |
| name | varchar(128) | No | | | Dataset name. | iris |
| version | varchar(64) | No | | | Version string, displayed on the dataset page. | 3 |
| version_label | varchar(128) | Yes | NULL | | Human-readable version label. | tabular-ml-iid-study-0.0.1 |
| format | varchar(64) | No | 'arff' | | File format, one of: ARFF, Sparse_ARFF, CSV, "CSV, MAT", txt, Rimage. | ARFF |
| creator | text | Yes | NULL | | Original creator(s) of the dataset. | "Jason", "G. Davies, A. Horne" |
| contributor | text | Yes | NULL | | People who contributed to the current version (e.g., formatting). | "A. Dent", "The Data Institute" |
| collection_date | varchar(128) | Yes | NULL | | When the data was originally collected, in any format. | "1980", "2024-10-30", "2022, 5 May" |
| upload_date | datetime | No | | | When the dataset was uploaded to OpenML. | 2014-04-06 23:19:20 |
| language | varchar(128) | Yes | NULL | | Language of the dataset content. | "En", "Dutch" |
| licence | varchar(64) | No | 'Public' | | License under which the dataset is shared. | Public |
| citation | text | Yes | NULL | | How to cite this dataset, sometimes a link to the policy. | Bareiss, E. Ray, & [...] Proceedings of [...] |
| collection | varchar(64) | Yes | NULL | | Collection the dataset belongs to. Not used. | NULL |
| url | mediumtext | No | | | URL to download the dataset file. Most often links to OpenML, but not always. | https://openml.org/data/download/22126628/test.arff |
| parquet_url | mediumtext | Yes | NULL | | NOT IN PRODUCTION. URL to the Parquet version of the dataset. | |
| isOriginal | enum('true','false') | Yes | NULL | | Whether this is an original dataset (not derived). | true |
| file_id | int | Yes | NULL | openml.file.id | Reference to the uploaded file. | 60 |
| default_target_attribute | varchar(1024) | Yes | NULL | | Name of the default target column. May contain comma-separated values. | class |
| row_id_attribute | varchar(128) | Yes | NULL | | Name of the row identifier column. May contain comma-separated values. | id |
| ignore_attribute | varchar(128) | Yes | NULL | | Columns to ignore during modeling. May contain comma-separated values. | "animal", "'url_hash', 'query_id'" |
| paper_url | mediumtext | Yes | NULL | | URL to an associated publication. | https://arxiv.org/html/2402.55618v3 |
| visibility | varchar(128) | No | 'public' | | Visibility level (public, private, or friends). Note that non-public visibility is currently not supported. | public |
| original_data_id | int | Yes | NULL | | ID of the original dataset this was derived from. | 23 |
| original_data_url | mediumtext | Yes | NULL | | URL to the original data source. | https://zenodo.org/record/322475/files/bike.arff |
| update_comment | text | Yes | NULL | | Comment explaining the latest update. | fixed features |
| last_update | datetime | Yes | NULL | | Timestamp of the last update. | 2017-10-28 23:42:18 |
dataset_description
Stores versioned descriptions for datasets.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | dataset.did | Dataset this description belongs to. | 42 |
| version | int unsigned | No | | | Description version number. | 2 |
| description | text | No | | | The dataset description text. | This dataset describes... |
| uploader | mediumint unsigned | No | | openml.users.id | User who wrote this description. | 3 |
dataset_status
Tracks the current status of a dataset. The absence of an entry in this table denotes that the dataset is "in preparation". Rows are removed from this table to indicate transitions (deactivated -> activate), so it does not provide a historical record of the dataset status.
Allowed transitions are (as defined by the PHP implementation):
- in preparation -> activate
- in preparation -> deactivated
- deactivated -> activate
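These allowed transitions amount to a simple lookup. A minimal Python sketch (the function and constant names are illustrative, not OpenML code; state names use the enum values, with the absent-row state written as "in preparation"):

```python
# Allowed dataset status transitions, mirroring the rules above.
# A dataset with no dataset_status row is modeled as "in preparation".
ALLOWED_TRANSITIONS = {
    ("in preparation", "active"),
    ("in preparation", "deactivated"),
    ("deactivated", "active"),
}

def can_transition(current: str, new: str) -> bool:
    """Return True if a dataset may move from `current` to `new` status."""
    return (current, new) in ALLOWED_TRANSITIONS
```

Note that, per the table description, applying a transition means deleting the old row and inserting a new one, so only the latest state is stored.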
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | dataset.did | Dataset ID. | 42 |
| status | enum('active','deactivated') | No | | | Status value. | active |
| status_date | datetime | No | | | When the status was set. | 2022-04-10 11:10:42 |
| user_id | mediumint unsigned | No | | openml.users.id | User who changed the status. | 1 |
dataset_tag
User-assigned tags on datasets for categorization and search.
Note that historically, collections used tags (e.g., study_14 indicates the dataset is linked to study 14).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int unsigned | No | | dataset.did | Dataset ID. | 3 |
| tag | varchar(255) | No | | | Tag string. | Medicine, study_14 |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the tag. | 4 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2018-11-04 08:57:17 |
dataset_topic
Assigns topic labels to datasets. Topics are displayed as tags on the web page. This is the result of an experiment in 2021 to try to categorize datasets to better facilitate search. Topics have been added by automated analysis.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int unsigned | No | | dataset.did | Dataset ID. | 5 |
| topic | varchar(255) | No | | | Topic label. | Health |
| uploader | mediumint unsigned | No | | openml.users.id | User who assigned the topic. | 8111 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the topic was assigned. | 2021-06-01 12:32:45 |
Data Features and Qualities
data_feature
Stores metadata and statistics for each feature (column) of a dataset, as computed by an evaluation engine.
What's going on?
We need more documentation for the computed columns NumberOf* and ClassDistribution.
From just the database values, it's unclear what's going on.
For example, 'nominal' features have their NumberOfIntegerValues column populated.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | dataset.did | Dataset ID. | 2 |
| index | int unsigned | No | | | Feature index (column position), 0-indexed. | 0 |
| evaluation_engine_id | int | No | | evaluation_engine.id | Engine that computed the feature metadata. | 1 |
| name | varchar(64) | No | | | Feature name. | petal-width |
| data_type | varchar(64) | Yes | NULL | | Data type (numeric, nominal, string, date, unknown). | numeric |
| is_target | enum('true','false') | No | 'false' | | Whether this is the default target feature. | true |
| is_row_identifier | enum('true','false') | No | 'false' | | Whether this feature is a row ID. | false |
| is_ignore | enum('true','false') | No | 'false' | | Whether this feature should be ignored. | false |
| NumberOfDistinctValues | int | Yes | NULL | | Number of distinct values. | 14972 |
| NumberOfUniqueValues | int | Yes | NULL | | Number of values that appear exactly once. | 0 |
| NumberOfMissingValues | int | No | | | Count of missing values. | 124 |
| NumberOfIntegerValues | int | Yes | NULL | | Count of integer values. | |
| NumberOfRealValues | int | Yes | NULL | | Count of real-valued entries. | |
| NumberOfNominalValues | varchar(512) | No | | | Number of nominal categories (or distribution). | |
| NumberOfValues | int | No | | | Total number of values. | |
| MaximumValue | int | Yes | NULL | | Maximum value (for numeric features). | |
| MinimumValue | int | Yes | NULL | | Minimum value (for numeric features). | |
| MeanValue | int | Yes | NULL | | Mean value (for numeric features). | |
| StandardDeviation | int | Yes | NULL | | Standard deviation (for numeric features). | |
| ClassDistribution | text | Yes | NULL | | Only for nominal features, otherwise '[]'. See below. | [["?","T"],[[7, 99, 586, 0, 67, 2],[1, 0, 98, 0, 0, 38]]] |
ClassDistribution is [] for all features except nominal ones, and only populated if the dataset has a target feature that is also nominal (note that a dataset can have multiple target features).
For nominal features, it is a U x C matrix where U is the number of unique values in the feature and C is the number of classes of the target feature of the dataset.
Each cell specifies how often the feature value occurs for the specific target value.
The behavior with multiple target features that are nominal is unspecified.
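The U x C co-occurrence matrix can be reconstructed from the raw columns. A minimal illustrative sketch (not the evaluation engine's actual code):

```python
def class_distribution(feature_values, target_values):
    """Build the U x C matrix described above: for each unique feature
    value, count how often it co-occurs with each target class."""
    unique_values = sorted(set(feature_values))
    classes = sorted(set(target_values))
    matrix = [[0] * len(classes) for _ in unique_values]
    for fv, tv in zip(feature_values, target_values):
        matrix[unique_values.index(fv)][classes.index(tv)] += 1
    return unique_values, classes, matrix

values, classes, m = class_distribution(
    ["red", "red", "blue", "blue"],   # nominal feature column
    ["yes", "no", "yes", "yes"],      # nominal target column
)
# values = ["blue", "red"], classes = ["no", "yes"]
# m = [[0, 2], [1, 1]]  (blue: 0x "no", 2x "yes"; red: 1x "no", 1x "yes")
```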
data_feature_description
User-provided descriptions and ontology annotations for individual data features. The table is empty in production (as of 2026-04-29); this feature was added recently (around 2025) and should be considered experimental.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | data_feature(did, index) | Dataset ID. | 2 |
| index | int unsigned | No | | data_feature(did, index) | Feature index (column position, 0-indexed). | 0 |
| uploader | mediumint unsigned | No | | | User who added the description. | 1548 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the description was added. | 2025-12-12 09:29:30 |
| description_type | enum('plain','ontology') | No | | | Type of description. | 'ontology' |
| value | varchar(256) | No | | | The description or ontology URI. | http://xmlns.com/foaf/0.1/#name |
data_feature_value
Enumerates the distinct values of a feature (only for nominal features).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | data_feature(did, index) | Dataset ID. | 2 |
| index | int unsigned | No | | data_feature(did, index) | Feature index. | 3 |
| value | varchar(256) | No | | | A distinct value of the feature. | 'tulip' |
data_quality
Stores computed quality measures (meta-features) for datasets, such as number of instances or entropy. Measures are computed by an evaluation engine.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| data | int unsigned | No | 0 | dataset.did | Dataset ID. | 2 |
| quality | varchar(128) | No | | quality.name | Name of the quality measure. | 'AutoCorrelation' |
| evaluation_engine_id | int | No | | evaluation_engine.id | Engine that computed the quality. | 1 |
| value | varchar(128) | Yes | NULL | | Computed value. | 0.5 |
| description | text | Yes | NULL | | Additional description or notes. | |
data_processed
Tracks whether a dataset has been processed by an evaluation engine (feature extraction, quality computation).
Note that this table is updated in place: retrying a failed processing run increments num_tries rather than adding a new row.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int unsigned | No | | dataset.did | Dataset ID. | 1 |
| evaluation_engine_id | int | No | | evaluation_engine.id | Engine that processed the dataset. | 1 |
| user_id | int | No | | | User who triggered processing. | 1 |
| processing_date | datetime | No | | | When processing completed. | 2020-02-04 12:45:01 |
| error | text | Yes | NULL | | Error message if processing failed. | "keyword @relation expected, read Token[Id], line 1" |
| warning | text | Yes | NULL | | Warning messages from processing. Always NULL on prod. | NULL |
| num_tries | int | No | 1 | | Number of processing attempts. | 3 |
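The in-place retry bookkeeping can be expressed as an upsert. A minimal sketch against an in-memory SQLite stand-in for this table (simplified columns; production runs MySQL, and the exact engine/PHP logic may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_processed (
        did INTEGER,
        evaluation_engine_id INTEGER,
        error TEXT,
        num_tries INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (did, evaluation_engine_id)
    )
""")

def record_attempt(did, engine_id, error=None):
    # The first attempt inserts a row; a retry updates that row in place
    # and increments num_tries instead of adding a new row.
    conn.execute("""
        INSERT INTO data_processed (did, evaluation_engine_id, error)
        VALUES (?, ?, ?)
        ON CONFLICT (did, evaluation_engine_id) DO UPDATE SET
            error = excluded.error,
            num_tries = num_tries + 1
    """, (did, engine_id, error))

record_attempt(1, 1, error="parse error")   # failed first attempt
record_attempt(1, 1, error=None)            # retry succeeds
row = conn.execute(
    "SELECT error, num_tries FROM data_processed WHERE did = 1"
).fetchone()
# row == (None, 2)
```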
quality
Defines the available dataset quality measures (meta-features) and their properties.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| name | varchar(128) | No | | | Primary key. Name of the quality measure. | AutoCorrelation |
| type | varchar(128) | No | 'DataProperty' | | Category, one of DataQuality, FeatureQuality, AlgorithmQuality. | FeatureQuality |
| formula | text | Yes | NULL | | Formula or computation method. Always NULL except for 1 row. | Calculated using... |
| description | text | Yes | NULL | | Human-readable description. | Average class difference between consecutive instances. |
| datatype | varchar(128) | No | 'undefined' | | Expected data type of the value. One of 'double', 'integer', 'undefined'. | |
| min | float | Yes | NULL | | Minimum possible value. | 0 |
| max | float | Yes | NULL | | Maximum possible value. | 1.0 |
| unit | varchar(32) | Yes | NULL | | Unit of measurement. | NULL |
| priority | int | No | 9999 | | Display priority (lower is higher priority). | 2 |
| showonline | enum('true','false') | No | | | Whether to show on the website. | true |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the quality was defined. | 2014-12-31 23:00:00 |
Tasks
To more easily understand the way tasks are structured, we break down tasks into three components:

- the task type: the concept of what the task is (e.g., Supervised Classification); the `task_type_inout` table specifies the required and optional inputs and outputs for such a task. For example, a Supervised Classification task must have (among others) a target feature specified.
- the estimation procedure: the experimental setup to evaluate model quality for the task (e.g., 10-fold Cross Validation).
- the `task` together with its `task_inputs` specifies concrete values for the task (e.g., 10-fold Cross Validation on Iris predicting Class).
For example, task 59 (10-fold Cross Validation on Iris) has a task row in the database (task_id=59, ttid=1, ...):

- It is of task type 1, Supervised Classification (from resolving its `ttid` column against the `task_type` table).
- When the task was created, the creator had to specify the following inputs, since they are `required` for the task type as dictated by `task_type_inout`:
  - source data: to which dataset is the task linked?
  - estimation procedure: what is the experimental setup for the runs?
  - target feature: what is the target to predict?
- They could optionally also have specified the following properties, as indicated by the `task_type_inout` table: cost matrix, custom testset, evaluation measures.
- The `task_inputs` table specifies the corresponding values for this task. E.g., the row with input "source_data" has value "61", specifying that dataset 61 is used. It similarly defines the estimation procedure, target feature, and the optional input "evaluation measure". That is to say, `task_inputs` primarily references other tables.
- The estimation procedure required by the task type, and given a value in `task_inputs`, must correspond to an entry in the `estimation_procedure` table. E.g., if we find (estimation_procedure, 1) as a task input, we know it is 10-fold Cross Validation (the row with id=1 in `estimation_procedure`).
- `task_type_inout` further specifies that the `output` of experiments on the task may contain: evaluation measures, model, predictions. This is not directly used for tasks, but rather for runs.
So there are constraints from the `task_inputs` table to multiple other tables (dataset, estimation_procedure, ...) that are not explicitly present in the database.
The `task_inputs` table is also expected to be populated with an entry for each input that the `task_type_inout` table defines for the task's type.
The `estimation_procedure_type` table provides general descriptions for procedures (such as holdout); they are matched by name, even though the relationship is not explicit in the database.
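The lookup chain above can be sketched against an in-memory SQLite stand-in for the relevant tables, using toy rows for task 59 (simplified columns; production runs MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (task_id INTEGER PRIMARY KEY, ttid INTEGER);
    CREATE TABLE task_inputs (task_id INTEGER, input TEXT, value TEXT);
    CREATE TABLE estimation_procedure (id INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO task VALUES (59, 1);
    INSERT INTO task_inputs VALUES
        (59, 'source_data', '61'),
        (59, 'estimation_procedure', '1'),
        (59, 'target_feature', 'class');
    INSERT INTO estimation_procedure VALUES (1, '10-fold cross validation');
""")

# Resolve the estimation procedure of task 59: task_inputs only stores the
# ID as text; the name lives in estimation_procedure. This join is exactly
# the implicit constraint described above.
row = conn.execute("""
    SELECT ep.name
    FROM task_inputs ti
    JOIN estimation_procedure ep ON ep.id = CAST(ti.value AS INTEGER)
    WHERE ti.task_id = 59 AND ti.input = 'estimation_procedure'
""").fetchone()
# row == ('10-fold cross validation',)
```

The same pattern applies to "source_data" (joined against dataset.did) and any other input that references another table.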
task_type
Defines the types of machine learning tasks available (e.g., classification, regression).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| ttid | int | No | auto_increment | | Primary key (task type ID). | 1 |
| name | varchar(128) | No | | | Task type name. | Supervised Classification |
| description | text | No | | | Description of the task type. | Predict the value of a nominal feature given the other features. |
| creator | varchar(128) | No | | | Who defined this task type. | "Alice Smith, John Hickey" |
| contributors | text | Yes | NULL | | Additional contributors. | "Piet Houten, Betty Boss" |
| creationDate | datetime | No | | | When the task type was created. | 2017-09-28 11:23:09 |
task_type_inout
Defines the input and output specifications for each task type.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| ttid | int | No | | task_type.ttid | Task type ID. | 1 |
| name | varchar(64) | No | | | Name of the input/output parameter. See note. | source_data |
| type | varchar(64) | No | | | Data type of the parameter, from task_io_types. | String, "Estimation Procedure" |
| io | enum('input','output') | No | | | Whether this is an input or output. | input |
| requirement | enum('required','optional','hidden') | No | | | Whether this parameter is required. | required |
| description | varchar(256) | No | | | Description of the parameter. | "This input is required to foo the bar." |
| order | int | No | | | Display order used by the frontend. | 29 |
| api_constraints | text | Yes | NULL | | API-level constraints on this parameter as JSON. | See below. |
| template_api | text | Yes | NULL | | Template for API representation. | See below. |
| template_search | text | Yes | NULL | | Template for search representation. | See below. |
The api_constraints, template_api and template_search columns contain values that are used by PHP to help format responses.
The api_constraints contains JSON with optionally some special instructions:
```json
{
    "data_type": "string",
    "select": "name",
    "from": "data_feature",
    "where": "did = '[INPUT:source_data]' AND data_type = 'nominal'"
}
```

- `[INPUT:source_data]`: look up the value of `task_inputs.value` where `input = "source_data"` and `task_id` matches the task.
- `[TASK:ttid]`: look up the value of `task.ttid` for that task.
The template_api contains XML instead:
```xml
<oml:estimation_procedure>
  <oml:id>[INPUT:estimation_procedure]</oml:id>
  <oml:type>[LOOKUP:estimation_procedure.type]</oml:type>
  <oml:data_splits_url>[CONSTANT:base_url]api_splits/get/[TASK:id]/Task_[TASK:id]_splits.arff</oml:data_splits_url>
  <oml:parameter name="number_repeats">[LOOKUP:estimation_procedure.repeats]</oml:parameter>
  <oml:parameter name="number_folds">[LOOKUP:estimation_procedure.folds]</oml:parameter>
  <oml:parameter name="percentage">[LOOKUP:estimation_procedure.percentage]</oml:parameter>
  <oml:parameter name="stratified_sampling">[LOOKUP:estimation_procedure.stratified_sampling]</oml:parameter>
</oml:estimation_procedure>
```

- `[LOOKUP:*]`: looks up the `TABLE.row` value associated with the task.
- `[CONSTANT:*]`: looks up constants configured in PHP, not in the database.
Finally, the template_search contains JSON again:
```json
{
    "name": "Dataset(s)",
    "autocomplete": "commaSeparated",
    "datasource": "expdbDatasetVersion()",
    "placeholder": "(*) include all datasets"
}
```
The expdbDatasetVersion() function is no longer used.
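The placeholder expansion performed on these templates amounts to string substitution. A minimal illustrative sketch (the resolver mapping and the base_url value are assumptions for the example, not the actual PHP implementation):

```python
import re

def expand_template(template: str, resolvers: dict) -> str:
    """Replace [KIND:key] placeholders using per-kind resolver functions."""
    def substitute(match):
        kind, key = match.group(1), match.group(2)
        return str(resolvers[kind](key))
    return re.sub(r"\[([A-Z]+):([^\]]+)\]", substitute, template)

# Toy resolvers standing in for the database/config lookups described above.
resolvers = {
    "INPUT": {"estimation_procedure": 1}.get,            # task_inputs lookup
    "TASK": {"id": 59}.get,                              # task table lookup
    "CONSTANT": {"base_url": "https://www.openml.org/"}.get,  # PHP config
}

url = expand_template(
    "[CONSTANT:base_url]api_splits/get/[TASK:id]/Task_[TASK:id]_splits.arff",
    resolvers,
)
# url == "https://www.openml.org/api_splits/get/59/Task_59_splits.arff"
```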
task_io_types
Defines the valid types for task inputs and outputs.
Used in the task_type_inout table.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| name | varchar(64) | No | | | Primary key. Type name. | String |
| description | text | No | | | Description of this I/O type. | A string, possibly contains csv values. |
task
Represents a specific machine learning task.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| task_id | int | No | auto_increment | | Primary key. | 59 |
| ttid | int | No | | task_type.ttid | Task type. | 1 |
| creator | mediumint unsigned | Yes | NULL | openml.users.id | User who created the task. | 1 |
| creation_date | datetime | Yes | NULL | | When the task was created. | 2014-11-02 03:12:15 |
| embargo_end_date | datetime | Yes | NULL | | End date of any data embargo. | 2025-10-30 23:52:28 |
task_inputs
Stores the input parameter values for a specific task instance.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| task_id | int | No | | task.task_id | Task ID. | 59 |
| input | varchar(64) | No | | | Input parameter name. | source_data |
| value | text | No | | | Input parameter value. | 61 |
Valid input names and values depend on the `task_type_inout` rows for the task type specified on the task.
task_tag
User-assigned tags on tasks.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | | task.task_id | Task ID. | 59 |
| tag | varchar(255) | No | | | Tag string. | uci |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the tag. | 2 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2022-10-09 17:18:19 |
estimation_procedure_type
Defines the types of estimation procedures (e.g., cross-validation, holdout).
Variations are defined in the estimation_procedure table.
E.g., this table describes "cross validation" and estimation_procedure describes 10-fold cross validation, 5-fold cross validation, and so on.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| name | varchar(64) | No | | | Primary key. Procedure type name. | crossvalidation |
| description | text | No | | | Description of the procedure type. | a process where the dataset is divided into folds, ... |
estimation_procedure
Defines specific estimation procedure configurations used in tasks.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | | Primary key. | 1 |
| ttid | int | No | | task_type.ttid | Task type this procedure applies to. | 1 |
| name | varchar(128) | No | | | Procedure name. | 10-fold cross validation |
| type | enum('crossvalidation','leaveoneout','holdout','holdout_ordered','bootstrapping','subsampling','testthentrain','holdoutunlabeled','customholdout','testontrainingdata') | No | | | Procedure type. | crossvalidation |
| repeats | int | Yes | NULL | | Number of repetitions. | 1 |
| folds | int | Yes | NULL | | Number of folds. | 10 |
| samples | enum('false','true') | No | 'false' | | Whether learning curve samples are used. | false |
| percentage | int | Yes | NULL | | Train/test split percentage. | NULL |
| stratified_sampling | enum('true','false') | Yes | NULL | | Whether stratified sampling is used. | true |
| custom_testset | enum('true','false') | No | 'false' | | Whether a custom test set is provided. | false |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the procedure was created. | 2020-02-20 20:02:20 |
Flows (Implementations)
The database uses the historical name 'implementation' for a flow.
implementation
Stores machine learning flows (algorithms/pipelines) that can be executed on tasks.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | | Primary key (flow ID). | 1 |
| fullName | varchar(1024) | No | | | Fully qualified name (name + version). | sklearn.tree.DecisionTreeClassifier(1) |
| uploader | mediumint unsigned | Yes | NULL | openml.users.id | User who uploaded the flow. | 2 |
| name | varchar(1024) | No | | | Flow name. | sklearn.tree.DecisionTreeClassifier |
| custom_name | varchar(256) | Yes | NULL | | User-defined display name. | Tree |
| class_name | varchar(256) | Yes | NULL | | Class name in the source library. | nl.liacs.subdisc.SubgroupDiscovery |
| version | int | No | | | Internal version number. | 1 |
| external_version | varchar(128) | No | | | Version string from the source library. | "sklearn=0.12.3,numpy=0.22.1" |
| creator | varchar(128) | Yes | NULL | | Original creator of the algorithm. | Arlo Knoppen |
| contributor | text | Yes | NULL | | Additional contributors. | "Marcel Alder" |
| uploadDate | datetime | No | | | When the flow was uploaded. | 2015-12-21 18:28:38 |
| licence | varchar(64) | Yes | NULL | | License. | public domain |
| language | varchar(128) | Yes | NULL | | Language the description is written in. Sometimes used to specify the programming language instead. | English |
| description | text | Yes | NULL | | Short description. | Common Decision Tree algorithm |
| fullDescription | text | Yes | NULL | | Full description. | This algorithm partitions the data... |
| installationNotes | text | Yes | NULL | | Installation instructions. Not really used. | Runs on OpenML |
| dependencies | text | Yes | NULL | | Required dependencies. | mlr_2.3 |
| implements | varchar(128) | Yes | NULL | | Algorithm or standard this flow implements. Seems to be legacy. | build_cpu_time |
| binary_file_id | int | Yes | NULL | | File ID of the compiled binary. | 32 |
| source_file_id | int | Yes | NULL | | File ID of the source code. | 33 |
| visibility | varchar(128) | No | 'public' | | Visibility level. One of 'public' or 'private'. | public |
| citation | text | Yes | NULL | | How to cite this flow. Only used once. | "A. Boo et al., Journal of ..." |
implementation_component
Defines parent-child relationships between flows (e.g., a pipeline containing sub-flows).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| parent | int | No | | implementation.id | Parent flow ID. | 1 |
| child | int | No | | implementation.id | Child (component) flow ID. | 2 |
| identifier | varchar(1024) | Yes | NULL | | Role or name of the component within the parent. | PCA, scaler |
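Reconstructing a flow's nested structure means recursively following the parent -> child rows. A minimal illustrative sketch with toy rows (not an OpenML API):

```python
def build_flow_tree(flow_id, components):
    """Recursively nest sub-flows using (parent, child, identifier) rows
    as stored in implementation_component."""
    return {
        "id": flow_id,
        "components": {
            identifier: build_flow_tree(child, components)
            for parent, child, identifier in components
            if parent == flow_id
        },
    }

# Toy data: flow 1 is a pipeline containing a scaler (flow 2) and a
# classifier (flow 3); the classifier wraps a base estimator (flow 4).
rows = [(1, 2, "scaler"), (1, 3, "classifier"), (3, 4, "estimator")]
tree = build_flow_tree(1, rows)
# tree["components"]["classifier"]["components"]["estimator"]["id"] == 4
```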
implementation_tag
User-assigned tags on flows.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | | implementation.id | Flow ID. | 2 |
| tag | varchar(255) | No | | | Tag string. | Verified_Supervised_Classification |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the tag. | 2 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2020-06-08 09:20:28 |
input
Defines the hyperparameters (input parameters) of a flow.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | | Primary key. | 2 |
| implementation_id | int | No | | implementation.id | Flow this parameter belongs to. | 1 |
| name | varchar(512) | Yes | NULL | | Parameter name. | C |
| description | text | Yes | NULL | | Parameter description. | Regularization parameter. |
| dataType | varchar(255) | Yes | NULL | | Expected data type. | float |
| defaultValue | text | Yes | NULL | | Default value. | 1.0 |
| recommendedRange | text | Yes | NULL | | Recommended value range. | 10e-3 to 10e3 |
Setups
Setups are really part of a run, they specifically describe the algorithm (flow) configuration used in the run.
They are not a separate entity. Each row in the run table refers to a setup. A setup may be shared by multiple runs.
algorithm_setup
Represents a specific configuration (hyperparameter setting) of a flow, including nested component setups.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| sid | int unsigned | No | auto_increment | | Primary key (setup ID). | 1 |
| parent | int unsigned | No | | | Parent setup ID (for nested components). | NULL |
| implementation_id | int | No | | implementation.id | Flow this setup configures. | 3 |
| algorithm | varchar(128) | Yes | NULL | | Algorithm name. Always NULL in prod! | NULL |
| role | varchar(64) | No | 'Learner' | | Role of this component (e.g., Learner, Preprocessor). | learner |
| isDefault | enum('true','false') | Yes | 'false' | | Whether this is the default setup. | true |
| algorithm_structure | varchar(64) | Yes | NULL | | Purpose unclear. Always NULL in prod. | NULL |
| setup_string | text | Yes | NULL | | Serialized setup string. Can aid in reproducing the run. | "weka.classifiers.trees.J48 -- -C 0.25 -M 2" |
input_setting
Stores the actual hyperparameter values for a specific setup.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| setup | int unsigned | No | 0 | algorithm_setup.sid | Setup ID. | 1 |
| input | varchar(128) | No | | | Parameter name. Only old setups have this. | 428_I |
| input_id | int | No | | input.id | Reference to the parameter definition. | 4124 |
| value | varchar(2048) | No | | | Parameter value. | 1000, gini |
setup_tag
User-assigned tags on setups. Setups cannot really be tagged by regular users, only by admins; runs should be tagged by users instead.
These tags are only used in the filtering of /setups/list, they are not returned with a setup or run.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int unsigned | No | | algorithm_setup.sid | Setup ID. | 42 |
| tag | varchar(255) | No | | | Tag string. | test_setup |
| uploader | mediumint unsigned | No | | openml.users.id | User who added the tag. | 2 |
| date | timestamp | No | CURRENT_TIMESTAMP | | When the tag was added. | 2024-12-01 23:04:57 |
setup_differences
Precomputed pairwise differences between setups on specific tasks. Not entirely sure what this means, but seems to be out of use (last setup included is from a run in 2019).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| sidA | int | No | | | First setup ID. | 2 |
| sidB | int | No | | | Second setup ID. | 3 |
| task_id | int | No | | | Task ID. | 1245 |
| task_size | int | No | | | Size of the task dataset. | 100000 |
| differences | int | No | | | ??? | 812987 |
Runs and Evaluations
run
Represents the execution of a flow setup on a task.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| rid | int unsigned | No | auto_increment | | Primary key (run ID). | 2 |
| uploader | mediumint unsigned | Yes | NULL | openml.users.id | User who uploaded the run. | 16 |
| setup | int unsigned | No | 0 | algorithm_setup.sid | Setup (hyperparameter configuration) used. | 2894 |
| task_id | int | Yes | NULL | task.task_id | Task the run was executed on. | 284 |
| start_time | datetime | Yes | NULL | | When the run started. | 2018-07-31 10:57:42 |
| error_message | text | Yes | NULL | | Error message if the run failed. | "weka.classifiers.bayes.HNB: Cannot handle numeric attributes!" |
| run_details | text | Yes | NULL | | Additional run details or logs. Not really used. | "Custom:Hyperparameter" |
| visibility | varchar(128) | No | 'public' | | Visibility level, 'public' or 'private'. | public |
run_evaluated
Tracks whether a run has been evaluated by an evaluation engine. Currently only one evaluation engine is evaluating runs.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| run_id | int unsigned | No | | run.rid | Run ID. | 24 |
| evaluation_engine_id | int | No | | evaluation_engine.id | Engine that evaluated the run. | 1 |
| user_id | int | No | | | User who triggered evaluation. | 381 |
| evaluation_date | datetime | No | | | When evaluation completed. | 2024-02-04 08:09:27 |
| error | text | Yes | NULL | | Error message if evaluation failed. | "Index 1 out of bounds for length 1" |
| warning | text | Yes | NULL | | Warning messages from evaluation. | "Inconsistent Evaluation score: ..." |
| num_tries | int | No | 1 | | Number of evaluation attempts. | 1 |
runfile
Stores metadata about files associated with a run (e.g., predictions, model files).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| source | int unsigned | No | run.rid | Run ID. | 28 | |
| field | varchar(128) | No | File field name (e.g., predictions, model serialized). | predictions | ||
| name | varchar(128) | No | Original filename. | weka_generated_run5258986433356798974.xml | ||
| format | varchar(128) | No | File format. | xml | ||
| upload_time | datetime | Yes | NULL | When the file was uploaded. | 2017-05-29 19:28:56 | |
| file_id | int | No | openml.file.id | Reference to the file record. | 152 |
run_tag
User-assigned tags on runs.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int unsigned | No | run.rid | Run ID. | 1521 | |
| tag | varchar(255) | No | Tag string. | micro_benchmark | ||
| uploader | mediumint unsigned | No | openml.users.id | User who added the tag. | 124 | |
| date | timestamp | No | CURRENT_TIMESTAMP | When the tag was added. | 2024-10-24 22:58:18 |
input_data
Records the input datasets used by a run. Used by the PHP API, though arguably the same information could be derived via the relationship run → task → task_inputs → "source_data" instead.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| run | int | No | Run ID. | 21 | ||
| data | int | No | Dataset ID. | 42 | ||
| name | varchar(128) | No | 'inputdata' | Always 'dataset' | dataset |
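The alternative lookup suggested above can be sketched with a small in-memory example. This is a sketch, not the production MySQL schema: the shape of task_inputs (input/value columns) is an assumption based on this document's description, and all values are made up.

```python
import sqlite3

# In-memory SQLite sketch (not the production schema) of resolving a run's
# dataset via run -> task -> task_inputs -> "source_data" instead of
# reading input_data. Table shapes and values are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE run (rid INTEGER PRIMARY KEY, task_id INTEGER);
CREATE TABLE task_inputs (task_id INTEGER, input TEXT, value TEXT);
INSERT INTO run VALUES (21, 284);
INSERT INTO task_inputs VALUES (284, 'source_data', '42');
""")
# Dataset ID for run 21, without consulting input_data:
row = con.execute("""
    SELECT ti.value
    FROM run r
    JOIN task_inputs ti ON ti.task_id = r.task_id
    WHERE r.rid = 21 AND ti.input = 'source_data'
""").fetchone()
print(row[0])  # -> 42
```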
evaluation_engine
Defines the evaluation engines that compute metrics on runs and datasets. Currently contains only one row, as only one engine has ever been used in production.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | Primary key. | 1 | |
| name | varchar(256) | No | Engine name. | weka_engine | ||
| description | text | No | Engine description. | "Default OpenML evaluation engine" |
math_function
Defines evaluation metrics (e.g., accuracy, AUC, RMSE) and their properties.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | Primary key. | 1 | |
| name | varchar(64) | No | Metric name (unique). | EuclidianDistance | ||
| functionType | varchar(128) | No | 'EvaluationFunction' | Type of function. One of Metric, KernelFunction, or EvaluationFunction | Metric | |
| min | varchar(64) | No | Minimum possible value. | 0 | ||
| max | varchar(64) | No | Maximum possible value. | '' | ||
| unit | varchar(64) | No | Unit of measurement. | seconds, bytes | ||
| higherIsBetter | varchar(16) | Yes | NULL | Whether higher values indicate better performance. | true, Yes, 1, False | |
| description | text | Yes | NULL | Description of the metric. | "The area under the ROC..." | |
| source_code | text | No | Source code or formula. | "public double truePositiveRate(..." | ||
| date | timestamp | No | CURRENT_TIMESTAMP | When the function was defined. | 2024-04-19 23:51:52 |
evaluation
Stores aggregated evaluation results for a run (one value per metric).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| source | int unsigned | No | run.rid | Run ID. | 1 | |
| function_id | int | No | math_function.id | Evaluation metric. | 4 | |
| evaluation_engine_id | int | No | evaluation_engine.id | Engine that computed the evaluation. | 1 | |
| value | double | Yes | NULL | Aggregated metric value. | 0.839 | |
| stdev | double | Yes | NULL | Standard deviation across folds/repeats. | 0.06 | |
| array_data | text | Yes | NULL | Per-class or detailed results as array. | "[0.0,0.99113,0.898048,0.874862,0.791282,0.807343,0.820674]" |
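Because evaluation stores only a function_id, reading results typically means joining against math_function to recover the metric name. A minimal in-memory sketch; the rows (including the pairing of id 4 with this metric name) are illustrative, not production data:

```python
import sqlite3

# In-memory SQLite sketch: list (metric name, value) pairs for one run by
# joining evaluation with math_function. Rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE math_function (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE evaluation (source INTEGER, function_id INTEGER,
                         evaluation_engine_id INTEGER, value REAL);
INSERT INTO math_function VALUES (4, 'area_under_roc_curve');
INSERT INTO evaluation VALUES (1, 4, 1, 0.839);
""")
rows = con.execute("""
    SELECT mf.name, e.value
    FROM evaluation e
    JOIN math_function mf ON mf.id = e.function_id
    WHERE e.source = 1
""").fetchall()
print(rows)  # -> [('area_under_roc_curve', 0.839)]
```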
evaluation_fold
Stores per-fold evaluation results for a run.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| source | int unsigned | No | run.rid | Run ID. | 12 | |
| function_id | int | No | math_function.id | Evaluation metric. | 4 | |
| evaluation_engine_id | int | No | evaluation_engine.id | Engine that computed the evaluation. | 1 | |
| fold | int unsigned | No | 0 | Fold number. | 0 | |
| repeat | int unsigned | No | 0 | Repeat number. | 0 | |
| value | double | Yes | NULL | Metric value for this fold/repeat. | 0.5 | |
| array_data | text | Yes | NULL | Per-class or detailed results as array. | "[0.4, 0.6]" |
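How evaluation.value and evaluation.stdev relate to the per-fold rows is not spelled out here; a plausible reading, stated purely as an assumption, is that they are the mean and sample standard deviation of the per-fold values:

```python
from statistics import mean, stdev

# Hypothetical per-fold values for one run/metric, as evaluation_fold would
# store them. The mean/stdev aggregation is an assumption, not confirmed
# by this document.
fold_values = [0.80, 0.85, 0.83, 0.86, 0.81]

aggregated_value = mean(fold_values)  # would land in evaluation.value
fold_stdev = stdev(fold_values)       # would land in evaluation.stdev
print(round(aggregated_value, 3), round(fold_stdev, 3))  # -> 0.83 0.025
```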
evaluation_sample
Stores per-sample evaluation results (for learning curves). Evaluation samples seem to be part of a run upload, but may not be retrievable through the current PHP API.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| source | int unsigned | No | run.rid | Run ID. | 12 | |
| function_id | int | No | math_function.id | Evaluation metric. | 4 | |
| evaluation_engine_id | int | No | evaluation_engine.id | Engine that computed the evaluation. | 1 | |
| repeat | int unsigned | No | 0 | Repeat number. | 0 | |
| fold | int unsigned | No | 0 | Fold number. | 0 | |
| sample | int unsigned | No | 0 | Sample index. | 0 | |
| sample_size | int | No | Number of training instances in this sample. | 100 | ||
| value | double | Yes | NULL | Metric value for this sample. | 0.5 | |
| array_data | text | Yes | NULL | Per-class or detailed results as array. | "[0.4,0.6]" |
trace
Stores optimization traces (e.g., hyperparameter search iterations) for a run.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| run_id | int unsigned | No | run.rid | Run ID. | 2 | |
| evaluation_engine_id | int | No | 1 | evaluation_engine.id | Engine that evaluated the trace. | 1 |
| repeat | int | No | Repeat number. | 0 | ||
| fold | int | No | Fold number. | 0 | ||
| iteration | int | No | Iteration number in the optimization. | 0 | ||
| setup_string | text | No | Hyperparameter configuration tried. | '{"parameter_minNumObj":"1","parameter_confidenceFactor":"0.1"}' | ||
| evaluation | varchar(265) | No | Evaluation result for this iteration. | 94.12 | ||
| selected | enum('true','false') | No | Whether this was the selected configuration. | true |
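Extracting the winning configuration out of a trace reduces to filtering on selected = 'true' and parsing setup_string as JSON. A self-contained sketch with made-up rows (not production data):

```python
import json
import sqlite3

# In-memory SQLite sketch: find the hyperparameter configuration an
# optimizer selected for repeat 0 / fold 0 of a run. Rows are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE trace (run_id INTEGER, "repeat" INTEGER, fold INTEGER,
                    iteration INTEGER, setup_string TEXT,
                    evaluation TEXT, selected TEXT);
INSERT INTO trace VALUES
  (2, 0, 0, 0, '{"parameter_confidenceFactor":"0.1"}', '94.12', 'true'),
  (2, 0, 0, 1, '{"parameter_confidenceFactor":"0.25"}', '93.80', 'false');
""")
row = con.execute("""
    SELECT setup_string FROM trace
    WHERE run_id = 2 AND "repeat" = 0 AND fold = 0 AND selected = 'true'
""").fetchone()
params = json.loads(row[0])
print(params)  # -> {'parameter_confidenceFactor': '0.1'}
```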
Studies
Studies have historically been in flux, but they are generally collections of objects (e.g., tasks).
One major change during OpenML's lifetime, around 2019, was in how those collections are defined.
Nowadays dedicated join tables (e.g., task_study) connect a study to its tasks.
Historically, in what is now referred to as a "legacy study", this association was achieved through tags.
E.g., all tasks with the tag study_14 would be considered part of study_14.
Some studies are still defined this way; migrating them to new-style studies is planned.
The Python-based REST API will not support legacy-style studies, so this migration is required to make them usable with the new API.
Currently, a "benchmark suite" refers to a collection of tasks and a "benchmark study" to a collection of runs. Both are found in the study table. It is likely this distinction will be dropped in the future in favor of a more general "collection".
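The two membership mechanisms can be contrasted in a small in-memory sketch. This assumes a task_tag table shaped like the other tag tables (an id column plus a tag string); all rows are made up:

```python
import sqlite3

# In-memory SQLite sketch contrasting legacy (tag-based) and new-style
# (join-table) study membership. The task_tag shape is an assumption;
# all rows are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE task_tag (id INTEGER, tag TEXT);
CREATE TABLE task_study (study_id INTEGER, task_id INTEGER);
INSERT INTO task_tag VALUES (18257, 'study_14'), (99, 'study_14');
INSERT INTO task_study VALUES (5, 18257), (5, 99);
""")
# Legacy study 14: membership is implied by tasks carrying the tag "study_14".
legacy = [r[0] for r in con.execute(
    "SELECT id FROM task_tag WHERE tag = 'study_14' ORDER BY id")]
# New-style study 5: membership is an explicit row in task_study.
new_style = [r[0] for r in con.execute(
    "SELECT task_id FROM task_study WHERE study_id = 5 ORDER BY task_id")]
print(legacy, new_style)  # -> [99, 18257] [99, 18257]
```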
study
Represents a collection of runs or tasks, used for benchmarking and reproducibility.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| id | int | No | auto_increment | Primary key (study ID). | 2 | |
| alias | varchar(32) | Yes | NULL | Short unique alias for the study. | automl_benchmark | |
| main_entity_type | enum('run','task') | No | 'run' | Whether the study collects runs or tasks. | task |
| benchmark_suite | int | Yes | NULL | study.id | Reference to a task study used as benchmark suite. | 2 |
| name | varchar(256) | No | Study name. | "A friendly benchmarking suite for AutoML systems" | ||
| description | text | No | Study description. | "See also JMLR..." | ||
| visibility | varchar(64) | No | 'public' | Visibility level. Only ever 'public' | public | |
| status | enum('in_preparation','active','deactivated') | No | 'in_preparation' | Current status. | active | |
| creation_date | datetime | No | When the study was created. | 2021-02-04 20:48:58 | ||
| creator | mediumint unsigned | No | openml.users.id | User who created the study. | 3 | |
| legacy | enum('y','n') | No | 'y' | Whether this is a legacy study. | y |
study_tag
Tags applied to studies, with optional time windows and access control. It is unclear why this table has a different design than the other tag tables, but it likely has to do with the "legacy" study setup.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | study.id | Study ID. | 28 | |
| tag | varchar(255) | No | Tag string. | published, jmlr | ||
| window_start | datetime | Yes | NULL | Start of the tag's validity window. | 2020-01-01 12:23:42 | |
| window_end | datetime | Yes | NULL | End of the tag's validity window. | 2021-04-11 19:54:24 | |
| write_access | enum('private','public') | No | 'private' | Who can add this tag. | private |
run_study
A join table for the "new" style studies. Associates runs with studies (many-to-many).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | study.id | Study ID. | 2 | |
| run_id | int unsigned | No | run.rid | Run ID. | 102984 | |
| uploader | mediumint unsigned | No | openml.users.id | User who added the run to the study. | 1815 | |
| date | timestamp | No | CURRENT_TIMESTAMP | When the run was added. | 2024-05-12 23:10:54 |
task_study
A join table for the "new" style studies. Associates tasks with studies (many-to-many).
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| study_id | int | No | study.id | Study ID. | 5 | |
| task_id | int | No | task.task_id | Task ID. | 18257 | |
| uploader | mediumint unsigned | No | openml.users.id | User who added the task to the study. | 12 | |
| date | timestamp | No | CURRENT_TIMESTAMP | When the task was added. | 2022-04-28 08:28:49 |
Community
The awarded_badges, likes, downvotes and downvote_reasons tables are not currently in use.
They may be added back with the new frontend (except for awarded_badges).
likes
Records user likes on datasets, flows, runs, or other entities.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| lid | int | No | auto_increment | Unique ID. | 1 | |
| knowledge_type | varchar(1) | No | Entity type code (e.g., 'd' for dataset). | d, f, r, t | ||
| knowledge_id | int | No | Entity ID. | 20248 | ||
| user_id | mediumint | No | User who liked the entity. | 241 | ||
| time | timestamp | No | CURRENT_TIMESTAMP | When the like was recorded. | 2019-09-08 17:47:57 |
downvotes
Records user downvotes on entities with a reason.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int | No | auto_increment | Unique ID. | ||
| knowledge_type | varchar(1) | No | Entity type code. | |||
| knowledge_id | int | No | Entity ID. | |||
| user_id | mediumint | No | User who downvoted. | |||
| reason | int | No | Reason for the downvote (references downvote_reasons). | |||
| time | timestamp | No | CURRENT_TIMESTAMP | When the downvote was recorded. | ||
| original | tinyint | No | 0 | Whether this is the original downvote (vs. an update). |
downvote_reasons
Defines the reasons for downvoting an entity.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| reason_id | int | No | auto_increment | Primary key. | ||
| description | varchar(256) | No | Description of the reason. |
downloads
Tracks download counts per user and entity.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| did | int | No | auto_increment | Unique ID. | ||
| knowledge_type | varchar(1) | No | Entity type code. | |||
| knowledge_id | int | No | Entity ID. | |||
| user_id | mediumint | No | User who downloaded. | |||
| count | smallint | No | 1 | Number of downloads. | ||
| time | timestamp | No | CURRENT_TIMESTAMP | Last download time. |
Miscellaneous
The notebook and pdnresults tables are no longer in use.
schedule
Defines scheduled experiment jobs to be executed. Appears to have been out of use since 2017.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| sid | int | No | Setup ID. | |||
| task_id | int | No | Task ID to run. | |||
| experiment | varchar(128) | No | Experiment identifier. | |||
| active | enum('true','false') | No | 'true' | Whether the schedule is active. | ||
| last_assigned | datetime | Yes | NULL | When this job was last assigned to a worker. | ||
| ttid | int | No | Task type ID. | |||
| dependencies | varchar(128) | No | Dependency identifiers. | |||
| setup_string | text | No | Serialized setup configuration. |
kaggle
Maps OpenML datasets to Kaggle competitions or datasets. Defined by hand in collaboration with Kaggle. Only used by the frontend.
| Column | Type | Optional | Default | References | Description | Example |
|---|---|---|---|---|---|---|
| dataset_id | int | Yes | NULL | OpenML dataset ID. | 6 | |
| kaggle_link | varchar(500) | Yes | NULL | URL to the Kaggle page. | https://www.kaggle.com/datasets/nishan192/letterrecognition-using-svm |