openml.datasets.create_dataset(name, description, creator, contributor, collection_date, language, licence, attributes, data, default_target_attribute, ignore_attribute, citation, row_id_attribute=None, original_data_url=None, paper_url=None, update_comment=None, version_label=None)

Create a dataset.

This function creates an OpenMLDataset object. The OpenMLDataset object contains information related to the dataset and the actual data file.


Parameters

name : str

Name of the dataset.


description : str

Description of the dataset.


creator : str

The person who created the dataset.


contributor : str

People who contributed to the current version of the dataset.


collection_date : str

The date the data was originally collected, given by the uploader.


language : str

Language in which the data is represented. Starts with one upper-case letter followed by lower-case letters, e.g. ‘English’.


licence : str

License of the data.

attributes : list, dict, or ‘auto’

A list of tuples. Each tuple consists of the attribute name and type. If passing a pandas DataFrame, the attributes can be inferred automatically by passing 'auto'. Specific attributes can be specified manually by passing a dictionary where the key is the name of the attribute and the value is the data type of the attribute.
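As a rough illustration of the three accepted forms, the sketch below builds each one in plain Python. The column names and the ARFF-style type strings ('REAL', or a list of allowed values for a nominal attribute) are assumptions chosen for the example, not values taken from this reference.

```python
# Illustrative only: column names and type strings are assumptions.

# Form 1: an explicit list of (attribute name, type) tuples.
# A list of strings as the type is assumed here to denote a nominal attribute.
attributes_as_list = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "REAL"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]

# Form 2: let the attributes be inferred (only when data is a pandas DataFrame).
attributes_inferred = "auto"

# Form 3: a dictionary that manually specifies selected attributes,
# keyed by attribute name with the data type as the value.
attributes_overrides = {"temperature": "REAL"}

# Collect the attribute names from the explicit list.
attribute_names = [name for name, _ in attributes_as_list]
print(attribute_names)
```

Exactly one of these forms would be passed as the attributes argument; the dictionary form is useful when most types can be inferred but a few columns need an explicit type.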

data : ndarray, list, dataframe, coo_matrix, shape (n_samples, n_features)

An array that contains both the attributes and the targets. When providing a dataframe, the attribute names and types can be inferred by passing attributes='auto'. The target feature is indicated as meta-data of the dataset.


default_target_attribute : str

The default target attribute, if it exists. Can have multiple values, comma separated.

ignore_attribute : str | list

Attributes that should be excluded in modelling, such as identifiers and indexes. Can have multiple values, comma separated.


citation : str

Reference(s) that should be cited when building on this data.

version_label : str, optional

Version label provided by the user. Can be a date, hash, or some other type of ID.

row_id_attribute : str, optional

The attribute that represents the row-id column, if present in the dataset. If data is a dataframe and row_id_attribute is not specified, the index of the dataframe will be used as the row_id_attribute. If the name of the index is None, it will be discarded.
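The fallback rule for dataframes can be sketched as a small helper. Note that `effective_row_id` is a hypothetical function written only to mirror the description above; it is not part of the library, and the index names used are made up.

```python
from typing import Optional


def effective_row_id(row_id_attribute: Optional[str],
                     index_name: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the documented rule for dataframes.

    row_id_attribute: the value passed by the caller (None if omitted).
    index_name: the name of the dataframe index (None if unnamed).
    """
    if row_id_attribute is not None:
        # An explicitly specified row-id attribute is used as given.
        return row_id_attribute
    # Otherwise fall back to the index name. A result of None means the
    # index is unnamed and is discarded, so there is no row-id column.
    return index_name


print(effective_row_id(None, "sample_id"))
print(effective_row_id("patient_id", "sample_id"))
print(effective_row_id(None, None))
```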

original_data_url : str, optional

For derived data, the url to the original dataset.

paper_url : str, optional

Link to a paper describing the dataset.

update_comment : str, optional

An explanation for when the dataset is uploaded.


Returns

openml.OpenMLDataset

Dataset description.
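A minimal sketch of a complete call, using the argument names from the signature above, might look as follows. All metadata values and data rows are made-up placeholders, `build_dataset` is a hypothetical wrapper, and passing None for ignore_attribute when nothing should be ignored is an assumption. The openml import is deferred so that the argument preparation can be run without the package installed.

```python
# Illustrative placeholders throughout: names, values, and rows are made up.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),  # assumed nominal type
    ("temperature", "REAL"),                      # assumed numeric type
    ("play", ["yes", "no"]),                      # assumed nominal type
]

# Each row holds one value per attribute, in the same order as above.
data = [
    ["sunny", 85.0, "no"],
    ["overcast", 83.0, "yes"],
    ["rainy", 70.0, "yes"],
]

dataset_kwargs = dict(
    name="Weather-example",
    description="Toy weather data used to illustrate create_dataset.",
    creator="Jane Doe",
    contributor="John Smith",
    collection_date="2024-01-01",
    language="English",
    licence="CC0",
    attributes=attributes,
    data=data,
    default_target_attribute="play",
    ignore_attribute=None,  # assumption: None when no attribute is ignored
    citation="None",
)


def build_dataset(kwargs):
    """Hypothetical wrapper: create the OpenMLDataset from prepared arguments."""
    import openml  # imported lazily; requires the openml package

    return openml.datasets.create_dataset(**kwargs)
```

With openml installed, `dataset = build_dataset(dataset_kwargs)` would return the OpenMLDataset object described above.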