openml.datasets.create_dataset

openml.datasets.create_dataset(name: str, description: str | None, creator: str | None, contributor: str | None, collection_date: str | None, language: str | None, licence: str | None, attributes: list[tuple[str, str | list[str]]] | dict[str, str | list[str]] | Literal['auto'], data: pd.DataFrame | np.ndarray | scipy.sparse.coo_matrix, default_target_attribute: str, ignore_attribute: str | list[str] | None, citation: str, row_id_attribute: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, version_label: str | None = None) -> OpenMLDataset

Create a dataset.

This function creates an OpenMLDataset object. The OpenMLDataset object contains information related to the dataset and the actual data file.

Parameters:
name : str

Name of the dataset.

description : str

Description of the dataset.

creator : str

The person who created the dataset.

contributor : str

People who contributed to the current version of the dataset.

collection_date : str

The date the data was originally collected, given by the uploader.

language : str

Language in which the data is represented. Starts with 1 upper case letter, rest lower case, e.g. ‘English’.

licence : str

License of the data.

attributes : list, dict, or 'auto'

A list of tuples, each consisting of the attribute name and its type. If passing a pandas DataFrame, the attributes can be inferred automatically by passing 'auto'. Specific attributes can be specified manually by passing a dictionary where the key is the name of the attribute and the value is the data type of the attribute.

data : ndarray, list, dataframe, coo_matrix, shape (n_samples, n_features)

An array that contains both the attributes and the targets. When providing a dataframe, the attribute names and types can be inferred by passing attributes='auto'. The target feature is indicated as meta-data of the dataset.

default_target_attribute : str

The default target attribute, if it exists. Can have multiple values, comma separated.

ignore_attribute : str | list

Attributes that should be excluded in modelling, such as identifiers and indexes. Can have multiple values, comma separated.

citation : str

Reference(s) that should be cited when building on this data.

version_label : str, optional

Version label provided by user. Can be a date, hash, or some other type of id.

row_id_attribute : str, optional

The attribute that represents the row-id column, if present in the dataset. If data is a dataframe and row_id_attribute is not specified, the index of the dataframe will be used as the row_id_attribute. If the name of the index is None, it will be discarded.

original_data_url : str, optional

For derived data, the URL of the original dataset.

paper_url : str, optional

Link to a paper describing the dataset.

update_comment : str, optional

An explanation for when the dataset is uploaded.
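
The three forms accepted by attributes can be sketched as follows. This is an illustration only: the attribute names and values are made up, the type strings ("REAL", "INTEGER") are assumed to follow ARFF conventions, and attribute_names is a hypothetical helper that is not part of openml.

```python
# List form: one (name, type) tuple per column, in column order.
# A nominal attribute gives the list of its allowed values; other
# attributes name a data type (ARFF-style type strings are assumed here).
attributes_as_list = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "REAL"),
    ("humidity", "REAL"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]

# Dictionary form: attribute name -> data type, for the attributes
# whose type should be specified manually.
attributes_as_dict = {"temperature": "INTEGER", "humidity": "INTEGER"}

# With a pandas DataFrame, names and types can instead be inferred.
attributes_inferred = "auto"


def attribute_names(attributes):
    """Return the attribute names from the list or dict form (hypothetical helper)."""
    if isinstance(attributes, dict):
        return list(attributes)
    return [name for name, _ in attributes]


print(attribute_names(attributes_as_list))
print(attribute_names(attributes_as_dict))
```
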

Returns:
openml.OpenMLDataset
Dataset description.
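
A minimal sketch of a call, assuming the openml package is installed and that list-of-lists data is paired with an explicit attributes list. All metadata values below (name, creator, date, citation text) are placeholders, and build_dataset is a hypothetical wrapper so that the keyword arguments can be inspected without importing openml.

```python
# Rows contain both the features and the target, in attribute order.
data = [
    ["sunny", 85, 85, "FALSE", "no"],
    ["overcast", 83, 86, "FALSE", "yes"],
    ["rainy", 70, 96, "FALSE", "yes"],
]

attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "REAL"),
    ("humidity", "REAL"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]

# Placeholder metadata; replace with real values for an actual dataset.
create_dataset_kwargs = dict(
    name="Weather",
    description="Toy weather data (illustrative values).",
    creator="Example Author",
    contributor=None,
    collection_date="2024-01-01",
    language="English",
    licence="Public",
    attributes=attributes,
    data=data,
    default_target_attribute="play",
    ignore_attribute=None,
    citation="Placeholder citation text",
    version_label="example",
)


def build_dataset():
    """Create the OpenMLDataset object (requires the openml package)."""
    import openml  # imported lazily so the sketch loads without openml

    return openml.datasets.create_dataset(**create_dataset_kwargs)
```

Calling build_dataset() should return the OpenMLDataset object described above.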