openml.datasets.create_dataset
- openml.datasets.create_dataset(name: str, description: str | None, creator: str | None, contributor: str | None, collection_date: str | None, language: str | None, licence: str | None, attributes: list[tuple[str, str | list[str]]] | dict[str, str | list[str]] | Literal['auto'], data: pd.DataFrame | np.ndarray | scipy.sparse.coo_matrix, default_target_attribute: str, ignore_attribute: str | list[str] | None, citation: str, row_id_attribute: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, version_label: str | None = None) -> OpenMLDataset
Create a dataset.
This function creates an OpenMLDataset object. The OpenMLDataset object contains information related to the dataset and the actual data file.
- Parameters:
- name : str
Name of the dataset.
- description : str
Description of the dataset.
- creator : str
The person who created the dataset.
- contributor : str
People who contributed to the current version of the dataset.
- collection_date : str
The date the data was originally collected, given by the uploader.
- language : str
Language in which the data is represented. Starts with 1 upper case letter, rest lower case, e.g. ‘English’.
- licence : str
License of the data.
- attributes : list, dict, or ‘auto’
A list of tuples, each consisting of the attribute name and type. If data is a pandas DataFrame, the attributes can be inferred automatically by passing 'auto'. Specific attributes can be specified manually by passing a dictionary where the key is the name of the attribute and the value is the data type of the attribute.
- data : ndarray, list, dataframe, coo_matrix, shape (n_samples, n_features)
An array that contains both the attributes and the targets. When providing a dataframe, the attribute names and types can be inferred by passing attributes='auto'. The target feature is indicated as meta-data of the dataset.
- default_target_attribute : str
The default target attribute, if it exists. Can have multiple values, comma separated.
- ignore_attribute : str | list
Attributes that should be excluded in modelling, such as identifiers and indexes. Can have multiple values, comma separated.
- citation : str
Reference(s) that should be cited when building on this data.
- version_label : str, optional
Version label provided by user. Can be a date, hash, or some other type of id.
- row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the dataset. If data is a dataframe and row_id_attribute is not specified, the index of the dataframe will be used as the row_id_attribute. If the name of the index is None, it will be discarded.
- original_data_url : str, optional
For derived data, the url to the original dataset.
- paper_url : str, optional
Link to a paper describing the dataset.
- update_comment : str, optional
An explanation for when the dataset is uploaded.
- Returns:
- openml.OpenMLDataset
Dataset description.
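Example (a minimal sketch, not taken from the library's own documentation): the snippet below builds a small pandas DataFrame and collects the arguments for create_dataset, relying on attributes='auto' to infer attribute names and types. It assumes pandas and the openml package are installed. The column names, the metadata values, and the helper name make_dataset are all hypothetical. The call to create_dataset is wrapped in a function so that the data preparation can be run on its own.

```python
import pandas as pd

# Toy data; every column name and value here is made up for illustration.
df = pd.DataFrame(
    {
        "outlook": pd.Categorical(["sunny", "overcast", "rainy", "sunny"]),
        "temperature": [29.5, 21.0, 18.2, 25.1],
        "humidity": [85, 70, 96, 60],
        "play": pd.Categorical(["no", "yes", "yes", "yes"]),
    }
)

# Keyword arguments mirroring the signature documented above.
# Every parameter without a default in that signature is listed explicitly.
dataset_kwargs = dict(
    name="weather-toy",
    description="Toy weather data for demonstration.",
    creator="Jane Doe",
    contributor=None,
    collection_date="2024",
    language="English",
    licence="CC0",
    attributes="auto",  # infer names and types from the DataFrame
    data=df,
    default_target_attribute="play",
    ignore_attribute=None,
    citation="None",
    version_label="example",  # optional
)


def make_dataset():
    """Create the OpenMLDataset object (hypothetical helper).

    openml is imported here so the data preparation above runs
    even where the package is not installed.
    """
    import openml

    return openml.datasets.create_dataset(**dataset_kwargs)
```

Calling make_dataset() should return an OpenMLDataset holding the metadata and the data. Uploading it to the server is a separate step (OpenMLDataset.publish()) that requires a configured API key and is not shown here.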