openml.datasets.OpenMLDataset

class openml.datasets.OpenMLDataset(name: str, description: str | None, data_format: Literal['arff', 'sparse_arff'] = 'arff', cache_format: Literal['feather', 'pickle'] = 'pickle', dataset_id: int | None = None, version: int | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, upload_date: str | None = None, language: str | None = None, licence: str | None = None, url: str | None = None, default_target_attribute: str | None = None, row_id_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, version_label: str | None = None, citation: str | None = None, tag: str | None = None, visibility: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, md5_checksum: str | None = None, data_file: str | None = None, features_file: str | None = None, qualities_file: str | None = None, dataset: str | None = None, parquet_url: str | None = None, parquet_file: str | None = None)

Dataset object.

Allows fetching and uploading datasets to OpenML.

Parameters:
name: str

Name of the dataset.

description: str or None

Description of the dataset.

data_format: str (default=‘arff’)

Format of the dataset, either ‘arff’ or ‘sparse_arff’.

cache_format: str (default=‘pickle’)

Format for caching the dataset, either ‘feather’ or ‘pickle’.

dataset_id: int, optional

ID generated automatically by the server.

version: int, optional

Version of this dataset. ‘1’ for the original version. Auto-incremented by the server.

creator: str, optional

The person who created the dataset.

contributor: str, optional

People who contributed to the current version of the dataset.

collection_date: str, optional

The date the data was originally collected, given by the uploader.

upload_date: str, optional

The date-time when the dataset was uploaded, generated by the server.

language: str, optional

Language in which the data is represented. Starts with one uppercase letter, the rest lowercase, e.g. ‘English’.

licence: str, optional

License of the data.

url: str, optional

Valid URL that points to the actual data file. The file can be on the OpenML server or in another dataset repository.

default_target_attribute: str, optional

The default target attribute, if it exists. Can hold multiple values, comma-separated.

row_id_attribute: str, optional

The attribute that represents the row-id column, if present in the dataset.

ignore_attribute: str | list[str], optional

Attributes that should be excluded from modelling, such as identifiers and indexes.

version_label: str, optional

Version label provided by the user. Can be a date, a hash, or some other type of ID.

citation: str, optional

Reference(s) that should be cited when building on this data.

tag: str, optional

Tags describing the dataset.

visibility: str, optional

Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.

original_data_url: str, optional

For derived data, the URL of the original dataset.

paper_url: str, optional

Link to a paper describing the dataset.

update_comment: str, optional

An explanation given when the dataset is uploaded.

md5_checksum: str, optional

MD5 checksum used to check that the dataset was downloaded without corruption.

data_file: str, optional

Path to where the dataset is located.

features_file: str, optional

Path to a file with the dataset features, which map a feature index to an OpenMLDataFeature.

qualities_file: str, optional

Path to a file with the dataset qualities, which map a quality name to a quality value.

dataset: str, optional

Serialized ARFF dataset string.

parquet_url: str, optional

URL of the storage location where the dataset files are hosted, for example a MinIO bucket URL. If specified, the data is read from this URL.

parquet_file: str, optional

Path to the local parquet file.
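The md5_checksum parameter exists so that a downloaded data file can be checked for corruption. The sketch below shows one way such a check could be done with the standard library. The helper names `file_md5` and `matches_checksum` are hypothetical and not part of OpenML, and this is not necessarily how the library performs the check internally.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read in chunks so that large dataset files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, md5_checksum):
    """Compare a file's digest with an expected checksum, ignoring case."""
    return file_md5(path) == md5_checksum.strip().lower()
```

A caller would pass the local path of the downloaded file together with the checksum reported for the dataset, and treat a False result as a reason to download the file again.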

property features: dict[int, OpenMLDataFeature]

Get the features of this dataset.

get_data(target: list[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False, dataset_format: Literal['array', 'dataframe'] = 'dataframe') → tuple[np.ndarray | pd.DataFrame | scipy.sparse.csr_matrix, np.ndarray | pd.DataFrame | None, list[bool], list[str]]

Returns dataset content as dataframes or sparse matrices.

Parameters:
target: str, List[str] or None (default=None)

Name of the target column to separate from the data. Splitting multiple columns is currently not supported.

include_row_id: bool (default=False)

Whether to include row ids in the returned dataset.

include_ignore_attribute: bool (default=False)

Whether to include columns that are marked as “ignore” on the server in the returned dataset.

dataset_format: str (default=‘dataframe’)

The format of the returned dataset. If ‘array’, the returned dataset will be a NumPy array or a SciPy sparse matrix; support for ‘array’ will be removed in 0.15. If ‘dataframe’, the returned dataset will be a pandas DataFrame.

Returns:
X: ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns)

Dataset.

y: ndarray or pd.Series, shape (n_samples, ) or None

Target column.

categorical_indicator: boolean ndarray

Mask that indicates categorical features.

attribute_names: List[str]

List of attribute names.
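As a minimal sketch of how the four return values fit together, the helper below calls get_data with the documented parameters and uses the categorical mask to pick out attribute names. The function name `split_features_and_target` is hypothetical, and the sketch assumes that the dataset object exposes default_target_attribute as an attribute and that it names a single column, since splitting multiple columns is not supported.

```python
def split_features_and_target(dataset):
    """Fetch the data and list the names of the categorical attributes.

    `dataset` is expected to behave like an OpenMLDataset, for example an
    object obtained with openml.datasets.get_dataset(...).
    """
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=dataset.default_target_attribute,
        dataset_format="dataframe",
    )
    # The mask and the name list are aligned by position, so they can be
    # zipped to recover the names of the categorical columns.
    categorical_names = [
        name
        for name, is_categorical in zip(attribute_names, categorical_indicator)
        if is_categorical
    ]
    return X, y, categorical_names
```

Calling the helper on a dataset object triggers the same data loading as get_data itself, so the first call may download and cache the data.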

get_features_by_type(data_type: str, exclude: list[str] | None = None, exclude_ignore_attribute: bool = True, exclude_row_id_attribute: bool = True) → list[int]

Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.

Parameters:
data_type: str

The data type to return (e.g., nominal, numeric, date, string).

exclude: list[str] or None (default=None)

List of columns to exclude from the return value.

exclude_ignore_attribute: bool (default=True)

Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present).

exclude_row_id_attribute: bool (default=True)

Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present).

Returns:
result: list

A list of indices of the features that have the specified data type.
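The phrase "adapt the return values as if these indices are not present" means that the returned indices refer to positions in the data after the excluded columns have been dropped. The stand-alone sketch below illustrates that index shift on a plain list of (name, data type) pairs. The function `indices_by_type` and its input format are hypothetical and only mirror the documented behaviour; they are not the library's implementation.

```python
def indices_by_type(features, data_type, excluded_names=()):
    """Return positions of features with `data_type`, skipping exclusions.

    `features` is an ordered list of (name, data_type) pairs. Each excluded
    feature shifts the index of every later feature down by one.
    """
    excluded = set(excluded_names)
    result = []
    offset = 0
    for index, (name, feature_type) in enumerate(features):
        if name in excluded:
            # This column is treated as absent from the data.
            offset += 1
        elif feature_type == data_type:
            result.append(index - offset)
    return result


features = [
    ("id", "numeric"),
    ("sepal_length", "numeric"),
    ("colour", "nominal"),
    ("species", "nominal"),
]
print(indices_by_type(features, "nominal"))  # [2, 3]
print(indices_by_type(features, "nominal", excluded_names=["id"]))  # [1, 2]
```

With the "id" column excluded, the two nominal features move from positions 2 and 3 to positions 1 and 2, which matches the columns of a data matrix that no longer contains "id".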

property id: int | None

Get the dataset numeric id.

open_in_browser() → None

Opens the OpenML web page corresponding to this object in your default browser.

property openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

publish() → OpenMLBase

Publish the object on the OpenML server.

push_tag(tag: str) → None

Annotates this entity with a tag on the server.

Parameters:
tag: str

Tag to attach to the entity.

property qualities: dict[str, float] | None

Get the qualities of this dataset.

remove_tag(tag: str) → None

Removes a tag from this entity on the server.

Parameters:
tag: str

Tag to remove from the entity.
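Since push_tag and remove_tag each act on a single tag, renaming a tag takes two calls. The sketch below combines them in a hypothetical helper, `replace_tag`, which is not part of OpenML. It assumes only that the object passed in exposes the two methods documented above. Both calls change data on the server, so they presumably require a configured account.

```python
def replace_tag(entity, old_tag, new_tag):
    """Swap one tag for another on an object with push_tag and remove_tag."""
    if old_tag == new_tag:
        return  # Nothing to change, so skip both server calls.
    # Attach the new tag first so that the entity keeps at least one of the
    # two tags if the second call fails.
    entity.push_tag(new_tag)
    entity.remove_tag(old_tag)
```

The order of the two calls is a design choice: adding before removing means a failure halfway leaves an extra tag behind rather than a missing one.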

retrieve_class_labels(target_name: str = 'class') → None | list[str]

Reads the dataset’s ARFF file to determine the class labels.

If the target has no class labels (for example, in a regression problem), it returns None. This is necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class names when uploading the results of a run.

Parameters:
target_name: str (default=‘class’)

Name of the target attribute.

Returns:
list[str] or None
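To illustrate why the class labels are needed, the sketch below translates class indices back into label strings. The helper `decode_class_indices` is hypothetical, and it assumes that each index is a position in the list returned by retrieve_class_labels.

```python
def decode_class_indices(class_indices, class_labels):
    """Translate integer class indices into their label strings."""
    if class_labels is None:
        # retrieve_class_labels returns None when there are no class labels,
        # for example for a regression target.
        raise ValueError("no class labels available for this target")
    return [class_labels[int(index)] for index in class_indices]


labels = ["setosa", "versicolor", "virginica"]
print(decode_class_indices([0, 2, 1], labels))
# ['setosa', 'virginica', 'versicolor']
```

In practice the second argument would come from a call such as dataset.retrieve_class_labels(target_name=...), and the first from the encoded target values.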

classmethod url_for_id(id_: int) → str

Return the OpenML URL for the object of the class entity with the given id.