openml.datasets.OpenMLDataset
- class openml.datasets.OpenMLDataset(name: str, description: str | None, data_format: Literal['arff', 'sparse_arff'] = 'arff', cache_format: Literal['feather', 'pickle'] = 'pickle', dataset_id: int | None = None, version: int | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, upload_date: str | None = None, language: str | None = None, licence: str | None = None, url: str | None = None, default_target_attribute: str | None = None, row_id_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, version_label: str | None = None, citation: str | None = None, tag: str | None = None, visibility: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, md5_checksum: str | None = None, data_file: str | None = None, features_file: str | None = None, qualities_file: str | None = None, dataset: str | None = None, parquet_url: str | None = None, parquet_file: str | None = None)
Dataset object.
Allows fetching and uploading datasets to OpenML.
- Parameters:
- name : str
Name of the dataset.
- description : str
Description of the dataset.
- data_format : str
Format of the dataset, either ‘arff’ or ‘sparse_arff’.
- cache_format : str
Format for caching the dataset, either ‘feather’ or ‘pickle’.
- dataset_id : int, optional
Id autogenerated by the server.
- version : int, optional
Version of this dataset. ‘1’ for the original version. Auto-incremented by the server.
- creator : str, optional
The person who created the dataset.
- contributor : str, optional
People who contributed to the current version of the dataset.
- collection_date : str, optional
The date the data was originally collected, given by the uploader.
- upload_date : str, optional
The date-time when the dataset was uploaded, generated by the server.
- language : str, optional
Language in which the data is represented. Starts with one upper-case letter, rest lower case, e.g. ‘English’.
- licence : str, optional
License of the data.
- url : str, optional
Valid URL that points to the actual data file. The file can be on the OpenML server or another dataset repository.
- default_target_attribute : str, optional
The default target attribute, if it exists. Can have multiple values, comma separated.
- row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the dataset.
- ignore_attribute : str | list, optional
Attributes that should be excluded from modelling, such as identifiers and indexes.
- version_label : str, optional
Version label provided by the user. Can be a date, hash, or some other kind of id.
- citation : str, optional
Reference(s) that should be cited when building on this data.
- tag : str, optional
Tags describing the dataset.
- visibility : str, optional
Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.
- original_data_url : str, optional
For derived data, the URL of the original dataset.
- paper_url : str, optional
Link to a paper describing the dataset.
- update_comment : str, optional
An explanation of what was updated when the dataset is uploaded.
- md5_checksum : str, optional
MD5 checksum used to verify that the dataset was downloaded without corruption.
- data_file : str, optional
Path to where the dataset is located.
- features_file : str, optional
Path to a file describing the dataset features, which map a feature index to an OpenMLDataFeature.
- qualities_file : str, optional
Path to a file describing the dataset qualities, which map a quality name to a quality value.
- dataset : str, optional
Serialized ARFF dataset string.
- parquet_url : str, optional
URL of the storage location where the dataset files are hosted; this can be a MinIO bucket URL. If specified, the data will be accessed from this URL when reading the files.
- parquet_file : str, optional
Path to the local parquet file.
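Since ignore_attribute accepts either a single string or a list of strings, callers often normalize it before further processing. A minimal sketch of that normalization (the helper name and attribute names are invented for illustration, not part of the library):

```python
# Hypothetical helper: normalize the ignore_attribute argument, which the
# OpenMLDataset constructor accepts as a single string or a list of strings.
def normalize_ignore(ignore_attribute):
    if ignore_attribute is None:
        return []
    if isinstance(ignore_attribute, str):
        return [ignore_attribute]
    return list(ignore_attribute)

print(normalize_ignore("row_id"))             # a single name becomes a one-element list
print(normalize_ignore(["row_id", "index"]))  # a list passes through unchanged
```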
- property features: dict[int, OpenMLDataFeature]
Get the features of this dataset.
- get_data(target: list[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False, dataset_format: Literal['array', 'dataframe'] = 'dataframe') → tuple[np.ndarray | pd.DataFrame | scipy.sparse.csr_matrix, np.ndarray | pd.DataFrame | None, list[bool], list[str]]
Returns dataset content as dataframes or sparse matrices.
- Parameters:
- target : str, list[str], or None (default=None)
Name of the target column to separate from the data. Splitting multiple columns is currently not supported.
- include_row_id : bool (default=False)
Whether to include row ids in the returned dataset.
- include_ignore_attribute : bool (default=False)
Whether to include columns that are marked as “ignore” on the server in the dataset.
- dataset_format : str (default=’dataframe’)
The format of the returned dataset. If ‘array’, the returned dataset will be a NumPy array or a SciPy sparse matrix. Support for ‘array’ will be removed in 0.15. If ‘dataframe’, the returned dataset will be a Pandas DataFrame.
- Returns:
- X : ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns)
Dataset
- y : ndarray or pd.Series, shape (n_samples,) or None
Target column
- categorical_indicator : boolean ndarray
Mask that indicates categorical features.
- attribute_names : List[str]
List of attribute names.
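To illustrate the shape of the four-tuple, here is a local sketch built from a toy pandas DataFrame rather than a real OpenML dataset (no server call; the column names are invented). It mirrors how get_data separates the target column and reports which remaining columns are categorical:

```python
import pandas as pd

# Toy stand-in for a downloaded dataset; in real use get_data builds this
# DataFrame from the cached OpenML data file.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": pd.Categorical(["setosa", "setosa", "virginica"]),
})

target = "species"
y = df[target]                  # target column (pd.Series)
X = df.drop(columns=[target])   # remaining feature columns
categorical_indicator = [
    isinstance(dtype, pd.CategoricalDtype) for dtype in X.dtypes
]
attribute_names = list(X.columns)

print(X.shape, categorical_indicator, attribute_names)
```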
- get_features_by_type(data_type: str, exclude: list[str] | None = None, exclude_ignore_attribute: bool = True, exclude_row_id_attribute: bool = True) → list[int]
Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.
- Parameters:
- data_type : str
The data type to return (e.g., nominal, numeric, date, string)
- exclude : list(int)
List of columns to exclude from the return value
- exclude_ignore_attribute : bool
Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present)
- exclude_row_id_attribute : bool
Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present)
- Returns:
- result : list
A list of indices that have the specified data type.
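The filtering itself is straightforward index selection. A local sketch using a plain dict in place of the real OpenMLDataFeature objects (the feature types below are invented):

```python
# Hypothetical stand-in for a dataset's features mapping: index -> data type.
features = {0: "numeric", 1: "nominal", 2: "string", 3: "nominal"}

def indices_of_type(features, data_type, exclude=None):
    """Return sorted indices whose data type matches, skipping excluded ones."""
    excluded = set(exclude or [])
    return [i for i, t in sorted(features.items())
            if t == data_type and i not in excluded]

print(indices_of_type(features, "nominal"))               # -> [1, 3]
print(indices_of_type(features, "nominal", exclude=[1]))  # -> [3]
```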
- property id: int | None
Get the dataset numeric id.
- open_in_browser() → None
Opens the OpenML web page corresponding to this object in your default browser.
- property openml_url: str | None
The URL of the object on the server, if it was uploaded, else None.
- publish() → OpenMLBase
Publish the object on the OpenML server.
- push_tag(tag: str) → None
Annotates this entity with a tag on the server.
- Parameters:
- tag : str
Tag to attach to the dataset.
- property qualities: dict[str, float] | None
Get the qualities of this dataset.
- remove_tag(tag: str) → None
Removes a tag from this entity on the server.
- Parameters:
- tag : str
Tag to remove from the dataset.
- retrieve_class_labels(target_name: str = 'class') → None | list[str]
Reads the dataset's ARFF to determine the class labels.
If the task has no class labels (for example, a regression problem), it returns None. Necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class name when uploading the results of a run.
- Parameters:
- target_name : str
Name of the target attribute
- Returns:
- list of str, or None
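The lookup amounts to finding the target attribute's declaration in the ARFF header and reading the nominal values between the braces. A minimal sketch on a hand-written attribute line (the line and labels are invented; the real method reads the dataset's ARFF file):

```python
import re

# Invented ARFF header line; retrieve_class_labels scans the dataset's
# actual ARFF file for the declaration of the target attribute.
arff_line = "@ATTRIBUTE class {setosa, versicolor, virginica}"

match = re.match(r"@ATTRIBUTE\s+(\S+)\s+\{(.*)\}", arff_line, re.IGNORECASE)
name, body = match.groups()
labels = [value.strip() for value in body.split(",")]

print(name, labels)
```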
- classmethod url_for_id(id_: int) → str
Return the OpenML URL for the object of the class entity with the given id.
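For datasets on the public server, the resulting URL follows the /d/<id> pattern. A hedged sketch that hard-codes the public base URL (the real method derives the base from the configured server, so treat the constant here as an assumption rather than library behavior):

```python
# Assumed public base URL; the library reads this from its server
# configuration instead of a hard-coded constant.
OPENML_BASE = "https://www.openml.org"

def dataset_url(dataset_id: int) -> str:
    return f"{OPENML_BASE}/d/{dataset_id}"

print(dataset_url(61))  # -> https://www.openml.org/d/61
```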