`openml.datasets`.OpenMLDataset

class openml.datasets.OpenMLDataset(name, description, data_format='arff', cache_format='pickle', dataset_id=None, version=None, creator=None, contributor=None, collection_date=None, upload_date=None, language=None, licence=None, url=None, default_target_attribute=None, row_id_attribute=None, ignore_attribute=None, version_label=None, citation=None, tag=None, visibility=None, original_data_url=None, paper_url=None, update_comment=None, md5_checksum=None, data_file=None, features_file: str | None = None, qualities_file: str | None = None, dataset=None, minio_url: str | None = None, parquet_file: str | None = None)

Dataset object.

Allows fetching and uploading datasets to OpenML.

Parameters:

name: str: Name of the dataset.
description: str: Description of the dataset.
data_format: str: Format of the dataset, which can be either ‘arff’ or ‘sparse_arff’.
cache_format: str: Format for caching the dataset, which can be either ‘feather’ or ‘pickle’.
dataset_id: int, optional: Id autogenerated by the server.
version: int, optional: Version of this dataset. ‘1’ for original version. Auto-incremented by server.
creator: str, optional: The person who created the dataset.
contributor: str, optional: People who contributed to the current version of the dataset.
collection_date: str, optional: The date the data was originally collected, given by the uploader.
upload_date: str, optional: The date-time when the dataset was uploaded, generated by server.
language: str, optional: Language in which the data is represented. Starts with one upper-case letter, the rest lower case, e.g. ‘English’.
licence: str, optional: License of the data.
url: str, optional: Valid URL that points to the actual data file. The file can be on the OpenML server or another dataset repository.
default_target_attribute: str, optional: The default target attribute, if it exists. Can have multiple values, comma separated.
row_id_attribute: str, optional: The attribute that represents the row-id column, if present in the dataset.
ignore_attribute: str | list, optional: Attributes that should be excluded in modelling, such as identifiers and indexes.
version_label: str, optional: Version label provided by user. Can be a date, hash, or some other type of id.
citation: str, optional: Reference(s) that should be cited when building on this data.
tag: str, optional: Tags describing the dataset.
visibility: str, optional: Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.
original_data_url: str, optional: For derived data, the URL of the original dataset.
paper_url: str, optional: Link to a paper describing the dataset.
update_comment: str, optional: An explanation for when the dataset is uploaded.
md5_checksum: str, optional: MD5 checksum to check if the dataset is downloaded without corruption.
data_file: str, optional: Path to where the dataset is located.
features: dict, optional: A dictionary of dataset features, which maps a feature index to an OpenMLDataFeature.
qualities: dict, optional: A dictionary of dataset qualities, which maps a quality name to a quality value.
dataset: string, optional: Serialized ARFF dataset string.
minio_url: string, optional: URL to the MinIO bucket with dataset files.
parquet_file: string, optional: Path to the local parquet file.

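get_data(target=None, include_row_id=False, include_ignore_attribute=False, dataset_format='dataframe')

Returns dataset content as dataframes or sparse matrices.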

Parameters:

target: string, List[str] or None (default=None): Name of target column to separate from the data. Splitting multiple columns is currently not supported.
include_row_id: boolean (default=False): Whether to include row ids in the returned dataset.
include_ignore_attribute: boolean (default=False): Whether to include columns that are marked as “ignore” on the server in the dataset.
dataset_format: string (default=’dataframe’): The format of the returned dataset. If array, the returned dataset will be a NumPy array or a SciPy sparse matrix. Support for array will be removed in 0.15. If dataframe, the returned dataset will be a Pandas DataFrame.

Returns:

X: ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns): Dataset.
y: ndarray or pd.Series, shape (n_samples, ) or None: Target column.
categorical_indicator: boolean ndarray: Mask that indicates categorical features.
attribute_names: List[str]: List of attribute names.
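
The four return values described above can be unpacked in one statement. The following is a minimal sketch, not part of the library: `load_xy` is a hypothetical helper name, and it assumes only that the object passed in exposes a `get_data` method with the parameters and return order documented here.

```python
def load_xy(dataset, target=None):
    """Unpack the four values returned by ``dataset.get_data``.

    ``load_xy`` is a hypothetical helper; ``dataset`` is assumed to be an
    object exposing ``get_data`` as documented above.
    """
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=target,
        dataset_format="dataframe",
    )
    # Pair each attribute name with its entry in the boolean mask and keep
    # only the names flagged as categorical.
    categorical_names = [
        name
        for name, is_categorical in zip(attribute_names, categorical_indicator)
        if is_categorical
    ]
    return X, y, categorical_names
```

With the real library, a call along the lines of `load_xy(openml.datasets.get_dataset(61), target="class")` would be the intended use; the dataset id and target name in that call are placeholders, and fetching a dataset requires network access.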

get_features_by_type(data_type, exclude=None, exclude_ignore_attribute=True, exclude_row_id_attribute=True)

Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.

Parameters:

data_type: str

The data type to return (e.g., nominal, numeric, date, string).

exclude: list(int)

Indices to exclude (the return values are adapted as if these indices were not present).

exclude_ignore_attribute: bool

Whether to exclude the defined ignore attributes (the return values are adapted as if these indices were not present).

exclude_row_id_attribute: bool

Whether to exclude the defined row id attributes (the return values are adapted as if these indices were not present).

Returns:

result: list: A list of indices that have the specified data type.
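
The phrase "as if these indices were not present" means that returned indices are shifted down past every excluded column. The following standalone sketch illustrates that adjustment on a plain list of type names. It is an illustration of the described behaviour under that reading, not the library's implementation, and `indices_of_type` is a hypothetical name.

```python
def indices_of_type(feature_types, data_type, exclude=None):
    """Return positions of ``data_type`` after dropping excluded columns.

    ``feature_types`` is a plain list of type names, one per column; this
    stands in for the dataset's feature metadata (an assumption made for
    illustration only).
    """
    excluded = set(exclude or [])
    result = []
    offset = 0  # number of excluded columns seen so far
    for idx, feature_type in enumerate(feature_types):
        if idx in excluded:
            # An excluded column shifts every later index down by one.
            offset += 1
        elif feature_type == data_type:
            result.append(idx - offset)
    return result


types = ["numeric", "nominal", "numeric", "nominal"]
print(indices_of_type(types, "nominal"))               # [1, 3]
print(indices_of_type(types, "nominal", exclude=[0]))  # [0, 2]
```

The adjusted indices line up with a data matrix from which the excluded columns have already been removed.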

property id: int | None: The id of the entity, which is unique within its entity type.

open_in_browser(): Opens the OpenML web page corresponding to this object in your default browser.

property openml_url: str | None: The URL of the object on the server, if it was uploaded, else None.

push_tag(tag: str)

Annotates this entity with a tag on the server.

Parameters:

tag: str: Tag to attach to the entity.

remove_tag(tag: str)

Removes a tag from this entity on the server.

Parameters:

tag: str: Tag to remove from the entity.
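
The two tagging methods can be combined to rename a tag. The sketch below is a hypothetical helper, `replace_tag`, which assumes only that the object exposes `push_tag` and `remove_tag` as documented here. Both calls act on the server, so with a real dataset they presumably need a configured account and network access.

```python
def replace_tag(entity, old_tag, new_tag):
    """Swap ``old_tag`` for ``new_tag`` on an entity.

    Hypothetical helper: ``entity`` is any object exposing ``push_tag`` and
    ``remove_tag``. The new tag is attached first so the entity is never
    left untagged if the second call fails.
    """
    if old_tag == new_tag:
        return  # nothing to change
    entity.push_tag(new_tag)
    entity.remove_tag(old_tag)
```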

retrieve_class_labels(target_name: str = 'class') → None | List[str]

Reads the dataset's ARFF file to determine the class labels.

If the task has no class labels (for example, a regression problem), it returns None. This is necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class names when uploading the results of a run.

Parameters:

target_name: str: Name of the target attribute.

Returns:

list
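
Since get_data yields class indices rather than names, the returned label list can be used to translate indices back to class names. The following is a minimal sketch: `decode_class_indices` is a hypothetical helper, and it assumes each index is a position in the list returned by `retrieve_class_labels`.

```python
def decode_class_indices(dataset, class_indices, target_name="class"):
    """Map integer class indices to class names.

    Hypothetical helper. Assumes each index is a position in the list
    returned by ``dataset.retrieve_class_labels``.
    """
    labels = dataset.retrieve_class_labels(target_name=target_name)
    if labels is None:
        # No class labels (for example a regression target): return the
        # values unchanged.
        return list(class_indices)
    return [labels[int(i)] for i in class_indices]
```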

classmethod url_for_id(id_: int) → str: Return the OpenML URL for the object of the class entity with the given id.
