openml.datasets.OpenMLDataset
- class openml.datasets.OpenMLDataset(name, description, data_format='arff', cache_format='pickle', dataset_id=None, version=None, creator=None, contributor=None, collection_date=None, upload_date=None, language=None, licence=None, url=None, default_target_attribute=None, row_id_attribute=None, ignore_attribute=None, version_label=None, citation=None, tag=None, visibility=None, original_data_url=None, paper_url=None, update_comment=None, md5_checksum=None, data_file=None, features_file: str | None = None, qualities_file: str | None = None, dataset=None, minio_url: str | None = None, parquet_file: str | None = None)
Dataset object.
Allows fetching and uploading datasets to OpenML.
- Parameters:
- name: str
Name of the dataset.
- description: str
Description of the dataset.
- data_format: str
Format of the dataset, which can be either ‘arff’ or ‘sparse_arff’.
- cache_format: str
Format for caching the dataset, which can be either ‘feather’ or ‘pickle’.
- dataset_id: int, optional
Id autogenerated by the server.
- version: int, optional
Version of this dataset. ‘1’ for original version. Auto-incremented by server.
- creator: str, optional
The person who created the dataset.
- contributor: str, optional
People who contributed to the current version of the dataset.
- collection_date: str, optional
The date the data was originally collected, given by the uploader.
- upload_date: str, optional
The date-time when the dataset was uploaded, generated by server.
- language: str, optional
Language in which the data is represented. Starts with one upper-case letter, the rest lower case, e.g. ‘English’.
- licence: str, optional
License of the data.
- url: str, optional
Valid URL that points to the actual data file. The file can be on the OpenML server or another dataset repository.
- default_target_attribute: str, optional
The default target attribute, if it exists. Can have multiple values, comma separated.
- row_id_attribute: str, optional
The attribute that represents the row-id column, if present in the dataset.
- ignore_attribute: str | list, optional
Attributes that should be excluded in modelling, such as identifiers and indexes.
- version_label: str, optional
Version label provided by user. Can be a date, hash, or some other type of id.
- citation: str, optional
Reference(s) that should be cited when building on this data.
- tag: str, optional
Tags describing the dataset.
- visibility: str, optional
Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.
- original_data_url: str, optional
For derived data, the URL of the original dataset.
- paper_url: str, optional
Link to a paper describing the dataset.
- update_comment: str, optional
An explanation for when the dataset is uploaded.
- md5_checksum: str, optional
MD5 checksum to check if the dataset is downloaded without corruption.
- data_file: str, optional
Path to where the dataset is located.
- features: dict, optional
A dictionary of dataset features, which maps a feature index to an OpenMLDataFeature.
- qualities: dict, optional
A dictionary of dataset qualities, which maps a quality name to a quality value.
- dataset: string, optional
Serialized ARFF dataset string.
- minio_url: string, optional
URL to the MinIO bucket with dataset files.
- parquet_file: string, optional
Path to the local parquet file.
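The md5_checksum field lends itself to an integrity check after a download. The following is a minimal sketch of that general technique using only the standard library. The helper name file_matches_md5 is hypothetical and not part of the openml package, and which file the server computes the checksum over is an assumption to verify against the dataset metadata.

```python
import hashlib
import os
import tempfile


def file_matches_md5(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file at `path` has the given MD5 hex digest.

    Hypothetical helper, not part of the openml package.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large dataset files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Self-contained demonstration on a temporary file instead of a real download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"@RELATION demo\n")
    demo_path = tmp.name

expected = hashlib.md5(b"@RELATION demo\n").hexdigest()
checksum_ok = file_matches_md5(demo_path, expected)
checksum_bad = file_matches_md5(demo_path, "0" * 32)
os.remove(demo_path)
```

In practice, expected would come from the dataset's md5_checksum attribute and path from wherever the data file was saved.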
- get_data(target: List[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False, dataset_format: str = 'dataframe') → Tuple[ndarray | DataFrame | csr_matrix, ndarray | DataFrame | None, List[bool], List[str]]
Returns dataset content as dataframes or sparse matrices.
- Parameters:
- target: string, List[str] or None (default=None)
Name of target column to separate from the data. Splitting multiple columns is currently not supported.
- include_row_id: boolean (default=False)
Whether to include row ids in the returned dataset.
- include_ignore_attribute: boolean (default=False)
Whether to include columns that are marked as “ignore” on the server in the dataset.
- dataset_format: string (default=‘dataframe’)
The format of the returned dataset. If ‘array’, the returned dataset will be a NumPy array or a SciPy sparse matrix. Support for ‘array’ will be removed in 0.15. If ‘dataframe’, the returned dataset will be a Pandas DataFrame.
- Returns:
- X: ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns)
Dataset
- y: ndarray or pd.Series, shape (n_samples, ) or None
Target column
- categorical_indicator: boolean ndarray
Mask that indicates categorical features.
- attribute_names: List[str]
List of attribute names.
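As an illustration of how the four return values fit together, here is a minimal sketch that unpacks them and picks out the categorical column names. The helper categorical_columns and the class _StandInDataset are hypothetical: the stand-in only mimics the shape of the return value with plain lists so the sketch runs without a server connection, whereas a real OpenMLDataset returns pandas objects. It is also an assumption here that categorical_indicator and attribute_names line up one-to-one with the columns of X.

```python
def categorical_columns(dataset, target_name):
    """Return (X, y, names of the categorical feature columns).

    `dataset` is any object exposing the documented get_data() method.
    """
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=target_name, dataset_format="dataframe"
    )
    # Assumption: mask and names are aligned with the columns of X.
    categorical = [
        name
        for name, is_categorical in zip(attribute_names, categorical_indicator)
        if is_categorical
    ]
    return X, y, categorical


class _StandInDataset:
    """Hypothetical offline stand-in for an OpenMLDataset."""

    def get_data(self, target=None, include_row_id=False,
                 include_ignore_attribute=False, dataset_format="dataframe"):
        X = [[5.1, "a"], [4.9, "b"]]  # two rows, two feature columns
        y = ["yes", "no"] if target is not None else None
        return X, y, [False, True], ["length", "kind"]


X, y, categorical = categorical_columns(_StandInDataset(), "class")
```

With a real dataset object, the same helper would be called as categorical_columns(dataset, dataset.default_target_attribute), provided the dataset defines a single default target.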
- get_features_by_type(data_type, exclude=None, exclude_ignore_attribute=True, exclude_row_id_attribute=True)
Return indices of features of a given type, e.g. all nominal features. Optional parameters allow excluding various features by index or ontology.
- Parameters:
- data_type: str
The data type to return (e.g., nominal, numeric, date, string).
- exclude: list(int)
Indices to exclude (and adapt the return values as if these indices are not present).
- exclude_ignore_attribute: bool
Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present).
- exclude_row_id_attribute: bool
Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present).
- Returns:
- result: list
A list of indices that have the specified data type.
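The phrase "adapt the return values as if these indices are not present" can be read as an index shift: every excluded column that precedes a matching column lowers that column's reported index by one. The sketch below illustrates this reading on a plain list of type names. The function indices_of_type is a hypothetical re-implementation for illustration only, not the library's code, and the feature types are made up.

```python
def indices_of_type(feature_types, data_type, exclude=None):
    """Illustrative re-implementation, not the library's code.

    feature_types: list of type names, one per column, in column order.
    exclude: column indices to drop before numbering the remaining columns.
    """
    excluded = set(exclude or [])
    result = []
    offset = 0  # number of excluded columns seen so far
    for index, feature_type in enumerate(feature_types):
        if index in excluded:
            offset += 1
        elif feature_type == data_type:
            # Shift left by the number of excluded columns before this one.
            result.append(index - offset)
    return result


types = ["numeric", "nominal", "numeric", "nominal", "string"]
all_nominal = indices_of_type(types, "nominal")
nominal_without_first = indices_of_type(types, "nominal", exclude=[1])
nominal_without_zero = indices_of_type(types, "nominal", exclude=[0])
```

Under this reading, the returned indices can be used directly on a data matrix from which the excluded columns have already been dropped.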
- property id: int | None
The id of the entity; it is unique for its entity type.
- open_in_browser()
Opens the OpenML web page corresponding to this object in your default browser.
- property openml_url: str | None
The URL of the object on the server, if it was uploaded, else None.
- push_tag(tag: str)
Annotates this entity with a tag on the server.
- Parameters:
- tag: str
Tag to attach to the entity.
- remove_tag(tag: str)
Removes a tag from this entity on the server.
- Parameters:
- tag: str
Tag to remove from the entity.
- retrieve_class_labels(target_name: str = 'class') → None | List[str]
Reads the dataset's ARFF to determine the class labels.
If the task has no class labels (for example, a regression problem), it returns None. This is necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class names when uploading the results of a run.
- Parameters:
- target_name: str
Name of the target attribute.
- Returns:
- list
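To make the index-to-name relationship concrete, the following minimal sketch maps integer class indices back to label strings. The helper indices_to_labels is hypothetical, the label values are illustrative, and it is an assumption that the position of each label in the returned list matches the class index used in the data.

```python
def indices_to_labels(predicted_indices, class_labels):
    """Map integer class indices to label strings.

    class_labels is the list returned by retrieve_class_labels(), or None
    for targets without class labels (e.g. regression), in which case the
    predictions are returned unchanged.
    """
    if class_labels is None:
        return list(predicted_indices)
    return [class_labels[i] for i in predicted_indices]


labels = ["setosa", "versicolor", "virginica"]  # illustrative values
named = indices_to_labels([2, 0, 1, 0], labels)
unchanged = indices_to_labels([0.5, 1.25], None)
```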
- classmethod url_for_id(id_: int) → str
Return the OpenML URL for the object of the class entity with the given id.