openml.datasets.OpenMLDataset
- class openml.datasets.OpenMLDataset(name: str, description: str | None, data_format: Literal['arff', 'sparse_arff'] = 'arff', cache_format: Literal['feather', 'pickle'] = 'pickle', dataset_id: int | None = None, version: int | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, upload_date: str | None = None, language: str | None = None, licence: str | None = None, url: str | None = None, default_target_attribute: str | None = None, row_id_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, version_label: str | None = None, citation: str | None = None, tag: str | None = None, visibility: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, md5_checksum: str | None = None, data_file: str | None = None, features_file: str | None = None, qualities_file: str | None = None, dataset: str | None = None, parquet_url: str | None = None, parquet_file: str | None = None)
Dataset object.
Allows fetching and uploading datasets to OpenML.
- Parameters:
- name : str
Name of the dataset.
- description : str
Description of the dataset.
- data_format : str
Format of the dataset, either ‘arff’ or ‘sparse_arff’.
- cache_format : str
Format for caching the dataset, either ‘feather’ or ‘pickle’.
- dataset_id : int, optional
Id autogenerated by the server.
- version : int, optional
Version of this dataset. ‘1’ for the original version. Auto-incremented by the server.
- creator : str, optional
The person who created the dataset.
- contributor : str, optional
People who contributed to the current version of the dataset.
- collection_date : str, optional
The date the data was originally collected, given by the uploader.
- upload_date : str, optional
The date-time when the dataset was uploaded, generated by the server.
- language : str, optional
Language in which the data is represented. Starts with one upper-case letter, rest lower case, e.g. ‘English’.
- licence : str, optional
License of the data.
- url : str, optional
Valid URL that points to the actual data file. The file can be on the OpenML server or another dataset repository.
- default_target_attribute : str, optional
The default target attribute, if it exists. Can have multiple values, comma separated.
- row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the dataset.
- ignore_attribute : str | list, optional
Attributes that should be excluded from modelling, such as identifiers and indexes.
- version_label : str, optional
Version label provided by the user. Can be a date, hash, or some other kind of id.
- citation : str, optional
Reference(s) that should be cited when building on this data.
- tag : str, optional
Tags describing the dataset.
- visibility : str, optional
Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.
- original_data_url : str, optional
For derived data, the URL of the original dataset.
- paper_url : str, optional
Link to a paper describing the dataset.
- update_comment : str, optional
An explanation of what was updated when the dataset is uploaded.
- md5_checksum : str, optional
MD5 checksum used to verify that the dataset was downloaded without corruption.
- data_file : str, optional
Path to where the dataset is located.
- features_file : str, optional
Path to a file describing the dataset features, which map a feature index to an OpenMLDataFeature.
- qualities_file : str, optional
Path to a file describing the dataset qualities, which map a quality name to a quality value.
- dataset : str, optional
Serialized ARFF dataset string.
- parquet_url : str, optional
URL of the storage location where the dataset files are hosted; this can be a MinIO bucket URL. If specified, the data will be accessed from this URL when reading the files.
- parquet_file : str, optional
Path to the local parquet file.
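Since ignore_attribute accepts either a single string or a list of strings, callers often normalize it before further processing. A minimal sketch of that normalization (the helper name and attribute names are invented for illustration, not part of the library):

```python
# Hypothetical helper: normalize the ignore_attribute argument, which the
# OpenMLDataset constructor accepts as a single string or a list of strings.
def normalize_ignore(ignore_attribute):
    if ignore_attribute is None:
        return []
    if isinstance(ignore_attribute, str):
        return [ignore_attribute]
    return list(ignore_attribute)

print(normalize_ignore("row_id"))             # a single name becomes a one-element list
print(normalize_ignore(["row_id", "index"]))  # a list passes through unchanged
```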
- property features: dict[int, OpenMLDataFeature]
Get the features of this dataset.
- get_data(target: list[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False, dataset_format: Literal['array', 'dataframe'] = 'dataframe') → tuple[np.ndarray | pd.DataFrame | scipy.sparse.csr_matrix, np.ndarray | pd.DataFrame | None, list[bool], list[str]]
Returns dataset content as dataframes or sparse matrices.
- Parameters:
- target : str, list[str], or None (default=None)
Name of the target column to separate from the data. Splitting multiple columns is currently not supported.
- include_row_id : bool (default=False)
Whether to include row ids in the returned dataset.
- include_ignore_attribute : bool (default=False)
Whether to include columns that are marked as “ignore” on the server in the dataset.
- dataset_format : str (default=’dataframe’)
The format of the returned dataset. If ‘array’, the returned dataset will be a NumPy array or a SciPy sparse matrix. Support for ‘array’ will be removed in 0.15. If ‘dataframe’, the returned dataset will be a Pandas DataFrame.
- Returns:
- X : ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns)
Dataset
- y : ndarray or pd.Series, shape (n_samples,) or None
Target column
- categorical_indicator : boolean ndarray
Mask that indicates categorical features.
- attribute_names : List[str]
List of attribute names.
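To illustrate the shape of the four-tuple, here is a local sketch built from a toy pandas DataFrame rather than a real OpenML dataset (no server call; the column names are invented). It mirrors how get_data separates the target column and reports which remaining columns are categorical:

```python
import pandas as pd

# Toy stand-in for a downloaded dataset; in real use get_data builds this
# DataFrame from the cached OpenML data file.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": pd.Categorical(["setosa", "setosa", "virginica"]),
})

target = "species"
y = df[target]                  # target column (pd.Series)
X = df.drop(columns=[target])   # remaining feature columns
categorical_indicator = [
    isinstance(dtype, pd.CategoricalDtype) for dtype in X.dtypes
]
attribute_names = list(X.columns)

print(X.shape, categorical_indicator, attribute_names)
```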
- get_features_by_type(data_type: str, exclude: list[str] | None = None, exclude_ignore_attribute: bool = True, exclude_row_id_attribute: bool = True) → list[int]
Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.
- Parameters:
- data_type : str
The data type to return (e.g., nominal, numeric, date, string)
- exclude : list(int)
List of columns to exclude from the return value
- exclude_ignore_attribute : bool
Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present)
- exclude_row_id_attribute : bool
Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present)
- Returns:
- result : list
A list of indices that have the specified data type.
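The filtering itself is straightforward index selection. A local sketch using a plain dict in place of the real OpenMLDataFeature objects (the feature types below are invented):

```python
# Hypothetical stand-in for a dataset's features mapping: index -> data type.
features = {0: "numeric", 1: "nominal", 2: "string", 3: "nominal"}

def indices_of_type(features, data_type, exclude=None):
    """Return sorted indices whose data type matches, skipping excluded ones."""
    excluded = set(exclude or [])
    return [i for i, t in sorted(features.items())
            if t == data_type and i not in excluded]

print(indices_of_type(features, "nominal"))               # -> [1, 3]
print(indices_of_type(features, "nominal", exclude=[1]))  # -> [3]
```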
- property id: int | None
Get the dataset numeric id.
- open_in_browser() → None
Opens the OpenML web page corresponding to this object in your default browser.
- property openml_url: str | None
The URL of the object on the server, if it was uploaded, else None.
- publish() → OpenMLBase
Publish the object on the OpenML server.
- push_tag(tag: str) → None
Annotates this entity with a tag on the server.
- Parameters:
- tag : str
Tag to attach to the dataset.
- property qualities: dict[str, float] | None
Get the qualities of this dataset.
- remove_tag(tag: str) → None
Removes a tag from this entity on the server.
- Parameters:
- tag : str
Tag to remove from the dataset.
- retrieve_class_labels(target_name: str = 'class') → None | list[str]
Reads the dataset's ARFF to determine the class labels.
If the task has no class labels (for example, a regression problem), it returns None. Necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class name when uploading the results of a run.
- Parameters:
- target_name : str
Name of the target attribute
- Returns:
- list of str, or None
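The lookup amounts to finding the target attribute's declaration in the ARFF header and reading the nominal values between the braces. A minimal sketch on a hand-written attribute line (the line and labels are invented; the real method reads the dataset's ARFF file):

```python
import re

# Invented ARFF header line; retrieve_class_labels scans the dataset's
# actual ARFF file for the declaration of the target attribute.
arff_line = "@ATTRIBUTE class {setosa, versicolor, virginica}"

match = re.match(r"@ATTRIBUTE\s+(\S+)\s+\{(.*)\}", arff_line, re.IGNORECASE)
name, body = match.groups()
labels = [value.strip() for value in body.split(",")]

print(name, labels)
```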
- classmethod url_for_id(id_: int) → str
Return the OpenML URL for the object of the class entity with the given id.
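For datasets on the public server, the resulting URL follows the /d/<id> pattern. A hedged sketch that hard-codes the public base URL (the real method derives the base from the configured server, so treat the constant here as an assumption rather than library behavior):

```python
# Assumed public base URL; the library reads this from its server
# configuration instead of a hard-coded constant.
OPENML_BASE = "https://www.openml.org"

def dataset_url(dataset_id: int) -> str:
    return f"{OPENML_BASE}/d/{dataset_id}"

print(dataset_url(61))  # -> https://www.openml.org/d/61
```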