openml.OpenMLDataset

class openml.OpenMLDataset(name, description, format=None, data_format='arff', dataset_id=None, version=None, creator=None, contributor=None, collection_date=None, upload_date=None, language=None, licence=None, url=None, default_target_attribute=None, row_id_attribute=None, ignore_attribute=None, version_label=None, citation=None, tag=None, visibility=None, original_data_url=None, paper_url=None, update_comment=None, md5_checksum=None, data_file=None, features=None, qualities=None, dataset=None)

Dataset object.

Allows fetching and uploading datasets to OpenML.

Parameters
name: str

Name of the dataset.

description: str

Description of the dataset.

format: str

Format of the dataset which can be either ‘arff’ or ‘sparse_arff’.

dataset_id: int, optional

Id autogenerated by the server.

version: int, optional

Version of this dataset. ‘1’ for original version. Auto-incremented by server.

creator: str, optional

The person who created the dataset.

contributor: str, optional

People who contributed to the current version of the dataset.

collection_date: str, optional

The date the data was originally collected, given by the uploader.

upload_date: str, optional

The date-time when the dataset was uploaded, generated by server.

language: str, optional

Language in which the data is represented. Starts with one uppercase letter, the rest lowercase, e.g. ‘English’.

licence: str, optional

License of the data.

url: str, optional

Valid URL that points to the actual data file. The file can be on the OpenML server or in another dataset repository.

default_target_attribute: str, optional

The default target attribute, if it exists. Can have multiple values, comma separated.

row_id_attribute: str, optional

The attribute that represents the row-id column, if present in the dataset.

ignore_attribute: str | list, optional

Attributes that should be excluded in modelling, such as identifiers and indexes.

version_label: str, optional

Version label provided by user. Can be a date, hash, or some other type of id.

citation: str, optional

Reference(s) that should be cited when building on this data.

tag: str, optional

Tags describing the dataset.

visibility: str, optional

Who can see the dataset. Typical values: ‘Everyone’, ‘All my friends’, ‘Only me’. Can also be any of the user’s circles.

original_data_url: str, optional

For derived data, the URL of the original dataset.

paper_url: str, optional

Link to a paper describing the dataset.

update_comment: str, optional

An explanation for when the dataset is uploaded.

status: str, optional

Whether the dataset is active.

md5_checksum: str, optional

MD5 checksum used to verify that the dataset was downloaded without corruption.

data_file: str, optional

Path to where the dataset is located.

features: dict, optional

A dictionary of dataset features, which maps a feature index to an OpenMLDataFeature.

qualities: dict, optional

A dictionary of dataset qualities, which maps a quality name to a quality value.

dataset: str, optional

Serialized ARFF dataset string.
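
As described above, default_target_attribute may hold several comma-separated names, and ignore_attribute may be given as either a string or a list. The sketch below shows one way calling code might normalize such values into plain lists before using them. The helper names split_targets and as_list are hypothetical and are not part of the openml package.

```python
from typing import List, Optional, Union


def split_targets(default_target_attribute: Optional[str]) -> List[str]:
    """Split a comma-separated target string into individual attribute names."""
    if not default_target_attribute:
        return []
    parts = default_target_attribute.split(",")
    return [name.strip() for name in parts if name.strip()]


def as_list(ignore_attribute: Union[str, List[str], None]) -> List[str]:
    """Return ignore_attribute as a list, whether given as a str, a list, or None."""
    if ignore_attribute is None:
        return []
    if isinstance(ignore_attribute, str):
        return [ignore_attribute]
    return list(ignore_attribute)


# Stand-in values; with a real dataset these would come from its attributes.
print(split_targets("class, weight"))
print(as_list("id"))
```

Normalizing once at the boundary keeps later code free of repeated type checks on these two fields.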

get_data(self, target: Union[List[str], str, NoneType] = None, include_row_id: bool = False, include_ignore_attribute: bool = False, dataset_format: str = 'dataframe') → Tuple[Union[numpy.ndarray, pandas.core.frame.DataFrame, scipy.sparse.csr.csr_matrix], Union[numpy.ndarray, pandas.core.frame.DataFrame, NoneType], List[bool], List[str]]

Returns the dataset content as a dataframe, an array, or a sparse matrix.

Parameters
target: string, List[str] or None (default=None)

Name of target column to separate from the data. Splitting multiple columns is currently not supported.

include_row_id: boolean (default=False)

Whether to include row ids in the returned dataset.

include_ignore_attribute: boolean (default=False)

Whether to include columns that are marked as “ignore” on the server in the dataset.

dataset_format: string (default='dataframe')

The format of the returned dataset. If 'array', the returned dataset will be a NumPy array or a SciPy sparse matrix. If 'dataframe', the returned dataset will be a Pandas DataFrame or SparseDataFrame.

Returns
X: ndarray, dataframe, or sparse matrix, shape (n_samples, n_columns)

Dataset

y: ndarray or pd.Series, shape (n_samples, ) or None

Target column

categorical_indicator: boolean ndarray

Mask that indicates categorical features.

attribute_names: List[str]

List of attribute names.
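
The four return values are typically unpacked together, and categorical_indicator lines up position by position with attribute_names. The sketch below uses that pairing to separate categorical column names from the rest. The helper split_column_names and the wrapper load_and_split are hypothetical names. The wrapper assumes the openml package is installed, the server is reachable, and the dataset has a single default target attribute; it is defined here but not called.

```python
from typing import List, Tuple


def split_column_names(
    categorical_indicator: List[bool], attribute_names: List[str]
) -> Tuple[List[str], List[str]]:
    """Partition attribute names into (categorical, other) using the mask."""
    if len(categorical_indicator) != len(attribute_names):
        raise ValueError("mask and attribute names must have the same length")
    pairs = list(zip(attribute_names, categorical_indicator))
    categorical = [name for name, is_cat in pairs if is_cat]
    other = [name for name, is_cat in pairs if not is_cat]
    return categorical, other


def load_and_split(dataset_id: int):
    """Fetch a dataset and split its column names by type (needs openml and network)."""
    import openml  # imported lazily so the helper above runs without openml installed

    dataset = openml.datasets.get_dataset(dataset_id)
    X, y, categorical_indicator, attribute_names = dataset.get_data(
        target=dataset.default_target_attribute
    )
    names = split_column_names(list(categorical_indicator), attribute_names)
    return X, y, names


# Stand-in values shaped like the last two return values of get_data.
cat_names, other_names = split_column_names(
    [True, False, False], ["outlook", "temperature", "humidity"]
)
print(cat_names, other_names)
```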

get_features_by_type(self, data_type, exclude=None, exclude_ignore_attribute=True, exclude_row_id_attribute=True)

Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.

Parameters
data_type: str

The data type to return (e.g., nominal, numeric, date, string)

exclude: list(int)

Indices to exclude (and adapt the return values as if these indices are not present)

exclude_ignore_attribute: bool

Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present)

exclude_row_id_attribute: bool

Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present)

Returns
result: list

A list of indices that have the specified data type.
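
The phrase "adapt the return values as if these indices are not present" means that the returned indices refer to column positions after the excluded columns have been dropped. The sketch below illustrates one reading of that renumbering on a plain list of type names. The function indices_of_type is hypothetical and is not the library's implementation.

```python
from typing import List, Optional, Sequence


def indices_of_type(
    feature_types: Sequence[str],
    data_type: str,
    exclude: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return positions of features of data_type, renumbered as if the
    excluded columns had been removed."""
    excluded = set(exclude or [])
    result = []
    offset = 0  # number of excluded columns seen so far
    for idx, feature_type in enumerate(feature_types):
        if idx in excluded:
            offset += 1
        elif feature_type == data_type:
            result.append(idx - offset)
    return result


types = ["numeric", "nominal", "numeric", "nominal"]
print(indices_of_type(types, "nominal"))
print(indices_of_type(types, "nominal", exclude=[0]))
```

With column 0 excluded, every later column shifts left by one, so the nominal features originally at positions 1 and 3 are reported at positions 0 and 2.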

publish(self)

Publish the dataset on the OpenML server.

Upload the dataset description and dataset content to OpenML.

Returns
dataset_id: int

Id of the dataset uploaded to the server.

push_tag(self, tag)

Annotates this dataset with a tag on the server.

Parameters
tag: str

Tag to attach to the dataset.

remove_tag(self, tag)

Removes a tag from this dataset on the server.

Parameters
tag: str

Tag to remove from the dataset.

retrieve_class_labels(self, target_name: str = 'class') → Union[NoneType, List[str]]

Reads the dataset's ARFF file to determine the class labels.

If the task has no class labels (for example, a regression problem), it returns None. This method is necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class names when uploading the results of a run.

Parameters
target_name: str

Name of the target attribute

Returns
list of str, or None

The class labels, or None if the target has no class labels.
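
Because the target may hold class indices rather than names, the returned labels can be used to translate indices back into class names before results are reported. The sketch below shows that mapping with stand-in values. The function decode_class_indices is hypothetical, and it assumes the indices are zero-based positions in the returned label list.

```python
from typing import List, Optional, Sequence


def decode_class_indices(
    indices: Sequence[int], class_labels: Optional[List[str]]
) -> List[str]:
    """Map integer class indices to label strings.

    class_labels is the value returned by retrieve_class_labels; None means
    the target has no class labels (for example, a regression target).
    """
    if class_labels is None:
        raise ValueError("target has no class labels")
    return [class_labels[i] for i in indices]


# Stand-in for dataset.retrieve_class_labels(target_name="play").
labels = ["no", "yes"]
print(decode_class_indices([1, 0, 1], labels))
```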