openml.datasets.get_dataset

openml.datasets.get_dataset(dataset_id: int | str, download_data: bool | None = None, version: int | None = None, error_if_multiple: bool = False, cache_format: Literal['pickle', 'feather'] = 'pickle', download_qualities: bool | None = None, download_features_meta_data: bool | None = None, download_all_files: bool = False, force_refresh_cache: bool = False) → OpenMLDataset

Download the OpenML dataset representation and, optionally, the actual data file.

By default, this function is NOT thread/multiprocessing safe because it uses caching. It first checks whether the information has already been downloaded to the cache; if so, the information is loaded from disk instead of being retrieved from the server.

Installing the Python package oslo.concurrency makes get_dataset thread safe.

Alternatively, to make this function thread/multiprocessing safe, initialize the cache first by calling get_dataset(args) once before calling get_dataset(args) many times in parallel. The first call populates the cache, and later calls read from it in a thread/multiprocessing safe way.
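
The cache-priming approach described above can be sketched as follows. This is a minimal illustration, not part of the openml API: the helper name prime_then_fetch_parallel and its parameters are hypothetical, and the function to call is passed in as fetch so that the pattern can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def prime_then_fetch_parallel(fetch, dataset_id, n_calls=8, max_workers=4):
    """Call ``fetch`` once sequentially, then ``n_calls`` times in parallel.

    ``fetch`` is assumed to behave like ``openml.datasets.get_dataset``.
    This helper is a hypothetical illustration, not part of openml.
    """
    # The single sequential call populates the cache before any thread starts.
    primed = fetch(dataset_id)

    # Later calls with the same arguments are expected to read from the cache.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, [dataset_id] * n_calls))

    return primed, results
```

With openml installed, this might be invoked as prime_then_fetch_parallel(openml.datasets.get_dataset, 61), where 61 is only an example dataset ID.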

If the dataset is retrieved by name, a version may be specified. If no version is specified and multiple versions of the dataset exist, the earliest version of the dataset that is still active will be returned. If no version is specified, multiple versions of the dataset exist, and error_if_multiple is set to True, this function will raise an exception.
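
The version rule above can be illustrated with a small stand-alone sketch. It models only the behavior described in this paragraph and is not the library's implementation: select_version, the (version, status) tuples, and the status strings are hypothetical, and counting only active versions as "multiple" is an assumption.

```python
def select_version(candidates, version=None, error_if_multiple=False):
    """Pick a version from ``candidates``, a list of (version, status) tuples.

    Hypothetical helper that mirrors the documented rule; not part of openml.
    """
    if version is not None:
        # An explicit version wins, provided it exists.
        if version not in [v for v, _ in candidates]:
            raise ValueError(f"version {version} not found")
        return version

    # Assumption: only versions whose status is "active" are considered.
    active = sorted(v for v, status in candidates if status == "active")
    if not active:
        raise ValueError("no active version found")
    if error_if_multiple and len(active) > 1:
        raise ValueError("multiple active versions found")

    # Without an explicit version, the earliest active version is returned.
    return active[0]
```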

Parameters:
dataset_id: int or str

ID (int) or name (str) of the dataset to download.

download_data: bool (default=True)

If True, also download the data file. Beware that some datasets are large, which can make the operation noticeably slower. The metadata is retrieved either way. If False, create the OpenMLDataset and only populate it with the metadata. The data may later be retrieved through the OpenMLDataset.get_data method.

version: int, optional (default=None)

Version to retrieve when dataset_id is given as a name. If no version is specified, the earliest version that is still active is retrieved.

error_if_multiple: bool (default=False)

If True, raise an error if multiple datasets are found with matching criteria.

cache_format: str (default='pickle') in {'pickle', 'feather'}

Format for caching the dataset; may be 'feather' or 'pickle'. Note that the default 'pickle' option may load more slowly than 'feather' when the number of rows is very high.

download_qualities: bool (default=True)

Option to download 'qualities' metadata in addition to the minimal dataset description. If True, download and cache the qualities file. If False, create the OpenMLDataset without qualities metadata. The qualities may later be added to the OpenMLDataset through the OpenMLDataset.load_metadata(qualities=True) method.

download_features_meta_data: bool (default=True)

Option to download 'features' metadata in addition to the minimal dataset description. If True, download and cache the features file. If False, create the OpenMLDataset without features metadata. The features metadata may later be added to the OpenMLDataset through the OpenMLDataset.load_metadata(features=True) method.

download_all_files: bool (default=False)

EXPERIMENTAL. Download all files related to the dataset that reside on the server. Useful for datasets which refer to auxiliary files (e.g., meta-album).

force_refresh_cache: bool (default=False)

Force the cache to be refreshed by deleting the cache directory and re-downloading the data. Note that if force_refresh_cache is True, get_dataset is NOT thread/multiprocessing safe, because creating and deleting the cache introduces a race condition, as with the cache in general.

Returns:
dataset: openml.OpenMLDataset

The downloaded dataset.
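
As a sketch of the deferred loading described for download_data, download_qualities and download_features_meta_data, the helper below first requests only the minimal description and then loads the remaining parts on demand. The name fetch_metadata_then_data is hypothetical, get_dataset is passed in as an argument so the flow can be tried with a stand-in, and the return value of get_data() is left opaque on purpose.

```python
def fetch_metadata_then_data(get_dataset, dataset_id):
    """Create the dataset object cheaply, then load the rest explicitly.

    ``get_dataset`` is assumed to behave like ``openml.datasets.get_dataset``.
    This helper is a hypothetical illustration, not part of openml.
    """
    # Request only the minimal dataset description up front.
    dataset = get_dataset(
        dataset_id,
        download_data=False,
        download_qualities=False,
        download_features_meta_data=False,
    )

    # Add the optional metadata later, as described for each parameter.
    dataset.load_metadata(qualities=True)
    dataset.load_metadata(features=True)

    # Retrieve the actual data only when it is needed.
    data = dataset.get_data()
    return dataset, data
```

With openml installed, this might be called as fetch_metadata_then_data(openml.datasets.get_dataset, 61), where 61 is only an example dataset ID.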