Datasets¶

How to list and download datasets.

# License: BSD 3-Clauses

import openml
import pandas as pd
from openml.datasets import edit_dataset, fork_dataset, get_dataset

Exercise 0¶

List datasets
- Use the output_format parameter to select output type
- Default gives ‘dict’ (other option: ‘dataframe’, see below)

Note: list_datasets will return a pandas dataframe by default from 0.15. When using openml-python 0.14, list_datasets will warn you to use output_format=’dataframe’.

datalist = openml.datasets.list_datasets(output_format="dataframe")
datalist = datalist[["did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses"]]

print(f"First 10 of {len(datalist)} datasets...")
datalist.head(n=10)

# The same can be done with lesser lines of code
openml_df = openml.datasets.list_datasets(output_format="dataframe")
openml_df.head(n=10)

First 10 of 5759 datasets...

	did	name	version	uploader	status	format	MajorityClassSize	MaxNominalAttDistinctValues	MinorityClassSize	NumberOfClasses	NumberOfFeatures	NumberOfInstances	NumberOfInstancesWithMissingValues	NumberOfMissingValues	NumberOfNumericFeatures	NumberOfSymbolicFeatures
2	2	anneal	1	1	active	ARFF	684.0	7.0	8.0	5.0	39.0	898.0	898.0	22175.0	6.0	33.0
3	3	kr-vs-kp	1	1	active	ARFF	1669.0	3.0	1527.0	2.0	37.0	3196.0	0.0	0.0	0.0	37.0
4	4	labor	1	1	active	ARFF	37.0	3.0	20.0	2.0	17.0	57.0	56.0	326.0	8.0	9.0
5	5	arrhythmia	1	1	active	ARFF	245.0	13.0	2.0	13.0	280.0	452.0	384.0	408.0	206.0	74.0
6	6	letter	1	1	active	ARFF	813.0	26.0	734.0	26.0	17.0	20000.0	0.0	0.0	16.0	1.0
7	7	audiology	1	1	active	ARFF	57.0	24.0	1.0	24.0	70.0	226.0	222.0	317.0	0.0	70.0
8	8	liver-disorders	1	1	active	ARFF	NaN	NaN	NaN	0.0	6.0	345.0	0.0	0.0	6.0	0.0
9	9	autos	1	1	active	ARFF	67.0	22.0	3.0	6.0	26.0	205.0	46.0	59.0	15.0	11.0
10	10	lymph	1	1	active	ARFF	81.0	8.0	2.0	4.0	19.0	148.0	0.0	0.0	3.0	16.0
11	11	balance-scale	1	1	active	ARFF	288.0	3.0	49.0	3.0	5.0	625.0	0.0	0.0	4.0	1.0

Exercise 1¶

Find datasets with more than 10000 examples.
Find a dataset called ‘eeg_eye_state’.
Find all datasets with more than 50 classes.

datalist[datalist.NumberOfInstances > 10000].sort_values(["NumberOfInstances"]).head(n=20)
""
datalist.query('name == "eeg-eye-state"')
""
datalist.query("NumberOfClasses > 50")

	did	name	NumberOfInstances	NumberOfFeatures	NumberOfClasses
1491	1491	one-hundred-plants-margin	1600.0	65.0	100.0
1492	1492	one-hundred-plants-shape	1600.0	65.0	100.0
1493	1493	one-hundred-plants-texture	1599.0	65.0	100.0
4552	4552	BachChoralHarmony	5665.0	17.0	102.0
41167	41167	dionis	416188.0	61.0	355.0
41169	41169	helena	65196.0	28.0	100.0
41960	41960	seattlecrime6	523590.0	8.0	144.0
41983	41983	CIFAR-100	60000.0	3073.0	100.0
42078	42078	beer_reviews	1586614.0	13.0	104.0
42087	42087	beer_reviews	1586614.0	13.0	104.0
42088	42088	beer_reviews	1586614.0	13.0	104.0
42089	42089	vancouver_employee	1586614.0	13.0	104.0
42123	42123	article_influence	3615.0	7.0	3169.0
42223	42223	dataset-autoHorse_fixed	201.0	69.0	186.0
42396	42396	aloi	108000.0	129.0	1000.0
43723	43723	Toronto-Apartment-Rental-Price	1124.0	7.0	188.0
44282	44282	Meta_Album_PLK_Mini	3440.0	3.0	86.0
44283	44283	Meta_Album_FLW_Mini	4080.0	3.0	102.0
44284	44284	Meta_Album_SPT_Mini	2920.0	3.0	73.0
44285	44285	Meta_Album_BRD_Mini	12600.0	3.0	315.0
44288	44288	Meta_Album_TEX_Mini	2560.0	3.0	64.0
44289	44289	Meta_Album_CRS_Mini	7840.0	3.0	196.0
44292	44292	Meta_Album_INS_2_Mini	4080.0	3.0	102.0
44298	44298	Meta_Album_DOG_Mini	4800.0	3.0	120.0
44304	44304	Meta_Album_TEX_ALOT_Mini	10000.0	3.0	250.0
44306	44306	Meta_Album_INS_Mini	4160.0	3.0	104.0
44317	44317	Meta_Album_PLK_Extended	473273.0	3.0	102.0
44318	44318	Meta_Album_FLW_Extended	8189.0	3.0	102.0
44319	44319	Meta_Album_SPT_Extended	10416.0	3.0	73.0
44320	44320	Meta_Album_BRD_Extended	49054.0	3.0	315.0
44322	44322	Meta_Album_TEX_Extended	8675.0	3.0	64.0
44323	44323	Meta_Album_CRS_Extended	16185.0	3.0	196.0
44326	44326	Meta_Album_INS_2_Extended	75222.0	3.0	102.0
44331	44331	Meta_Album_DOG_Extended	20480.0	3.0	120.0
44337	44337	Meta_Album_TEX_ALOT_Extended	25000.0	3.0	250.0
44340	44340	Meta_Album_INS_Extended	170506.0	3.0	117.0
44533	44533	dionis_seed_0_nrows_2000_nclasses_10_ncols_100...	2000.0	61.0	355.0
44534	44534	dionis_seed_1_nrows_2000_nclasses_10_ncols_100...	2000.0	61.0	355.0
44535	44535	dionis_seed_2_nrows_2000_nclasses_10_ncols_100...	2000.0	61.0	355.0
44536	44536	dionis_seed_3_nrows_2000_nclasses_10_ncols_100...	2000.0	61.0	355.0
44537	44537	dionis_seed_4_nrows_2000_nclasses_10_ncols_100...	2000.0	61.0	355.0
44728	44728	helena_seed_0_nrows_2000_nclasses_10_ncols_100...	2000.0	28.0	100.0
44729	44729	helena_seed_1_nrows_2000_nclasses_10_ncols_100...	2000.0	28.0	100.0
44730	44730	helena_seed_2_nrows_2000_nclasses_10_ncols_100...	2000.0	28.0	100.0
44731	44731	helena_seed_3_nrows_2000_nclasses_10_ncols_100...	2000.0	28.0	100.0
44732	44732	helena_seed_4_nrows_2000_nclasses_10_ncols_100...	2000.0	28.0	100.0
45049	45049	MD_MIX_Mini_Copy	28240.0	69.0	706.0
45102	45102	dailybike	731.0	13.0	606.0
45103	45103	dailybike	731.0	13.0	606.0
45104	45104	PLK_Mini_Copy	3440.0	3.0	86.0
45274	45274	PASS	1439588.0	7.0	94137.0
45569	45569	DBLP-QuAD	10000.0	10.0	9999.0
45923	45923	IndoorScenes	15620.0	3.0	67.0
45936	45936	IndoorScenes	15620.0	3.0	67.0
46346	46346	tiny-imagenet-200	100000.0	3.0	200.0
46347	46347	tiniest-imagenet-200	4000.0	2.0	200.0

Download datasets¶

# This is done based on the dataset ID.
dataset = openml.datasets.get_dataset(dataset_id="eeg-eye-state", version=1)

# Print a summary
print(
    f"This is dataset '{dataset.name}', the target feature is "
    f"'{dataset.default_target_attribute}'"
)
print(f"URL: {dataset.url}")
print(dataset.description[:500])

This is dataset 'eeg-eye-state', the target feature is 'Class'
URL: https://api.openml.org/data/v1/download/1587924/eeg-eye-state.arff
**Author**: Oliver Roesler
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after

Get the actual data.

openml-python returns data as pandas dataframes (stored in the eeg variable below), and also some additional metadata that we don’t care about right now.

eeg, *_ = dataset.get_data()

You can optionally choose to have openml separate out a column from the dataset. In particular, many datasets for supervised problems have a set default_target_attribute which may help identify the target variable.

X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(X.head())
print(X.info())

        V1       V2       V3       V4  ...      V11      V12      V13      V14
0  4329.23  4009.23  4289.23  4148.21  ...  4211.28  4280.51  4635.90  4393.85
1  4324.62  4004.62  4293.85  4148.72  ...  4207.69  4279.49  4632.82  4384.10
2  4327.69  4006.67  4295.38  4156.41  ...  4206.67  4282.05  4628.72  4389.23
3  4328.72  4011.79  4296.41  4155.90  ...  4210.77  4287.69  4632.31  4396.41
4  4326.15  4011.79  4292.31  4151.28  ...  4212.82  4288.21  4632.82  4398.46

[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14980 entries, 0 to 14979
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      14980 non-null  float64
 1   V2      14980 non-null  float64
 2   V3      14980 non-null  float64
 3   V4      14980 non-null  float64
 4   V5      14980 non-null  float64
 5   V6      14980 non-null  float64
 6   V7      14980 non-null  float64
 7   V8      14980 non-null  float64
 8   V9      14980 non-null  float64
 9   V10     14980 non-null  float64
 10  V11     14980 non-null  float64
 11  V12     14980 non-null  float64
 12  V13     14980 non-null  float64
 13  V14     14980 non-null  float64
dtypes: float64(14)
memory usage: 1.6 MB
None

Sometimes you only need access to a dataset’s metadata. In those cases, you can download the dataset without downloading the data file. The dataset object can be used as normal. Whenever you use any functionality that requires the data, such as get_data, the data will be downloaded. Starting from 0.15, not downloading data will be the default behavior instead. The data will be downloading automatically when you try to access it through openml objects, e.g., using dataset.features.

dataset = openml.datasets.get_dataset(dataset_id="eeg-eye-state", version=1, download_data=False)

Exercise 2¶

Explore the data visually.

eegs = eeg.sample(n=1000)
_ = pd.plotting.scatter_matrix(
    X.iloc[:100, :4],
    c=y[:100],
    figsize=(10, 10),
    marker="o",
    hist_kwds={"bins": 20},
    alpha=0.8,
    cmap="plasma",
)

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/plotting/_matplotlib/misc.py:97: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  ax.scatter(

Edit a created dataset¶

This example uses the test server, to avoid editing a dataset on the main server.

Warning

This example uploads data. For that reason, this example connects to the test server at test.openml.org. This prevents the main server from crowding with example datasets, tasks, runs, and so on. The use of this test server can affect behaviour and performance of the OpenML-Python API.

openml.config.start_using_configuration_for_example()

/home/runner/work/openml-python/openml-python/examples/30_extended/datasets_tutorial.py:114: UserWarning: Switching to the test server https://test.openml.org/api/v1/xml to not upload results to the live server. Using the test server may result in reduced performance of the API!
  openml.config.start_using_configuration_for_example()

Edit non-critical fields, allowed for all authorized users: description, creator, contributor, collection_date, language, citation, original_data_url, paper_url

desc = (
    "This data sets consists of 3 different types of irises' "
    "(Setosa, Versicolour, and Virginica) petal and sepal length,"
    " stored in a 150x4 numpy.ndarray"
)
did = 128
data_id = edit_dataset(
    did,
    description=desc,
    creator="R.A.Fisher",
    collection_date="1937",
    citation="The use of multiple measurements in taxonomic problems",
    language="English",
)
edited_dataset = get_dataset(data_id)
print(f"Edited dataset ID: {data_id}")

Edited dataset ID: 128

Editing critical fields (default_target_attribute, row_id_attribute, ignore_attribute) is allowed only for the dataset owner. Further, critical fields cannot be edited if the dataset has any tasks associated with it. To edit critical fields of a dataset (without tasks) owned by you, configure the API key: openml.config.apikey = ‘FILL_IN_OPENML_API_KEY’ This example here only shows a failure when trying to work on a dataset not owned by you:

try:
    data_id = edit_dataset(1, default_target_attribute="shape")
except openml.exceptions.OpenMLServerException as e:
    print(e)

https://test.openml.org/api/v1/xml/data/edit returned code 1065: Critical features default_target_attribute, row_id_attribute and ignore_attribute can be edited only by the owner. Fork the dataset if changes are required. - None

Fork dataset¶

Used to create a copy of the dataset with you as the owner. Use this API only if you are unable to edit the critical fields (default_target_attribute, ignore_attribute, row_id_attribute) of a dataset through the edit_dataset API. After the dataset is forked, you can edit the new version of the dataset using edit_dataset.

data_id = fork_dataset(1)
print(data_id)
data_id = edit_dataset(data_id, default_target_attribute="shape")
print(f"Forked dataset ID: {data_id}")

openml.config.stop_using_configuration_for_example()

886
Forked dataset ID: 886

Total running time of the script: (0 minutes 10.710 seconds)

Gallery generated by Sphinx-Gallery