Datasets

How to list and download datasets.

# License: BSD 3-Clause

import openml
import pandas as pd

Exercise 0

  • List datasets

    • Use the output_format parameter to select output type

    • The default is ‘dict’; the other option is ‘dataframe’ (see below)

openml_list = openml.datasets.list_datasets()  # returns a dict

# Show a nice table with some key data properties
datalist = pd.DataFrame.from_dict(openml_list, orient='index')
datalist = datalist[[
    'did', 'name', 'NumberOfInstances',
    'NumberOfFeatures', 'NumberOfClasses'
]]

print(f"First 10 of {len(datalist)} datasets...")
datalist.head(n=10)

# The same can be done with fewer lines of code
openml_df = openml.datasets.list_datasets(output_format='dataframe')
openml_df.head(n=10)

Out:

First 10 of 2989 datasets...
did name version uploader status format MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize NumberOfClasses NumberOfFeatures NumberOfInstances NumberOfInstancesWithMissingValues NumberOfMissingValues NumberOfNumericFeatures NumberOfSymbolicFeatures
2 2 anneal 1 1 active ARFF 684.0 7.0 8.0 5.0 39.0 898.0 898.0 22175.0 6.0 33.0
3 3 kr-vs-kp 1 1 active ARFF 1669.0 3.0 1527.0 2.0 37.0 3196.0 0.0 0.0 0.0 37.0
4 4 labor 1 1 active ARFF 37.0 3.0 20.0 2.0 17.0 57.0 56.0 326.0 8.0 9.0
5 5 arrhythmia 1 1 active ARFF 245.0 13.0 2.0 13.0 280.0 452.0 384.0 408.0 206.0 74.0
6 6 letter 1 1 active ARFF 813.0 26.0 734.0 26.0 17.0 20000.0 0.0 0.0 16.0 1.0
7 7 audiology 1 1 active ARFF 57.0 24.0 1.0 24.0 70.0 226.0 222.0 317.0 0.0 70.0
8 8 liver-disorders 1 1 active ARFF NaN NaN NaN 0.0 6.0 345.0 0.0 0.0 6.0 0.0
9 9 autos 1 1 active ARFF 67.0 22.0 3.0 6.0 26.0 205.0 46.0 59.0 15.0 11.0
10 10 lymph 1 1 active ARFF 81.0 8.0 2.0 4.0 19.0 148.0 0.0 0.0 3.0 16.0
11 11 balance-scale 1 1 active ARFF 288.0 3.0 49.0 3.0 5.0 625.0 0.0 0.0 4.0 1.0
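
The conversion from the dict output to a table relies on `pd.DataFrame.from_dict` with `orient='index'`, which turns each outer key into a row. A minimal sketch, assuming the dict is keyed by dataset ID with one inner dict of properties per dataset, as the output above suggests. The dictionary `toy_list` is a hand-written stand-in with two entries copied from that output, and no server call is made:

```python
import pandas as pd

# Hand-written stand-in for the dict returned by list_datasets():
# outer keys are dataset IDs, inner dicts hold the properties (assumed shape).
toy_list = {
    2: {"did": 2, "name": "anneal", "NumberOfInstances": 898.0,
        "NumberOfFeatures": 39.0, "NumberOfClasses": 5.0},
    3: {"did": 3, "name": "kr-vs-kp", "NumberOfInstances": 3196.0,
        "NumberOfFeatures": 37.0, "NumberOfClasses": 2.0},
}

# orient='index' makes each outer key a row label and each inner key a column.
toy_df = pd.DataFrame.from_dict(toy_list, orient="index")

# Keep only a few key properties, as in the tutorial code.
toy_df = toy_df[["did", "name", "NumberOfInstances"]]
print(toy_df)
```

Requesting `output_format='dataframe'` skips this conversion step entirely, which is why the second variant in the tutorial is shorter.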


Exercise 1

  • Find datasets with more than 10000 examples.

  • Find a dataset called ‘eeg-eye-state’.

  • Find all datasets with more than 50 classes.

datalist[datalist.NumberOfInstances > 10000
         ].sort_values(['NumberOfInstances']).head(n=20)
did name NumberOfInstances NumberOfFeatures NumberOfClasses
23515 23515 sulfur 10081.0 7.0 0.0
372 372 internet_usage 10108.0 72.0 46.0
981 981 kdd_internet_usage 10108.0 69.0 2.0
1536 1536 volcanoes-b6 10130.0 4.0 5.0
1531 1531 volcanoes-b1 10176.0 4.0 5.0
1534 1534 volcanoes-b4 10190.0 4.0 5.0
1459 1459 artificial-characters 10218.0 8.0 10.0
1478 1478 har 10299.0 562.0 6.0
1533 1533 volcanoes-b3 10386.0 4.0 5.0
1532 1532 volcanoes-b2 10668.0 4.0 5.0
42183 42183 dataset_sales 10738.0 15.0 0.0
1053 1053 jm1 10885.0 22.0 2.0
1414 1414 Kaggle_bike_sharing_demand_challange 10886.0 12.0 0.0
1044 1044 eye_movements 10936.0 28.0 3.0
32 32 pendigits 10992.0 17.0 10.0
1019 1019 pendigits 10992.0 17.0 2.0
42199 42199 Player_names 11009.0 3.0 NaN
4534 4534 PhishingWebsites 11055.0 31.0 2.0
399 399 ohscal.wc 11162.0 11466.0 10.0
310 310 mammography 11183.0 7.0 2.0


datalist.query('name == "eeg-eye-state"')
did name NumberOfInstances NumberOfFeatures NumberOfClasses
1471 1471 eeg-eye-state 14980.0 15.0 2.0


datalist.query('NumberOfClasses > 50')
did name NumberOfInstances NumberOfFeatures NumberOfClasses
1491 1491 one-hundred-plants-margin 1600.0 65.0 100.0
1492 1492 one-hundred-plants-shape 1600.0 65.0 100.0
1493 1493 one-hundred-plants-texture 1599.0 65.0 100.0
4552 4552 BachChoralHarmony 5665.0 17.0 102.0
41167 41167 dionis 416188.0 61.0 355.0
41169 41169 helena 65196.0 28.0 100.0
41960 41960 seattlecrime6 523590.0 8.0 144.0
42078 42078 beer_reviews 1586614.0 13.0 104.0
42087 42087 beer_reviews 1586614.0 13.0 104.0
42088 42088 beer_reviews 1586614.0 13.0 104.0
42089 42089 vancouver_employee 1586614.0 13.0 104.0
42123 42123 article_influence 3615.0 7.0 3169.0
42222 42222 Suicide_Dataset 27820.0 11.0 101.0
42223 42223 dataset-autoHorse_fixed 201.0 69.0 186.0
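
An exact-match query only succeeds when the name is spelled exactly as stored (here with hyphens rather than underscores). When the spelling is uncertain, a case-insensitive substring search is one possible alternative. A minimal sketch on a small hand-built frame whose rows are copied from the tables above; the name `toy` is illustrative:

```python
import pandas as pd

# Small stand-in for `datalist`, indexed by dataset ID.
toy = pd.DataFrame(
    {
        "name": ["labor", "eeg-eye-state", "har", "dionis"],
        "NumberOfInstances": [57.0, 14980.0, 10299.0, 416188.0],
        "NumberOfClasses": [2.0, 2.0, 6.0, 355.0],
    },
    index=[4, 1471, 1478, 41167],
)

# A boolean mask and query() express the same filter.
by_mask = toy[toy.NumberOfInstances > 10000]
by_query = toy.query("NumberOfInstances > 10000")

# Substring match: finds 'eeg-eye-state' without knowing the separators.
matches = toy[toy["name"].str.contains("eeg", case=False)]
print(matches)
```

The substring search can return several rows for short fragments, so it is best followed by a look at the candidate names before picking a dataset ID.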


Download datasets

# This is done based on the dataset ID.
dataset = openml.datasets.get_dataset(1471)

# Print a summary
print(f"This is dataset '{dataset.name}', the target feature is "
      f"'{dataset.default_target_attribute}'")
print(f"URL: {dataset.url}")
print(dataset.description[:500])

Out:

This is dataset 'eeg-eye-state', the target feature is 'Class'
URL: https://www.openml.org/data/v1/download/1587924/eeg-eye-state.arff
**Author**: Oliver Roesler
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after

Get the actual data.

The dataset can be returned in two formats: as a NumPy array (or a SciPy sparse matrix for sparse data), or as a Pandas DataFrame (or SparseDataFrame). The format is controlled with the parameter dataset_format, which can be either ‘array’ (default) or ‘dataframe’. Let’s first build our dataset from a NumPy array and manually create a dataframe.

X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='array',
    target=dataset.default_target_attribute
)
eeg = pd.DataFrame(X, columns=attribute_names)
eeg['class'] = y
print(eeg[:10])

Out:

            V1           V2           V3  ...          V13          V14  class
0  4329.229980  4009.229980  4289.229980  ...  4635.899902  4393.850098      0
1  4324.620117  4004.620117  4293.850098  ...  4632.819824  4384.100098      0
2  4327.689941  4006.669922  4295.379883  ...  4628.720215  4389.229980      0
3  4328.720215  4011.790039  4296.410156  ...  4632.310059  4396.410156      0
4  4326.149902  4011.790039  4292.310059  ...  4632.819824  4398.459961      0
5  4321.029785  4004.620117  4284.100098  ...  4628.209961  4389.740234      0
6  4319.490234  4001.030029  4280.509766  ...  4625.129883  4378.459961      0
7  4325.640137  4006.669922  4278.459961  ...  4622.049805  4380.509766      0
8  4326.149902  4010.770020  4276.410156  ...  4627.180176  4389.740234      0
9  4326.149902  4011.280029  4276.919922  ...  4637.439941  4393.330078      0

[10 rows x 15 columns]
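
`get_data` also returns `categorical_indicator`, which the example above does not use. A minimal sketch of how it could be combined with `attribute_names` when building a dataframe by hand. The array and both lists are small hand-written stand-ins rather than downloaded data, and treating the indicator as one boolean per feature column is an assumption about its layout:

```python
import numpy as np
import pandas as pd

# Stand-ins for the values returned by get_data(dataset_format='array').
X_toy = np.array([[4329.23, 0.0], [4324.62, 1.0], [4327.69, 1.0]])
names_toy = ["V1", "flag"]
is_categorical_toy = [False, True]  # assumed: one boolean per column

frame = pd.DataFrame(X_toy, columns=names_toy)

# Cast the columns flagged as categorical to the pandas 'category' dtype;
# the remaining columns keep their numeric dtype.
for column, is_cat in zip(names_toy, is_categorical_toy):
    if is_cat:
        frame[column] = frame[column].astype("category")

print(frame.dtypes)
```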

Instead of manually creating the dataframe, you can request a dataframe directly, which comes with the correct dtypes.

X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
    dataset_format='dataframe'
)
print(X.head())
print(X.info())

Out:

        V1       V2       V3       V4  ...      V11      V12      V13      V14
0  4329.23  4009.23  4289.23  4148.21  ...  4211.28  4280.51  4635.90  4393.85
1  4324.62  4004.62  4293.85  4148.72  ...  4207.69  4279.49  4632.82  4384.10
2  4327.69  4006.67  4295.38  4156.41  ...  4206.67  4282.05  4628.72  4389.23
3  4328.72  4011.79  4296.41  4155.90  ...  4210.77  4287.69  4632.31  4396.41
4  4326.15  4011.79  4292.31  4151.28  ...  4212.82  4288.21  4632.82  4398.46

[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14980 entries, 0 to 14979
Data columns (total 14 columns):
V1     14980 non-null float64
V2     14980 non-null float64
V3     14980 non-null float64
V4     14980 non-null float64
V5     14980 non-null float64
V6     14980 non-null float64
V7     14980 non-null float64
V8     14980 non-null float64
V9     14980 non-null float64
V10    14980 non-null float64
V11    14980 non-null float64
V12    14980 non-null float64
V13    14980 non-null float64
V14    14980 non-null float64
dtypes: float64(14)
memory usage: 1.6 MB
None

Sometimes you only need access to a dataset’s metadata. In those cases, you can fetch the dataset description without downloading the data file. The dataset object can be used as normal. The data file is downloaded only when you use functionality that requires it, such as get_data.

dataset = openml.datasets.get_dataset(1471, download_data=False)
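
The deferred download described above is an instance of lazy loading. The following is a generic sketch of that pattern and not OpenML's implementation; the class `LazyDataset`, its `fetch` callback, and the `downloads` counter are hypothetical names introduced for illustration:

```python
class LazyDataset:
    """Holds metadata up front and fetches the data only when asked."""

    def __init__(self, name, fetch):
        self.name = name      # metadata, available immediately
        self._fetch = fetch   # callable that performs the (simulated) download
        self._data = None     # filled on the first get_data() call
        self.downloads = 0    # counts how often fetch actually ran

    def get_data(self):
        if self._data is None:  # first access triggers the download
            self._data = self._fetch()
            self.downloads += 1
        return self._data


# The lambda stands in for a real download and returns one hand-written row.
lazy = LazyDataset("eeg-eye-state", fetch=lambda: [[4329.23, 4009.23]])
print(lazy.name, lazy.downloads)  # metadata is usable, nothing fetched yet

rows = lazy.get_data()        # triggers the simulated download
rows_again = lazy.get_data()  # served from the cached copy
print(lazy.downloads)
```

The design choice is the usual trade-off: metadata queries stay cheap, at the cost of a delay on the first call that needs the data.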

Exercise 2

  • Explore the data visually.

eegs = eeg.sample(n=1000)
_ = pd.plotting.scatter_matrix(
    eegs.iloc[:100, :4],
    c=eegs[:100]['class'],
    figsize=(10, 10),
    marker='o',
    hist_kwds={'bins': 20},
    alpha=.8,
    cmap='plasma'
)
[Figure: scatter matrix of the first four EEG features for 100 sampled rows, colored by class]
