Datasets

How to list and download datasets.

import openml
import pandas as pd

Exercise 0

  • List datasets
      • Use the output_format parameter to select the output type
      • The default is ‘dict’ (other option: ‘dataframe’)

openml_list = openml.datasets.list_datasets()  # returns a dict

# Show a nice table with some key data properties
datalist = pd.DataFrame.from_dict(openml_list, orient='index')
datalist = datalist[[
    'did', 'name', 'NumberOfInstances',
    'NumberOfFeatures', 'NumberOfClasses'
]]

print("First 10 of %s datasets..." % len(datalist))
datalist.head(n=10)

# The same can be done in fewer lines of code
openml_df = openml.datasets.list_datasets(output_format='dataframe')
openml_df.head(n=10)

Out:

First 10 of 2946 datasets...
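The listing call also supports server-side filtering, so you do not need to fetch the full list. A small sketch, assuming the size and tag parameters of list_datasets (which limit and restrict the results on the server); the 'uci' tag is only an example:

# Assumption: `size` limits the number of returned datasets and `tag`
# restricts the listing to datasets carrying that tag on the server.
small_df = openml.datasets.list_datasets(
    size=100,
    tag='uci',
    output_format='dataframe'
)
small_df[['did', 'name', 'NumberOfInstances']].head(n=10)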

Exercise 1

  • Find datasets with more than 10000 examples.

  • Find a dataset called ‘eeg-eye-state’.

  • Find all datasets with more than 50 classes.

datalist[datalist.NumberOfInstances > 10000
         ].sort_values(['NumberOfInstances']).head(n=20)
datalist.query('name == "eeg-eye-state"')
datalist.query('NumberOfClasses > 50')
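These filters can also be combined into a single query; a small extra sketch, not part of the original exercise:

# Large datasets that also have many classes, biggest first.
datalist.query('NumberOfClasses > 50 and NumberOfInstances > 10000') \
        .sort_values('NumberOfInstances', ascending=False).head(n=10)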

Download datasets

# This is done based on the dataset ID.
dataset = openml.datasets.get_dataset(1471)

# Print a summary
print("This is dataset '%s', the target feature is '%s'" %
      (dataset.name, dataset.default_target_attribute))
print("URL: %s" % dataset.url)
print(dataset.description[:500])

Out:

This is dataset 'eeg-eye-state', the target feature is 'Class'
URL: https://www.openml.org/data/v1/download/1587924/eeg-eye-state.arff
**Author**: Oliver Roesler
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State), Baden-Wuerttemberg, Cooperative State University (DHBW), Stuttgart, Germany
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after
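Besides the description, the dataset object carries further metadata. A hedged sketch, assuming the qualities (dataset statistics) and features (per-column metadata) attributes of the dataset object:

# Assumed attributes of the OpenML dataset object:
# `qualities` maps quality names to values, `features` describes each column.
print(dataset.qualities.get('NumberOfInstances'))
print(list(dataset.features.values())[:3])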

Get the actual data.

The dataset can be returned in several formats: as a NumPy array or SciPy sparse matrix, or as a pandas DataFrame (or SparseDataFrame if the data is sparse). The format is controlled with the dataset_format parameter, which can be either ‘array’ (the default) or ‘dataframe’. Let’s first get the data as a NumPy array and manually create a dataframe.

X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='array',
    target=dataset.default_target_attribute
)
eeg = pd.DataFrame(X, columns=attribute_names)
eeg['class'] = y
print(eeg[:10])

Out:

            V1           V2           V3  ...          V13          V14  class
0  4329.229980  4009.229980  4289.229980  ...  4635.899902  4393.850098      0
1  4324.620117  4004.620117  4293.850098  ...  4632.819824  4384.100098      0
2  4327.689941  4006.669922  4295.379883  ...  4628.720215  4389.229980      0
3  4328.720215  4011.790039  4296.410156  ...  4632.310059  4396.410156      0
4  4326.149902  4011.790039  4292.310059  ...  4632.819824  4398.459961      0
5  4321.029785  4004.620117  4284.100098  ...  4628.209961  4389.740234      0
6  4319.490234  4001.030029  4280.509766  ...  4625.129883  4378.459961      0
7  4325.640137  4006.669922  4278.459961  ...  4622.049805  4380.509766      0
8  4326.149902  4010.770020  4276.410156  ...  4627.180176  4389.740234      0
9  4326.149902  4011.280029  4276.919922  ...  4637.439941  4393.330078      0

[10 rows x 15 columns]

Instead of creating the dataframe manually, you can request a dataframe with the correct dtypes directly.

X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
    dataset_format='dataframe'
)
print(X.head())
print(X.info())

Out:

/home/travis/build/openml/openml-python/openml/datasets/dataset.py:448: UserWarning: Cannot convert non-sparse data to 'dataframe'. Returning input data.
  warn("Cannot convert {} to '{}'. Returning input data.".format(data_type, array_format))
        V1       V2       V3       V4  ...      V11      V12      V13      V14
0  4329.23  4009.23  4289.23  4148.21  ...  4211.28  4280.51  4635.90  4393.85
1  4324.62  4004.62  4293.85  4148.72  ...  4207.69  4279.49  4632.82  4384.10
2  4327.69  4006.67  4295.38  4156.41  ...  4206.67  4282.05  4628.72  4389.23
3  4328.72  4011.79  4296.41  4155.90  ...  4210.77  4287.69  4632.31  4396.41
4  4326.15  4011.79  4292.31  4151.28  ...  4212.82  4288.21  4632.82  4398.46

[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14980 entries, 0 to 14979
Data columns (total 14 columns):
V1     14980 non-null float64
V2     14980 non-null float64
V3     14980 non-null float64
V4     14980 non-null float64
V5     14980 non-null float64
V6     14980 non-null float64
V7     14980 non-null float64
V8     14980 non-null float64
V9     14980 non-null float64
V10    14980 non-null float64
V11    14980 non-null float64
V12    14980 non-null float64
V13    14980 non-null float64
V14    14980 non-null float64
dtypes: float64(14)
memory usage: 1.6 MB
None

Sometimes you only need a dataset’s metadata. In that case, you can fetch the dataset without downloading the data file. The resulting dataset object can be used as normal; whenever you call functionality that requires the data, such as get_data, the data is downloaded on demand.

dataset = openml.datasets.get_dataset(1471, download_data=False)
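Calling anything that needs the actual data fetches it on demand; the same get_data call from above works unchanged (a small sketch to illustrate the lazy download):

# Metadata such as the name and default target is available immediately;
# the data file itself is only downloaded when get_data() is called.
print(dataset.name, dataset.default_target_attribute)
X, y, _, _ = dataset.get_data(
    target=dataset.default_target_attribute,
    dataset_format='dataframe'
)  # this call triggers the actual download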

Exercise 2

  • Explore the data visually.

eegs = eeg.sample(n=1000)
_ = pd.plotting.scatter_matrix(
    eegs.iloc[:100, :4],
    c=eegs[:100]['class'],
    figsize=(10, 10),
    marker='o',
    hist_kwds={'bins': 20},
    alpha=.8,
    cmap='plasma'
)
[Figure: scatter matrix of the first four EEG channels (V1–V4), coloured by class]
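As a quick complementary check (not part of the original script), you can look at the class balance before reading too much into the colours of the scatter plot:

# Number of samples in each eye-state class.
print(eeg['class'].value_counts())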
