Data Repositories

This is a list of public dataset repositories that we aim to connect to in order to bring more varied datasets into OpenML. Their data formats vary widely, so we need both manual selection and automatic conversion or metadata extraction to make the datasets easily usable.
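As a rough illustration of what automatic metadata extraction could look like for the simplest case (a CSV file with a header row), here is a minimal sketch using only the Python standard library. The function name `extract_metadata`, the set of missing-value tokens, and the numeric/nominal split are assumptions made for this example; they are not part of OpenML or of any repository listed here.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def extract_metadata(csv_text, missing_tokens=("", "?", "NA")):
    """Summarize CSV text that has a header row.

    Hypothetical helper: returns the row count, the column count, and for
    each column a guessed type ("numeric" or "nominal") plus the number of
    missing values. A column with no non-missing values stays "numeric".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: {"type": "numeric", "missing": 0} for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name, info in columns.items():
            # Short rows yield None for absent fields; treat them as empty.
            value = (row[name] or "").strip()
            if value in missing_tokens:
                info["missing"] += 1
            elif not _is_number(value):
                # One non-numeric value is enough to call the column nominal.
                info["type"] = "nominal"
    return {"n_rows": n_rows, "n_columns": len(columns), "columns": columns}


if __name__ == "__main__":
    sample = "age,colour,income\n31,red,52000\n?,blue,48000.5\n45,,NA\n"
    print(extract_metadata(sample))
```

Real repositories would need far more than this (other file formats, encodings, target-attribute detection, licensing information), which is why manual selection is still required.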

A collection of sources made by different users

Machine learning dataset repositories (mostly already in OpenML)

APIs (mostly defunct):

- databrewer (Python): https://pypi.org/project/databrewer/
- PyDataset (Python): https://github.com/iamaziz/PyDataset (wrapper for Rdatasets?)
- RDatasets (R): https://github.com/vincentarelbundock/Rdatasets

Time series data:

Deep learning datasets (mostly image data)

Extreme classification:

MLData (down)

AutoWEKA datasets:

Kaggle public datasets

RAMP Challenge datasets

Wolfram data repository

Data.world

Figshare (needs digging, lots of Excel files)

KDNuggets list of data sets (meta-list, lots of stuff here):

Benchmark Data Sets for Highly Imbalanced Binary Classification

Feature Selection Challenge Datasets

BigML's list of 1000+ data sources

Massive list from Data Science Central

R packages (also see https://github.com/openml/openml-r/issues/185)

UTwente Activity recognition datasets:

Vanderbilt:

Quandl

Microarray data:

Medical data:

Nature.com Scientific data repositories list