Data Repositories

This is a list of public dataset repositories that we aim to connect to in order to bring more varied datasets into OpenML. Their data formats vary widely, so we need both manual selection and automatic conversion or metadata extraction to make them easily usable.
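As a rough illustration of what automatic metadata extraction could look like for the simplest case (a CSV file with a header row), here is a minimal sketch using only the Python standard library. The function name `extract_metadata`, the `numeric`/`nominal` type labels, and the treatment of empty cells and `?` as missing values are assumptions made for this example. They are not part of any OpenML API.

```python
import csv
import io


def extract_metadata(csv_text):
    """Return basic metadata for CSV text; the first row is assumed to be a header."""
    # Skip completely empty lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    def is_numeric(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    features = []
    for i, name in enumerate(header):
        column = [row[i] for row in data if i < len(row)]
        # Assumption: empty cells and "?" mark missing values.
        present = [v for v in column if v.strip() not in ("", "?")]
        numeric = bool(present) and all(is_numeric(v) for v in present)
        features.append({
            "name": name,
            "type": "numeric" if numeric else "nominal",
            "missing": len(data) - len(present),
        })
    return {
        "n_instances": len(data),
        "n_features": len(header),
        "features": features,
    }


sample = "age,color\n31,red\n,blue\n47,?\n"
meta = extract_metadata(sample)
print(meta["n_instances"], meta["n_features"])
for feature in meta["features"]:
    print(feature)
```

Real repositories will need far more than this (other file formats, encodings, target detection, licensing information), which is why manual selection remains part of the process.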

A collection of sources made by different users

Machine learning dataset repositories (mostly already in OpenML)

Time series data

Deep learning datasets (mostly image data)

Extreme classification

MLData (will merge with OpenML in 2017)

AutoWEKA datasets

Kaggle public datasets

RAMP Challenge datasets

Wolfram data repository

Figshare (needs digging, lots of Excel files)

KDnuggets list of datasets (meta-list, lots of stuff here)

Benchmark Data Sets for Highly Imbalanced Binary Classification

Feature Selection Challenge Datasets

BigML's list of 1000+ data sources

Massive list from Data Science Central

R packages (also see

UTwente activity recognition datasets



Microarray data

Medical data: Scientific data repositories list