Datasets
Data Formats¶
To guarantee interoperability, we focus on a limited set of data formats. We aim to support all sorts of data, but for the moment we only fully support tabular data in the ARFF format. We are currently working on supporting a much wider range of formats.
ARFF definition. Also check that attribute definitions do not mix spaces and tabs, and do not include end-of-line comments.
Data repositories¶
This is a list of public dataset repositories that offer additional useful machine learning datasets. These have widely varying data formats, so they require manual selection, parsing and meta-data extraction.
A collection of sources made by different users
- https://github.com/caesar0301/awesome-public-datasets
- https://dreamtolearn.com/ryan/1001_datasets
- https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
- https://pathmind.com/wiki/open-datasets
- https://paperswithcode.com/
- https://medium.com/towards-artificial-intelligence/best-datasets-for-machine-learning-data-science-computer-vision-nlp-ai-c9541058cf4f
- https://lionbridge.ai/datasets/the-50-best-free-datasets-for-machine-learning/
- https://www.v7labs.com/open-datasets?utm_source=v7&utm_medium=email&utm_campaign=edu_outreach
Machine learning dataset repositories (mostly already in OpenML)
- UCI: https://archive.ics.uci.edu/ml/index.html
- KEEL: http://sci2s.ugr.es/keel/datasets.php
- LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
- AutoWEKA datasets: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/
- skData package: https://github.com/jaberg/skdata/tree/master/skdata
- Rdatasets: http://vincentarelbundock.github.io/Rdatasets/datasets.html
- DataBrewer: https://github.com/rmax/databrewer
- liac-arff: https://github.com/renatopp/arff-datasets
MS Open datasets:
APIs (mostly defunct):
- databrewer (Python): https://pypi.org/project/databrewer/
- PyDataset (Python): https://github.com/iamaziz/PyDataset (wrapper for Rdatasets?)
- RDatasets (R): https://github.com/vincentarelbundock/Rdatasets
Time series / Geo data:
- Data commons: https://datacommons.org/
- UCR: http://timeseriesclassification.com/
- Older version: http://www.cs.ucr.edu/~eamonn/time_series_data/
Deep learning datasets (mostly image data)
- https://www.tensorflow.org/datasets/catalog/overview
- http://deeplearning.net/datasets/
- https://deeplearning4j.org/opendata
- http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
- https://paperswithcode.com/datasets
Extreme classification:
MLData (down)
AutoWEKA datasets:
Kaggle public datasets
RAMP Challenge datasets
Wolfram data repository
Data.world
Figshare (needs digging, lots of Excel files)
KDNuggets list of data sets (meta-list, lots of stuff here):
Benchmark Data Sets for Highly Imbalanced Binary Classification
Feature Selection Challenge Datasets
BigML's list of 1000+ data sources
Massive list from Data Science Central.
R packages (also see https://github.com/openml/openml-r/issues/185)
- http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
- mlbench
- Stata datasets: http://www.stata-press.com/data/r13/r.html
UTwente Activity recognition datasets:
Vanderbilt:
Quandl
Microarray data:
- http://genomics-pubs.princeton.edu/oncology/
- http://svitsrv25.epfl.ch/R-doc/library/multtest/html/golub.html
Medical data:
- http://www.healthdata.gov/
- http://homepages.inf.ed.ac.uk/rbf/IAPR/researchers/PPRPAGES/pprdat.htm
- http://hcup-us.ahrq.gov/
- https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html
- https://nsduhweb.rti.org/respweb/homepage.cfm
- http://orwh.od.nih.gov/resources/policyreports/womenofcolor.asp
Nature.com Scientific data repositories list