From Agriculture to Zoology, there is a wealth of open data in virtually every discipline that can be used for research & development in Human-Centered AI and Machine Learning.
The Human-Centered AI Lab (Holzinger Group) fully supports the “open” movement, i.e. open access, open source and open data. The idea of “open data” is not new. Many researchers have followed the notion that science is a public enterprise and that certain data should be openly available [1], [2]. For example, the British Medical Journal (BMJ) launched a major open data campaign as early as 2012 [3]. With the launch of open government data initiatives, the open data movement gained momentum [4], and some already speak of an Open Knowledge Foundation [5]. On this page we provide our students with a collection of links to selected open data sets that have an impact on human life, sorted alphabetically from Agriculture to Zoology.
[1] L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.
[2] G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.
[3] M. Thompson and C. Heneghan, “BMJ OPEN DATA CAMPAIGN We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.
[4] N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.
[5] J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” PLoS Biology, vol. 9, Dec 2011.
Agriculture:
- Lu/Young, A survey of public datasets for computer vision tasks in precision agriculture, Computers and Electronics in Agriculture https://doi.org/10.1016/j.compag.2020.105760: surveys 34 public datasets on weed control, fruit detection and on miscellaneous applications
- Tseng et al, CropHarvest: a global satellite dataset for crop type classification, NeurIPS 2021 paper https://openreview.net/pdf?id=JtjzUXPEaCu; dataset https://github.com/nasaharvest/cropharvest; satellite dataset of more than 90,000 geographically diverse samples with agricultural class labels
- Singh et al, PlantDoc: A Dataset for Visual Plant Disease Detection, paper https://arxiv.org/pdf/1911.10317.pdf; dataset: https://public.roboflow.com/object-detection/plantdoc; PlantDoc is a dataset of 2,569 images across 13 plant species and 30 classes (diseased and healthy) for image classification and object detection. There are 8,851 labels.
- Danilevicz et al, Resources for image-based high-throughput phenotyping in crops and data sharing challenges, Plant Physiology 2021, 699 https://doi.org/10.1093/plphys/kiab301; Includes a comprehensive list of available phenotypic datasets to assist crop breeders and tool developers
- Haug/Ostermann, A Crop/Weed Field Image Dataset for the Evaluation of Computer Vision Based Precision Agriculture Tasks, in Agapito/Bronstein/Rother (eds), Computer Vision – ECCV 2014 Workshops (2015); a benchmark dataset for crop/weed discrimination, single plant phenotyping and other open computer vision tasks in precision agriculture. The dataset comprises 60 images with annotations. https://github.com/cwfid
- Statistik Austria, Agriculture and forestry, statistical census data on land use, livestock etc http://www.statistik.at/web_en/statistics/Economy/agriculture_and_forestry/index.html
- Eden Library: High quality plant datasets https://edenlibrary.ai/
- Quantitative plant: website presenting image analysis software tools and models for plants https://www.quantitative-plant.org/
- Kaggle: 432 datasets tagged with “agriculture”: diverse (international) datasets e.g. on food consumption, plant infections, fertilizer, pesticides https://www.kaggle.com/datasets?search=tag%3A%27agriculture%27
- Data world: 1000 international and diverse datasets tagged with “agriculture” e.g. on fish, crop statistics from Africa, GPS data https://data.world/datasets/agriculture
- IEEE data port, datasets on “precision agriculture” e.g. sensor data of mobile sink, on sugarcane/paddy vegetation https://ieee-dataport.org/keywords/precision-agriculture
- [Potter, How to Get Machine Learning Training Datasets for Agriculture? (2020) https://medium.com/nerd-for-tech/how-to-get-machine-learning-training-datasets-for-agriculture-6e98d090b414: lists different types of datasets in agriculture]
- [Zheng et al, CropDeep: The Crop Vision Dataset for Deep-Learning-Based Classification and Detection in Precision Agriculture, Sensors 2019, https://doi.org/10.3390/s19051058; CropDeep species classification and detection dataset, consisting of 31,147 images with over 49,000 annotated instances from 31 different classes]
Climate:
- Meta list: Ambalina, 11 Best Climate Change Datasets for Data Science Projects (2021) https://hackernoon.com/top-datasets-on-climate-change-for-data-science-projects-rzz34p0
- Meta list: Pangeo ML Datasets, Weather and Climate Datasets for AI Research, collection of preprocessed and raw datasets and hybrid ML-physics models http://mldata.pangeo.io/
- Kaggle Climate Change: Earth Surface Temperature Data, temperature recordings from the Earth’s surface since 1750 https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
- World Bank, Data from World Development Indicators and Climate Change Knowledge Portal on climate systems, exposure to climate impacts, resilience, greenhouse gas emissions, and energy use https://datacatalog.worldbank.org/search/dataset/0040205
- Kaggle International Greenhouse Gas Emissions (carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), hydrofluorocarbons (HFCs), perfluorocarbons (PFCs), unspecified mix of HFCs and PFCs, sulphur hexafluoride (SF6), nitrogen trifluoride (NF3)) from 1990 to 2017 https://www.kaggle.com/unitednations/international-greenhouse-gas-emissions
- EarthData, https://earthdata.nasa.gov/: collection of diverse datasets about the earth including climate data
- Kaggle Daily Sea Ice Extent Data. Total sea ice extent from 1978 to present. This dataset has information on the Earth’s cryosphere, and includes glacier, ice, snow and frozen ground data. The dataset has seven columns: year, month, day, extent, missing, source, and hemisphere https://www.kaggle.com/nsidcorg/daily-sea-ice-extent-data
- Abatzoglou et al, TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015, Scientific Data 2018, https://doi.org/10.1038/sdata.2017.191; dataset: https://climatedataguide.ucar.edu/climate-data/terraclimate-global-high-resolution-gridded-temperature-precipitation-and-other-water; a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958–2015
- Prabhat et al, ClimateNet: an expert-labeled open dataset and deep learning architecture for enabling high-precision analyses of extreme weather, Geoscientific Model Development, 2021, 107 https://doi.org/10.5194/gmd-14-107-2021; dataset: https://portal.nersc.gov/project/ClimateNet/; an open, community-sourced human-expert-labeled curated dataset that captures tropical cyclones (TCs) and atmospheric rivers (ARs) in high resolution climate model output from a simulation of a recent historical period.
- Max-Planck Institut für Biogeochemie, Jena, Wetterstation Beutenberg, Dataset consists of 14 features such as temperature, pressure, humidity, etc; https://www.bgc-jena.mpg.de/wetter/ (see as supplementary material Salinas, Climate Forecasting with Deep Learning and Keras (2021) https://diegosalinas-47084.medium.com/climate-forecasting-with-deep-learning-and-keras-ba75f72e9672)
- Läderach et al, Replication Data for: Climate change adaptation of coffee production in space and time (2019), paper: https://doi.org/10.7910/DVN/TSUPE1; dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TSUPE1; impact of climate change on coffee production quality in Nicaragua.
- Climate Change Knowledge Portal, provides global data on historical and future climate, vulnerabilities and impacts https://climateknowledgeportal.worldbank.org
- Climate Data Online; provides free access to NCDC’s archive of global historical weather and climate data in addition to station history information. These data include quality controlled daily, monthly, seasonal, and yearly measurements of temperature, precipitation, wind, and degree days as well as radar data and 30-year Climate Normals; https://datacatalog.library.wayne.edu/dataset/climate-data-online
- Climate Data: National Climate Centre, Bureau of Meteorology. Three datasets containing climate data, compiled in April 2011. These datasets include observations from stations in all Australian States and Territories. https://data.gov.au/data/dataset/climate-data-national-climate-centre-bureau-of-meteorology
- Litman et al, Climate Change Tweets Ids (2019), https://doi.org/10.7910/DVN/5QCCUU; dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2F5QCCUU&; contains the IDs from over 39 million tweets about climate change; tweets were tracked and curated using hashtags related to climate change
- CrowdFlower, Sentiment of Climate Change; Contributors evaluated tweets for belief in the existence of global warming or climate change; a confidence score for the classification of each tweet is also provided; https://data.world/crowdflower/sentiment-of-climate-change
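As a minimal sketch of how the Daily Sea Ice Extent data listed above might be explored, the following Python snippet parses CSV text with the seven columns named in that entry (year, month, day, extent, missing, source, hemisphere) and computes the mean extent per year and hemisphere. The sample rows, the lower-case header spelling and the function name `mean_extent_by_year` are illustrative assumptions; the header formatting of the real file may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the values are made up for illustration only.
SAMPLE_CSV = """year,month,day,extent,missing,source,hemisphere
1978,10,26,10.0,0.0,example,north
1978,10,28,11.0,0.0,example,north
1978,10,26,17.0,0.0,example,south
1979,1,1,15.0,0.0,example,north
"""


def mean_extent_by_year(csv_text):
    """Return {(year, hemisphere): mean extent} for the given CSV text."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["year"]), row["hemisphere"].strip())
        sums[key] += float(row["extent"])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


result = mean_extent_by_year(SAMPLE_CSV)
for (year, hemisphere), mean in sorted(result.items()):
    print(year, hemisphere, round(mean, 2))
```

For a downloaded file, the same function could be fed the file's contents as a string; only the standard library is used here so that the sketch runs without extra packages.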
Forestry:
- Diez et al, Deep Learning in Forestry Using UAV-Acquired RGB Data: A Practical Review, Remote Sensing 2021 https://doi.org/10.3390/rs13142837; surveys and lists available datasets
- Blackard/Dean/Anderson, Covertype DataSet: Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. https://archive.ics.uci.edu/ml/datasets/covertype (see also as a supplement Iyim, Predicting Forest Cover Types with the Machine Learning Workflow (2020) https://towardsdatascience.com/predicting-forest-cover-types-with-the-machine-learning-workflow-1f6f049bf4df)
- Mendeley Data, See the forest and the trees: Effective machine and deep learning algorithms for wood filtering and tree species classification from terrestrial laser scanning; dataset includes 45 natural forest scan clips with manually labelled classes (1: stem, 2: branch, 3: other). https://data.mendeley.com/datasets/4gbzk9sy24/1
- Melander/Einola/Ritala, Fusion of open forest data and machine fieldbus data for performance analysis of forest machines, European Journal of Forest Research 2020, https://doi.org/10.1007/s10342-019-01237-8; surveying (and combining) datasets available via the Finnish Open Source platform
- Cortez/Morais, A Data Mining Approach to Predict Forest Fires using Meteorological Data (2007) paper: http://www3.dsi.uminho.pt/pcortez/fires.pdf; Dataset: https://archive.ics.uci.edu/ml/datasets/forest+fires; includes data on date, spatial coordinates, temp, wind, rain, humidity etc
- [requires registration] Planet, Access high-resolution satellite monitoring of the tropics to reduce and reverse tropical forest loss. Provides mosaic data covering tropical forested regions between 30 degrees North and 30 degrees South. https://www.planet.com/nicfi/
- [Competition Kaggle Forest Cover Type Prediction (2014): In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data); includes datasets (using the dataset from Blackard/Dean/Anderson mentioned above): The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations). https://www.kaggle.com/c/forest-cover-type-prediction/overview]
- [Competition Kaggle, Planet: Understanding the Amazon from Space. Use satellite data to track the human footprint in the Amazon rainforest (2017); includes training data i.e. imagery of the Amazon basin captured by Planet’s Flock 2 satellites between January 1st, 2016, and February 1st, 2017. The images contain the visible red (R), green (G), and blue (B) and near-infrared (NIR) bands. https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/overview; see as supplementary material Di Martino, Monitoring deforestation with open data and Machine Learning — Part 1 (2021) https://medium.com/digital-sense-ai/monitoring-deforestation-with-open-data-and-machine-learning-part-1-24d29c346752]
Law:
- Metalist: Guha, Datasets for Machine Learning in Law; lists available datasets e.g. for judgement prediction, document/contract annotation, summarization, question answering and documentation classification; https://github.com/neelguha/legal-ml-datasets
- Metalist: Must-read Papers on Legal Intelligence, Datasets; includes a survey of available datasets (most of them in English or Chinese) https://github.com/thunlp/LegalPapers
- Chalkidis/Fergadiotis/Androutsopoulos, MultiEURLEX — A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, https://arxiv.org/abs/2109.00904; dataset: MultiEURLEX comprises 65k EU laws in 23 official EU languages; https://huggingface.co/datasets/multi_eurlex
- Chalkidis et al, LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, https://arxiv.org/abs/2110.00976; dataset: Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset to evaluate the performance of NLP methods in legal tasks; https://huggingface.co/datasets/lex_glue
- Chalkidis et al, Large-Scale Multi-Label Text Classification on EU Legislation, https://arxiv.org/abs/1906.02192; dataset: EURLEX57K is a publicly available legal LMTC dataset containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ∼3k labels (concepts) from the European Vocabulary (EUROVOC); https://paperswithcode.com/dataset/eurlex57k
- Chalkidis/Androutsopoulos/Aletras, Neural Legal Judgment Prediction in English, https://arxiv.org/abs/1906.0205; dataset: ECHR is an English legal judgment prediction dataset of cases from the European Court of Human Rights (ECHR). The dataset contains ~11.5k cases, including the raw text; https://paperswithcode.com/dataset/echr
- Hendrycks et al, CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review, https://arxiv.org/abs/2103.06268; dataset: Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contract review in connection with a corporate transaction, including mergers & acquisitions, etc; https://www.atticusprojectai.org/cuad
- Holzenberger/Blair-Stanek/Van Durme, A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering, http://ceur-ws.org/Vol-2645/paper5.pdf; Holzenberger/Van Durme, Factoring Statutory Reasoning as Language Understanding Challenges, https://arxiv.org/abs/2105.07903; dataset: SARA (StAtutory Reasoning Assessment); a dataset for statutory reasoning in tax law entailment and question answering; https://paperswithcode.com/dataset/sara
- Ostendorff/Blume/Ostendorff, Towards an Open Platform for Legal Information, https://arxiv.org/abs/2005.13342; Open Legal Data: collection of diverse legal datasets; https://github.com/openlegaldata
- Zheng et al, When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings, in ACM (eds), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (2021) 159-168, https://doi.org/10.1145/3462757.3466088; dataset: CaseHOLD dataset (Case Holdings On Legal Decisions) provides 53,000+ multiple choice questions with prompts from a judicial decision and multiple potential holdings, one of which is correct, that could be cited; https://github.com/reglab/casehold
- German Legal Sentences (GLS) is an automatically generated training dataset for semantic sentence matching and citation recommendation in the domain of German legal documents; https://huggingface.co/datasets/lavis-nlp/german_legal_sentences
- Legal Case Reports Data Set: dataset contains Australian legal cases from the Federal Court of Australia (FCA); https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports
- Model: Chalkidis et al, LEGAL-BERT: The Muppets straight out of Law School, https://arxiv.org/abs/2010.02559; LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications; https://huggingface.co/nlpaueb/legal-bert-base-uncased
Zoology:
- Metalist: Wikipedia List of datasets for machine-learning research, Biological Data/Animal, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research#Animal
- LILA BC; Labeled Information Library of Alexandria: Biology and Conservation: a repository for data sets related to biology and conservation, intended as a resource for both machine learning (ML) researchers and those that want to harness ML for biology and conservation; diverse datasets e.g. camera traps, seals, bees and pollen, seabirds, zebras and giraffes; https://lila.science/datasets
- Swanson et al, Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna, Scientific Data 2015, https://www.nature.com/articles/sdata201526; dataset: Snapshot Serengeti contains approximately 2.65M sequences of camera trap images, totalling 7.1M images; labels are provided for 61 categories, primarily at the species level (for example, the most common labels are wildebeest, zebra, and Thomson’s gazelle); https://lila.science/datasets/snapshot-serengeti
- Clark/Schreter/Adams, A quantitative comparison of dystal and backpropagation, Proceedings of 1996 Australian Conference on Neural Networks (1996); dataset: Abalone Data Set; physical measurements of abalone; weather patterns and location are also given; https://archive.ics.uci.edu/ml/datasets/abalone
- Jiang/Zhou, Editing Training Data for kNN Classifiers with Neural Network Ensemble, Advances in Neural Networks 2004, 356–361; dataset: Zoo Data Set, seven classes of animals, containing 17 Boolean-valued attributes; https://archive.ics.uci.edu/ml/datasets/zoo
- Ontañón/Plaza, On Similarity Measures Based on a Refinement Lattice, in McGinty/Wilson (eds), Case-Based Reasoning Research and Development. 8th International Conference on Case-Based Reasoning, ICCBR 2009 Seattle, WA, USA, July 20-23, 2009 Proceedings (2009) 240-255; dataset: Demospongiae Data Set, dataset contains 503 sponges belonging to the Demospongiae class collected from the Mediterranean (451 sponges) and Atlantic oceans (52 sponges); each sponge is classified according to a hierarchy formed by order, family, genus and species; each order is subdivided into several families, each family into several genera, and each genus into several species; https://archive.ics.uci.edu/ml/datasets/Demospongiae
- Parkhi et al, Cats and Dogs, IEEE Conference on Computer Vision and Pattern Recognition 2012, https://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/; dataset: The Oxford-IIIT Pet Dataset: a 37-category pet dataset with roughly 200 images for each class; https://www.robots.ox.ac.uk/~vgg/data/pets/
- Van Horn et al, The iNaturalist Species Classification and Detection Dataset, https://arxiv.org/abs/1707.06642; dataset: contains 675,170 training and validation images from 5,089 natural fine-grained categories; those categories belong to 13 super-categories including Plantae (Plant), Insecta (Insect), Aves (Bird), Mammalia (Mammal) etc; https://paperswithcode.com/dataset/inaturalist
- Welinder et al, Caltech-UCSD Birds 200, California Institute of Technology http://www.vision.caltech.edu/visipedia/papers/WelinderEtal10_CUB-200.pdf; dataset: Caltech-UCSD Birds 200 is an image dataset with photos of 200 bird species (mostly North American); http://www.vision.caltech.edu/visipedia/CUB-200.html
- Xian et al, Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly; IEEE Transactions on Pattern Analysis and Machine Intelligence 2019, 2251-2265, 10.1109/TPAMI.2018.2857768; dataset: Animals with Attributes 2; dataset consists of 37322 images of 50 animal classes with pre-extracted feature representations for each image; https://cvml.ist.ac.at/AwA2/
- data.world, Measurements Used to Determine the Sex of Bristle-thighed Curlews (Numenius tahitiensis); https://data.world/us-doi-gov/bc0dcabe-a455-4b33-baf1-f17778008f1b
- data.world, Daily survival rates of grassland passerines and associated weather variables; https://data.world/us-doi-gov/b28a0d9c-9aef-4571-9a08-12c817740985
- data.world, Zoo Animal Lifespans, Life expectancy estimates for North American zoo and aquarium vertebrate animals; https://data.world/animals/zoo-animal-lifespans
- Kaggle, Zoo Animal Classification, consists of 101 animals from a zoo (there are 16 variables with various traits to describe the animals; the 7 class types are: mammal, bird, reptile, fish, amphibian, bug and invertebrate); https://www.kaggle.com/uciml/zoo-animal-classification
- Kaggle, Competition; The Nature Conservancy Fisheries Monitoring: eight target categories are available in this dataset: Albacore tuna, Bigeye tuna, Yellowfin tuna, Mahi Mahi, Opah, Sharks, Other (meaning that there are fish present but not in the above categories), and No Fish (meaning that no fish is in the picture); each image has only one fish category, except that there are sometimes very small fish in the pictures that are used as bait; https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data
- Kaggle, 10 Monkey Species: image dataset for fine-grain classification; https://www.kaggle.com/slothkong/10-monkey-species
- Kaggle, Animals-10: animal pictures of 10 different categories taken from google images; contains about 28K medium quality animal images belonging to 10 categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, elephant; https://www.kaggle.com/alessiocorrado99/animals10
- Kaggle, STL-10 Image Recognition Dataset: train models to recognize different animals and vehicles; with a corpus of 100,000 unlabeled images and 500 training images; https://www.kaggle.com/jessicali9530/stl10
- Kaggle, bird species classification: interspecies classification of species in high resolution images; https://www.kaggle.com/akash2907/bird-species-classification
- [Martineau et al, A survey on image-based insect classification, Pattern Recognition 2017, 273-284, https://doi.org/10.1016/j.patcog.2016.12.020; surveys a range of entomology datasets (not all are publicly available)]
- [Higuera/Gardiner/Cios, Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome, PlosOne 2015, https://doi.org/10.1371/journal.pone.0129126; dataset: Mice Protein Expression Data Set; expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning; https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression]
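Several of the entries above, such as the Zoo Data Set with its Boolean-valued attributes and seven animal classes, lend themselves to simple nearest-neighbour experiments. The following Python snippet is a minimal sketch of a 1-nearest-neighbour classifier using Hamming distance over Boolean traits. The four records, their placeholder names, the choice and order of the six traits, and the helper names `hamming` and `predict` are all illustrative assumptions and are not taken from the dataset files.

```python
# Illustrative records: (name, traits, class label).
# Assumed trait order: (hair, feathers, eggs, milk, aquatic, backbone).
TRAINING = [
    ("animal_a", (1, 0, 0, 1, 0, 1), "mammal"),
    ("animal_b", (0, 1, 1, 0, 0, 1), "bird"),
    ("animal_c", (0, 0, 1, 0, 1, 1), "fish"),
    ("animal_d", (0, 0, 1, 0, 0, 0), "bug"),
]


def hamming(a, b):
    """Count the positions at which two equal-length trait tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def predict(traits):
    """Return the class label of the training record closest to `traits`."""
    _, _, label = min(TRAINING, key=lambda rec: hamming(rec[1], traits))
    return label


# A hairy, milk-producing, aquatic animal with a backbone.
print(predict((1, 0, 0, 1, 1, 1)))
```

With the real data, the training list would be built from the downloaded file instead of being hard-coded, and a held-out split would be needed to estimate accuracy.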
Metalists of open data repositories:
Metalist 1: https://github.com/beamandrew/medical-data
Metalist 2: https://sites.google.com/site/aacruzr/image-datasets (with an even more detailed section on “Histology and Histopathology”)
Metalist 3: http://www.aylward.org/notes/open-access-medical-image-repositories
Metalist 4: https://radiopaedia.org/articles/imaging-data-sets-artificial-intelligence
Metalist 5: https://www.ilovephd.com/medical-image-datasets-download-links/
Metalist 6: https://medium.com/@ODSC/15-open-datasets-for-healthcare-830b19980d9
Metalist 7: https://lionbridge.ai/datasets/18-free-life-sciences-medical-datasets-for-machine-learning/
Metalist 8: https://opendatascience.com/15-open-datasets-for-healthcare/
Data repositories of general and public interest
- GenBank: GenBank is a genetic sequence database from the National Institutes of Health, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
https://www.ncbi.nlm.nih.gov/genbank/
- EMBL: The European Bioinformatics Institute is part of the European Molecular Biology Laboratory and maintains the world’s most comprehensive range of freely available and up-to-date molecular databases. Its services let users share data, perform complex queries and analyse the results. Anyone can download data and software, or use the web services. More information is available in the journal Nucleic Acids Research.
https://www.ebi.ac.uk/services
- HMCA: The Health and Medical Care Archive is a data archive of the Robert Wood Johnson Foundation that preserves and disseminates data collected by selected research projects and facilitates secondary analyses of the data. The data collections in HMCA include surveys of health care professionals and organizations, investigations of access to medical care, surveys on substance abuse, and evaluations of innovative programs for the delivery of health care. Its goal is to increase understanding of health and health care in the United States through secondary analysis.
https://www.icpsr.umich.edu/icpsrweb/HMCA/index.jsp
- http://apps.who.int/gho/data/node.resources WHO: Provides datasets based on global health priorities. The site includes an easy search and provides insights for topics along with the datasets.
- https://wonder.cdc.gov/Welcome.html CDC: Use this for US-specific public health data. The CDC maintains WONDER (Wide-ranging Online Data for Epidemiological Research), and its datasets are searchable by topic, state, and other factors.
Data repositories specialized by data types:
- UniProtKB/Swiss-Prot: is a manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.
https://www.uniprot.org/uniprot/
- MMMP: is an open access interactive multidatabase for research on melanoma biology and treatment.
https://www.mmmp.org/MMMP/
- KEGG: is a database for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
https://www.genome.jp/kegg/
- PDB: Since 1971, the Protein Data Bank archive has served as a repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies. The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.
https://www.wwpdb.org/
Data repositories specialized by organism:
- WormBase: is an international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes. Founded in 2000, the WormBase Consortium is led by Paul Sternberg of Caltech, Paul Kersey of the EBI, Matt Berriman of the Wellcome Trust Sanger Institute, and Lincoln Stein of the Ontario Institute for Cancer Research.
https://www.wormbase.org
- FlyBase: is a data repository project carried out by a consortium of Drosophila researchers and computer scientists at Harvard University, University of Cambridge (UK), Indiana University and the University of New Mexico.
https://flybase.org/
- Human Brain NeuroMorpho: is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata.
https://neuromorpho.org/
Data Repositories for Scientific Research
- http://www.chdstudies.org/research/information_for_researchers.php CHDS: Child Health and Development Studies datasets are intended for research into how disease and health pass down through generations. They support research into not just genomic expression but also how social, environmental, and cultural factors play into disease and health.
- http://leo.ugr.es/elvira/DBCRepository/ Kent Ridge Biomedical Datasets: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).
- https://www.kaggle.com/c/MerckActivity/data Merck Molecular Health Activity Challenge: Datasets designed to foster the machine learning pursuit of drug discovery by simulating how molecule combinations could interact with each other.
- https://seer.cancer.gov/explorer/ SEER: Datasets arranged by demographic groups and provided by the US government. You can search based on age, race, and gender.
Image Data sets:
- CT Medical Images: This one is a small dataset, but it’s specifically cancer-related. It contains labeled images with age, modality, and contrast tags. Again, high-quality images associated with training data may help speed breakthroughs.
- Deep Lesion: One of the largest image sets currently available. CT images released from the NIH to help with better accuracy of lesion documentation and diagnosis. It includes over 32,000 lesions from 4000 unique patients.
- Cancer Instance Segmentation and Classification
https://www.kaggle.com/andrewmvd/cancer-inst-segmentation-and-classification
- EchoNet-Dynamic: A Large New Cardiac Motion Video Data Resource for Medical Machine Learning (requires registration) https://echonet.github.io/dynamic/index.html
- MedPix® is a free open-access online database of medical images, teaching cases, and clinical topics, integrating images and textual metadata including over 12,000 patient case scenarios, 9,000 topics, and nearly 59,000 images. Our primary target audience includes physicians and nurses, allied health professionals, medical students, nursing students and others interested in medical knowledge. https://medpix.nlm.nih.gov/home
- Functional MRI images for 539 individuals suffering from ASD and 573 typical controls. These 1112 datasets are composed of structural and resting state functional MRI data along with an extensive array of phenotypic information. Requires registration. http://fcon_1000.projects.nitrc.org/indi/abide/
- Cancer Imaging Archive (with numerous sub-collections): https://www.cancerimagingarchive.net/
- The DRIVE database is for comparative studies on segmentation of blood vessels in retinal images. It consists of 40 photographs, of which 7 show signs of mild early diabetic retinopathy https://drive.grand-challenge.org/
- ISIC Archive – Melanoma: This archive contains 23k images of classified skin lesions. It contains both malignant and benign examples. https://www.isic-archive.com/#!/topWithHeader/wideContentTop/main
- The Sunnybrook Cardiac Data (SCD), also known as the 2009 Cardiac MR Left Ventricle Segmentation Challenge data, consist of 45 cine-MRI images from a mix of patients and pathologies: healthy, hypertrophy, heart failure with infarction and heart failure without infarction. A subset of this data set was first used in the automated myocardium segmentation challenge from short-axis MRI, held by a MICCAI workshop in 2009. The complete data set is now available in the CAP database under a public domain license. http://www.cardiacatlas.org/studies/sunnybrook-cardiac-data/
- DDSM: The Digital Database for Screening Mammography (DDSM) is a resource for use by the mammographic image analysis research community. http://www.eng.usf.edu/cvprg/Mammography/Database.html
- The NLM Visible Human Project has created publicly-available complete, anatomically detailed, three-dimensional representations of a human male body and a human female body. Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man data set was publicly released in 1994 and the Visible Woman in 1995. The data sets were designed to serve as (1) a reference for the study of human anatomy, (2) public-domain data for testing medical imaging algorithms, and (3) a test bed and model for the construction of network-accessible image libraries. The VHP data sets have been applied to a wide range of educational, diagnostic, treatment planning, virtual reality, artistic, mathematical, and industrial uses. About 4,000 licensees from 66 countries were authorized to access the datasets. As of 2019, a license is no longer required to access the VHP datasets. https://www.nlm.nih.gov/research/visible/visible_human.html
- The mini-MIAS database of mammograms http://peipa.essex.ac.uk/info/mias.html
- Prostate Cancer Data Set: http://i2cvb.github.io/
- Multiple data sets: lesion segmentation in multiple sclerosis, x-rays, ultrasound images of the carotid, Datasets (ucy.ac.cy)
- VIA Group Public Databases: lung CT images in DICOM format together with radiologists’ documentation of abnormalities VIA (cornell.edu)
- SCR database: Segmentation in Chest Radiographs http://www.isi.uu.nl/Research/Databases/SCR/
- Histology (CIMA) dataset: The dataset consists of 2D histological microscopy tissue slices, stained with different stains, and landmarks denoting key-points in each slice https://cmp.felk.cvut.cz/~borovji3/?page=dataset
- The USC-SIPI Image Database: The USC-SIPI image database is a collection of digitized images. It is maintained primarily to support research in image processing, image analysis, and machine vision. The first edition of the USC-SIPI image database was distributed in 1977 and many new images have been added since then. http://sipi.usc.edu/database/
- CheXpert is a large dataset of chest X-rays and competition for automated chest x-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets. https://stanfordmlgroup.github.io/competitions/chexpert/
- PadChest: A large chest x-ray image dataset with multi-label annotated reports https://bimcv.cipf.es/bimcv-projects/padchest/
- Medicare Provider Payment Data https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier
- Warwick-QU dataset of the MICCAI challenge: https://warwick.ac.uk/fac/cross_fac/tia/data/glascontest/download
- General dataset aggregator/forum (cf. Kaggle) https://www.reddit.com/r/datasets/
Time-oriented Datasets (longitudinal data and time series)
Broad collection from the U.S. Bureau of Labor Statistics (according to the description, it also covers health aspects, psychological problems, etc.): The National Longitudinal Surveys (NLS) are a set of surveys designed to gather information at multiple points in time on the labor market activities and other significant life events of several groups of men and women. NLS data have served as an important tool for economists, sociologists, and other researchers for more than 50 years.
https://www.bls.gov/nls/home.htm
Same source but more precise: NLSY97 Data Overview
The NLSY97 consists of a nationally representative sample of 8,984 men and women born during the years 1980 through 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then. The ongoing cohort has been surveyed 18 times to date. Data are available from Round 1 (1997-98) through Round 18 (2017-18).
https://www.bls.gov/nls/nlsy97.htm
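The interview schedule described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that each round is labelled by its starting year (annual rounds from 1997 to 2011, then one round every two years); the function name `nlsy97_round_years` is hypothetical and not part of any BLS tool.

```python
from typing import List


def nlsy97_round_years(last_start_year: int = 2017) -> List[int]:
    """Return the assumed starting year of each NLSY97 round.

    Assumption: annual rounds 1997-2011, then one round every two years.
    """
    annual = list(range(1997, 2012))                       # 1997 ... 2011
    biennial = list(range(2013, last_start_year + 1, 2))   # 2013, 2015, ...
    return annual + biennial


rounds = nlsy97_round_years()
print(len(rounds))   # number of rounds through the 2017-18 round
print(rounds[-1])    # starting year of the latest round
```

Under these assumptions, 15 annual rounds plus 3 biennial rounds give the 18 rounds stated in the overview.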
The Longitudinal Studies of Aging (LSOAs) are a collaborative project of the National Center for Health Statistics (NCHS) and the National Institute on Aging (NIA). It is a multicohort study of persons 70 years of age and over designed primarily to measure changes in the health, functional status, living arrangements, and health services utilization of two cohorts of Americans as they move into and through the oldest ages.
https://www.cdc.gov/nchs/lsoa/index.htm
A more general but smaller collection of British longitudinal data sets:
https://ukdataservice.ac.uk/get-data/key-data/cohort-and-longitudinal-studies
https://www.ukdataservice.ac.uk/get-data/themes/health.aspx
English Longitudinal Study of Ageing
The English Longitudinal Study of Ageing (ELSA) study is a longitudinal survey of ageing and quality of life among older people. It explores the dynamic relationships between health and functioning, social networks and participation, and economic position as people plan for, move into and progress beyond retirement.
https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=200011
Health and Retirement Study
This is a longitudinal study that surveys thousands of Americans over the age of 50 every two years. It began in 1992. It looks in-depth at health, health insurance, work, retirement, income, wealth, family characteristics, and inter-generational transfers through extensive interviews with survey participants. Data products are freely available online to registered users.
https://hrsonline.isr.umich.edu/index.php
Duke Center for Study of Aging and Human Development
– This dataset includes 11,199 respondents who were interviewed in 2000 and followed up in the waves of 2002, 2005, 2008-09, 2011-12, and 2014. It records changes in their health conditions, family structures, sociodemographic characteristics, healthcare, and lifestyle, in addition to survival status (i.e., died, still alive, or lost to follow-up) at each subsequent wave for each re-visited respondent. The dataset also includes the date of death and nearly 40 other questions (including health status, living arrangement, healthcare expense, etc.) about the period before death, collected from the next of kin of respondents who died between two adjacent waves. Only the cross-sectional weight for 2000 is included in the dataset.
https://sites.duke.edu/centerforaging/programs/chinese-longitudinal-healthy-longevity-survey-clhls/cross-sectional-dataset/longitudinal-panel-datasets/
China Health Nutrition Survey
https://www.cpc.unc.edu/projects/china/data/datasets/longitudinal
Some older sources:
Harvard School of Public Health. Longitudinal Studies of Child Health and Development Records, 1918-2015 (inclusive), 1930-1989 (bulk) Dataverse
The Harvard School of Public Health Longitudinal Studies of Child Health and Development records, 1918-2015 (inclusive), 1930-1989 (bulk), are a collection of research data, administrative, and publishing records generated as a product of research by the Department of Maternal and Child Health on the health, physical development, and social functioning of a set of subjects from birth through adulthood.
https://dataverse.harvard.edu/dataverse/HSPH_LSCHD
With a stronger focus on economic issues:
Early Childhood Longitudinal Study Program Data
The ECLS program includes three longitudinal studies that examine child development, school readiness, and early school experiences. The data cover children’s status at birth and at various points thereafter; children’s transitions to nonparental care, early education programs, and school; and children’s experiences and growth.
https://nces.ed.gov/ecls/index.asp
The German Socio-Economic Panel (SOEP) is a longitudinal survey of approximately 11,000 private households in the Federal Republic of Germany from 1984 to 2018 and the eastern German länder from 1990 to 2018 (release February 2020). The database is produced by the Deutsches Institut für Wirtschaftsforschung (DIW), Berlin.
Variables include household composition, employment, occupation, earnings, health and satisfaction indicators.
https://www.eui.eu/Research/Library/ResearchGuides/Economics/Statistics/DataPortal/GSOEP
The Panel Study of Income Dynamics (PSID) is the longest running longitudinal household survey in the world. The study began in 1968 with a nationally representative sample of over 18,000 individuals living in 5,000 families in the United States. Information on these individuals and their descendants has been collected continuously, including data covering employment, income, wealth, expenditures, health, marriage, childbearing, child development, philanthropy, education, and numerous other topics. The PSID is directed by faculty at the University of Michigan, and the data are available on this website without cost to researchers and analysts.
https://psidonline.isr.umich.edu/
Longitudinal and demographic data (Focus on Australia)
https://libguides.library.qut.edu.au/c.php?g=428685&p=2923802
Some special Data Sets
[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:
https://www.1000genomes.org/ftpsearch
1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.
[8] Tiny Images dataset: The data set consists of over 79 million color images. They are stored in a 227 GB binary file. A Matlab toolbox to access the images is provided. Automatic annotation data is available for all images, but manual annotation data is only available for a smaller portion:
A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.
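The stated file size can be sanity-checked with simple arithmetic. The sketch below assumes that every image is 32 × 32 pixels with three one-byte color channels and that the images are stored back to back without a header; neither the record layout nor the helper names come from the description above, so treat them as assumptions.

```python
IMAGE_SIDE = 32      # assumed width and height in pixels
CHANNELS = 3         # assumed RGB, one byte per channel
BYTES_PER_IMAGE = IMAGE_SIDE * IMAGE_SIDE * CHANNELS  # 3072 bytes


def image_byte_range(index: int) -> tuple:
    """Return (start, end) byte offsets of image `index` in a flat file."""
    if index < 0:
        raise ValueError("index must be non-negative")
    start = index * BYTES_PER_IMAGE
    return start, start + BYTES_PER_IMAGE


def total_size_gib(n_images: int) -> float:
    """Total size in GiB (2**30 bytes) of n_images fixed-size records."""
    return n_images * BYTES_PER_IMAGE / 2**30


print(image_byte_range(0))                    # (0, 3072)
print(round(total_size_gib(79_000_000), 1))   # size for exactly 79 million images
```

Under these assumptions, 79 million records already occupy about 226 GiB, which is consistent with the quoted size for "over 79 million" images.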
[9] To become familiar with the wide variety of skin diseases, see this very informative collection of skin images provided by Healthline.com: https://www.healthline.com/health/skin-disorders
[10] Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers, of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients, of which 12 developed metastases within 5 years. The data contains 24188 gene expression levels. The general goal is predicting metastases to improve the therapy strategy:
https://www.stats.uwo.ca/faculty/aim/2015/9850/microarrays/FitMArray/chm/Veer.html
Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536.
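A quick calculation puts the quoted cohort sizes in perspective: a useful classifier should at least beat the accuracy of always predicting the more frequent outcome. The sketch below uses only the counts given in the description (78 patients with 34 metastases, and 19 patients with 12 metastases); the helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(n_total: int, n_positive: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_positive <= n_total")
    return max(n_positive, n_total - n_positive) / n_total


# Counts as quoted in the dataset description.
print(f"{majority_baseline(78, 34):.3f}")   # 44 of 78 without metastases
print(f"{majority_baseline(19, 12):.3f}")   # 12 of 19 with metastases
```

With 24188 expression levels and only 78 patients in the larger cohort, there are roughly 310 features per sample, so feature selection or strong regularization is usually needed before any such comparison is meaningful.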
[11] UCI Machine Learning Repository: The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, maintains 313 data sets as a service to the machine learning community:
https://archive.ics.uci.edu/ml/index.php
Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access to data sets, including data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities, is provided. The data gives general information about the quality of care in these facilities:
https://data.medicare.gov/
[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:
https://www.re3data.org/
Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11).
[14] Time series data, i.e. sequences of data points collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields, e.g. astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering:
https://www.cs.ucr.edu/~eamonn/time_series_data/
Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and knowledge discovery, 7, (4), 349-371.
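A common baseline on time series classification benchmarks of this kind is one-nearest-neighbour classification with Euclidean distance on z-normalized series. The sketch below is a minimal pure-Python version of that baseline on made-up toy series; it does not read the UCR file format, and all function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        return [0.0] * n   # constant series carries no shape information
    return [(x - mean) / std for x in series]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float]) -> str:
    """Return the label of the training series closest to the query."""
    q = z_normalize(query)
    best = min(train, key=lambda item: euclidean(z_normalize(item[0]), q))
    return best[1]


# Toy data (made up for illustration): a rising and a falling ramp.
train = [([1, 2, 3, 4], "rising"), ([4, 3, 2, 1], "falling")]
print(predict_1nn(train, [10, 20, 30, 45]))   # prints "rising"
```

Z-normalization makes the comparison depend on shape rather than on offset or scale, which is why the query with much larger values still matches the rising ramp.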
[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:
https://yann.lecun.com/exdb/mnist/
Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.
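The MNIST files are distributed in the simple IDX binary format: a big-endian header holding a magic number and the array dimensions, followed by the raw values. The sketch below parses such a buffer using only the standard library. It assumes the unsigned-byte data type used by MNIST and is exercised on a tiny synthetic buffer rather than on the real files; the function name `parse_idx_ubyte` is hypothetical.

```python
import struct
from typing import Tuple


def parse_idx_ubyte(buf: bytes) -> Tuple[Tuple[int, ...], bytes]:
    """Parse an IDX buffer of unsigned bytes into (shape, flat_data)."""
    if len(buf) < 4:
        raise ValueError("buffer too short for an IDX header")
    zero1, zero2, dtype_code, ndim = struct.unpack(">BBBB", buf[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX buffer: first two bytes must be zero")
    if dtype_code != 0x08:
        raise ValueError("only unsigned-byte IDX data is handled here")
    header_end = 4 + 4 * ndim
    # One big-endian 32-bit unsigned integer per dimension.
    shape = struct.unpack(">" + "I" * ndim, buf[4:header_end])
    expected = 1
    for dim in shape:
        expected *= dim
    data = buf[header_end:]
    if len(data) != expected:
        raise ValueError("payload size does not match header dimensions")
    return shape, data


# Synthetic example: two 2x2 "images" (not real MNIST data).
header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 2, 2)
shape, data = parse_idx_ubyte(header + bytes(range(8)))
print(shape)            # (2, 2, 2)
print(list(data[4:]))   # pixels of the second image: [4, 5, 6, 7]
```

For the real training images, the same function would be expected to report a shape of (60000, 28, 28) after decompressing the downloaded file, for example with `gzip.open(path, "rb").read()`.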
[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. The more than 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.
Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.
[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.
https://www.kaggle.com/
[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.
https://datahub.io/
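CKAN portals expose an HTTP Action API, in which dataset search is available through the `package_search` action. The sketch below only builds the request URL and does not contact any server. The base URL `https://ckan.example.org` is a placeholder, and whether a given portal actually runs CKAN has to be checked separately.

```python
from urllib.parse import urlencode


def ckan_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API URL for a full-text dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


# Placeholder portal address, used for illustration only.
print(ckan_search_url("https://ckan.example.org/", "health data", rows=5))
```

A successful response is a JSON object whose `result` field holds the match count and the list of matching datasets, so the URL can be passed to any HTTP client and the reply decoded with `json.loads`.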
[19] The goal of healthdata.gov is to make health data more accessible for research. It contains over 1800 datasets at the moment.
https://www.healthdata.gov/
[20] Socrata is a cloud software company which also provides open datasets on many different topics.
https://opendata.socrata.com/
General Life Sciences, Healthcare and Medical Datasets
- https://bchi.bigcitieshealth.org/indicators/1827/searches/22971 Big Cities Health Inventory Data Platform: Health data from 26 cities, for 34 health indicators, across 6 demographic indicators.
- https://catalog.data.gov/dataset/u-s-chronic-disease-indicators-cdi Chronic Disease Data: Data on chronic disease indicators throughout the US.
- https://www.mortality.org/ Human Mortality Database: Mortality and population data for over 35 countries.
- https://archive.ics.uci.edu/ml/datasets/MHEALTH+Dataset MHealth (Mobile Health) Dataset: Body motion and vital signs recordings for ten volunteers of diverse profile, while performing physical activities
- https://dbarchive.biosciencedbc.jp/index-e.html Life Science Database Archive: The archive maintains datasets generated by life scientists in Japan in a long-term and stable state as national public goods. It makes it easier for many people to search datasets by metadata in a unified format, and to access and download the datasets with clear terms of use.
Image Datasets for Life Sciences, Healthcare and Medicine
- http://www.oasis-brains.org/ OASIS: The Open Access Series of Imaging Studies (OASIS) is a project aimed at making neuroimaging datasets of the brain freely available to the scientific community. They compile and freely distribute neuroimaging datasets, with the hope of aiding future discoveries in basic and clinical neuroscience.
- https://openfmri.org/ OpenfMRI: Magnetic resonance imaging (MRI) datasets openly available to the research community.
- http://adni.loni.usc.edu/ ADNI: Alzheimer’s Disease Neuroimaging Initiative (ADNI) researchers collect several types of data from volunteer study participants. The data is available for free to authorized investigators, but requires an application and prior approval.
Genome Datasets
- https://portal.gdc.cancer.gov
- https://www.ncbi.nlm.nih.gov/gds GEO Datasets: This database stores curated gene expression datasets, as well as original series and platform records in the gene expression omnibus (GEO) repository.
- https://registry.opendata.aws/giab/ Genome in a Bottle: Dataset includes several reference genomes to enable translation of whole human genome sequencing to clinical practice.
Hospital Datasets
- https://hcup-us.ahrq.gov/databases.jsp Healthcare Cost and Utilization Project (HCUP): Datasets contain encounter-level information on inpatient stays, emergency department visits, and ambulatory surgery in US hospitals.
- https://mimic.physionet.org/ MIMIC Critical Care Database: MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with approximately 40,000 critical care patients. The dataset includes demographics, vital signs, laboratory tests, medications, and more.
Cancer Datasets
- http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi BROAD Institute Cancer Program Datasets: Data categorized by project such as brain cancer, leukemia, melanoma, etc.
XAI evaluation datasets
- Evaluation of post-hoc XAI approaches through synthetic tabular data. Tritscher, Julian; Ring, Markus; Schlör, Daniel; Hettinger, Lena; Hotho, Andreas in International Symposium on Methodologies for Intelligent Systems (2020). https://www.informatik.uni-wuerzburg.de/datascience/projects/deepscan/xai-eval-data/
- KANDINSKYPatterns – An experimental exploration environment for Pattern Analysis and Machine Intelligence. Holzinger, A., Saranti, A. & Mueller, H. (2021).