From Agriculture to Zoology, there is a wealth of open data in virtually every discipline that can be used for research & development in Human-Centered AI and Machine Learning.

The Human-Centered AI Lab (Holzinger Group) fully supports the “open” movement, i.e. open access, open source and open data. The idea of “open data” is not new. Many researchers have long held that science is a public enterprise and that certain data should be openly available [1], [2]. For example, the British Medical Journal (BMJ) launched a major open data campaign in 2012 [3]. With the launch of open government data initiatives the open data movement gained momentum [4], and some already speak of an Open Knowledge Foundation [5]. On this page we provide our students with a collection of links to selected open data sets that have an impact on human life, sorted alphabetically from Agriculture to Zoology.

[1] L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.

[2] G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.

[3] M. Thompson and C. Heneghan, “BMJ OPEN DATA CAMPAIGN We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.

[4] N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.

[5] J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” Plos Biology, vol. 9, Dec 2011.

Agriculture:

Climate:

Forestry:

  • Diez et al, Deep Learning in Forestry Using UAV-Acquired RGB Data: A Practical Review, Remote Sensing 2021 https://doi.org/10.3390/rs13142837; surveys and lists available datasets
  • Blackard/Dean/Anderson, Covertype DataSet: Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. https://archive.ics.uci.edu/ml/datasets/covertype (see also as a supplement Iyim, Predicting Forest Cover Types with the Machine Learning Workflow (2020) https://towardsdatascience.com/predicting-forest-cover-types-with-the-machine-learning-workflow-1f6f049bf4df)
  • Mendeley Data, See the forest and the trees: Effective machine and deep learning algorithms for wood filtering and tree species classification from terrestrial laser scanning; dataset includes 45 natural forest scan clips with manually labelled classes (1: stem, 2: branch, 3: other). https://data.mendeley.com/datasets/4gbzk9sy24/1
  • Melander/Einola/Ritala, Fusion of open forest data and machine fieldbus data for performance analysis of forest machines, European Journal of Forest Research 2020, https://doi.org/10.1007/s10342-019-01237-8; surveys (and combines) datasets available via the Finnish open source platform
  • Cortez/Morais, A Data Mining Approach to Predict Forest Fires using Meteorological Data (2007); paper: http://www3.dsi.uminho.pt/pcortez/fires.pdf; dataset: https://archive.ics.uci.edu/ml/datasets/forest+fires; includes data on date, spatial coordinates, temperature, wind, rain, humidity etc
  • [requires registration] Planet, Access high-resolution satellite monitoring of the tropics to reduce and reverse tropical forest loss. Provides mosaic data covering tropical forested regions between 30 degrees North and 30 degrees South. https://www.planet.com/nicfi/
  • [Competition Kaggle Forest Cover Type Prediction (2014): In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data); includes datasets (using the dataset from Blackard/Dean/Anderson mentioned above): The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations). https://www.kaggle.com/c/forest-cover-type-prediction/overview]
  • [Competition Kaggle, Planet: Understanding the Amazon from Space. Use satellite data to track the human footprint in the Amazon rainforest (2017); includes training data i.e. imagery of the Amazon basin captured by Planet’s Flock 2 satellites between January 1st, 2016, and February 1st, 2017. The images contain the visible red (R), green (G), and blue (B) and near-infrared (NIR) bands. https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/overview; see as supplementary material Di Martino, Monitoring deforestation with open data and Machine Learning — Part 1 (2021) https://medium.com/digital-sense-ai/monitoring-deforestation-with-open-data-and-machine-learning-part-1-24d29c346752]
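
The workflow described for the Forest Cover Type Prediction competition (fit on rows that carry Cover_Type, then predict Cover_Type for every feature-only test row) can be sketched as follows. This is only a minimal illustration under stated assumptions: the rows are tiny synthetic stand-ins, the `Id` column and the choice of `Elevation` and `Slope` as features are assumptions for the example, and the nearest-centroid rule is a deliberately simple placeholder rather than a recommended model.

```python
import csv
import io
from collections import defaultdict

# Synthetic stand-ins for the competition's training and test files.
# All values are made up; only the Cover_Type column name comes from the task text.
TRAIN_CSV = """Id,Elevation,Slope,Cover_Type
1,2600,5,2
2,2650,7,2
3,3300,20,7
4,3350,22,7
"""
TEST_CSV = """Id,Elevation,Slope
5,2620,6
6,3320,21
"""

FEATURES = ["Elevation", "Slope"]  # assumed feature subset for illustration


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def fit_centroids(rows, features, target):
    """Compute the mean feature vector of each class label."""
    sums = defaultdict(lambda: [0.0] * len(features))
    counts = defaultdict(int)
    for row in rows:
        label = row[target]
        counts[label] += 1
        for i, name in enumerate(features):
            sums[label][i] += float(row[name])
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, row, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    point = [float(row[name]) for name in features]

    def squared_distance(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)
centroids = fit_centroids(train_rows, FEATURES, "Cover_Type")

# One (Id, predicted Cover_Type) pair per test row.
submission = [(row["Id"], predict(centroids, row, FEATURES)) for row in test_rows]
```

On the real data the features have very different scales, so a distance-based rule like this one would need feature scaling before it could serve even as a baseline.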

Law:

Zoology:

  • Metalist: Wikipedia List of datasets for machine-learning research, Biological Data/Animal, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research#Animal
  • LILA BC; Labeled Information Library of Alexandria: Biology and Conservation: a repository for data sets related to biology and conservation, intended as a resource for both machine learning (ML) researchers and those that want to harness ML for biology and conservation; diverse datasets e.g. camera traps, seals, bees and pollen, seabirds, zebras and giraffes; https://lila.science/datasets
    • e.g. Swanson et al, Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna, Scientific Data 2015, https://www.nature.com/articles/sdata201526; dataset: Snapshot Serengeti contains approximately 2.65M sequences of camera trap images, totalling 7.1M images; labels are provided for 61 categories, primarily at the species level (for example, the most common labels are wildebeest, zebra, and Thomson’s gazelle); https://lila.science/datasets/snapshot-serengeti
  • Clark/Schreter/Adams, A quantitative comparison of dystal and backpropagation, Proceedings of 1996 Australian Conference on Neural Networks (1996); dataset: Abalone Data Set; physical measurements of abalone used to predict their age; weather patterns and location are not included; https://archive.ics.uci.edu/ml/datasets/abalone
  • Jiang/Zhou, Editing Training Data for kNN Classifiers with Neural Network Ensemble, Advances in Neural Networks 2004, 356–361; dataset: Zoo Data Set, seven classes of animals, containing 17 Boolean-valued attributes; https://archive.ics.uci.edu/ml/datasets/zoo
  • Ontañón/Plaza, On Similarity Measures Based on a Refinement Lattice, in McGinty/Wilson (eds), Case-Based Reasoning Research and Development. 8th International Conference on Case-Based Reasoning, ICCBR 2009 Seattle, WA, USA, July 20-23, 2009 Proceedings (2009) 240-255; dataset: Demospongiae Data Set, dataset contains 503 sponges belonging to the Demospongiae class collected from the Mediterranean (451 sponges) and the Atlantic (52 sponges); each sponge is classified according to a hierarchy formed by order, family, genus and species; each order is subdivided into several families, each family into several genera, and each genus into several species; https://archive.ics.uci.edu/ml/datasets/Demospongiae
  • Parkhi et al, Cats and Dogs, IEEE Conference on Computer Vision and Pattern Recognition 2012, https://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/; dataset: The Oxford-IIIT Pet Dataset: a 37-category pet dataset with roughly 200 images for each class; https://www.robots.ox.ac.uk/~vgg/data/pets/
  • Van Horn et al, The iNaturalist Species Classification and Detection Dataset, https://arxiv.org/abs/1707.06642; dataset: contains 675,170 training and validation images from 5,089 natural fine-grained categories; those categories belong to 13 super-categories including Plantae (Plant), Insecta (Insect), Aves (Bird), Mammalia (Mammal) etc; https://paperswithcode.com/dataset/inaturalist
  • Welinder et al, Caltech-UCSD Birds 200, California Institute of Technology http://www.vision.caltech.edu/visipedia/papers/WelinderEtal10_CUB-200.pdf; dataset: Caltech-UCSD Birds 200 is an image dataset with photos of 200 bird species (mostly North American); http://www.vision.caltech.edu/visipedia/CUB-200.html
  • Xian et al, Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly; IEEE Transactions on Pattern Analysis and Machine Intelligence 2019, 2251-2265, 10.1109/TPAMI.2018.2857768; dataset: animals with attributes 2; dataset consists of 37322 images of 50 animal classes with pre-extracted feature representations for each image; https://cvml.ist.ac.at/AwA2/
  • data.world, Measurements Used to Determine the Sex of Bristle-thighed Curlews (Numenius tahitiensis); https://data.world/us-doi-gov/bc0dcabe-a455-4b33-baf1-f17778008f1b
  • data.world, Daily survival rates of grassland passerines and associated weather variables; https://data.world/us-doi-gov/b28a0d9c-9aef-4571-9a08-12c817740985
  • data.world, Zoo Animal Lifespans, Life expectancy estimates for North American zoo and aquarium vertebrate animals; https://data.world/animals/zoo-animal-lifespans
  • Kaggle, Zoo Animal Classification, consists of 101 animals from a zoo (there are 16 variables with various traits to describe the animals; the 7 class types are: mammal, bird, reptile, fish, amphibian, bug and invertebrate); https://www.kaggle.com/uciml/zoo-animal-classification
  • Kaggle, Competition; The Nature Conservancy Fisheries Monitoring: eight target categories are available in this dataset: Albacore tuna, Bigeye tuna, Yellowfin tuna, Mahi Mahi, Opah, Sharks, Other (meaning that there are fish present but not in the above categories), and No Fish (meaning that no fish is in the picture); each image has only one fish category, except that there are sometimes very small fish in the pictures that are used as bait; https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data
  • Kaggle, 10 Monkey Species: image dataset for fine-grain classification; https://www.kaggle.com/slothkong/10-monkey-species
  • Kaggle, Animals-10: animal pictures of 10 different categories taken from google images; contains about 28K medium quality animal images belonging to 10 categories: dog, cat, horse, spider, butterfly, chicken, sheep, cow, squirrel, elephant; https://www.kaggle.com/alessiocorrado99/animals10
  • Kaggle, STL-10 Image Recognition Dataset: train models to recognize different animals and vehicles; with a corpus of 100,000 unlabeled images and 500 labeled training images per class; https://www.kaggle.com/jessicali9530/stl10
  • Kaggle, bird species classification: interspecies classification of species in high resolution images; https://www.kaggle.com/akash2907/bird-species-classification
  • [Martineau et al, A survey on image-based insect classification, Pattern Recognition 2017, 273-284, https://doi.org/10.1016/j.patcog.2016.12.020; surveys a range of entomology datasets (not all are publicly available)]
  • [Higuera/Gardiner/Cios, Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome, PlosOne 2015, https://doi.org/10.1371/journal.pone.0129126; dataset: Mice Protein Expression Data Set; expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning; https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression]

Data repositories of general and public interest

  • GenBank: GenBank is a genetic sequence database from the National Institutes of Health, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
    https://www.ncbi.nlm.nih.gov/genbank/
  • EMBL: The European Bioinformatics Institute is part of the European Molecular Biology Laboratory and maintains the world’s most comprehensive range of freely available and up-to-date molecular databases. Its services let users share data, perform complex queries and analyse the results. Anyone can download data and software, or use the web services. More information is available in the journal Nucleic Acids Research.
    https://www.ebi.ac.uk/services
  • HMCA: The Health and Medical Care Archive is a data archive of the Robert Wood Johnson Foundation that preserves and disseminates data collected by selected research projects and facilitates secondary analyses of the data. The data collections in HMCA include surveys of health care professionals and organizations, investigations of access to medical care, surveys on substance abuse, and evaluations of innovative programs for the delivery of health care. Their goal is to increase understanding of health and health care in the United States through secondary analysis.
    https://www.icpsr.umich.edu/icpsrweb/HMCA/index.jsp
  • WHO: Provides datasets based on global health priorities. The site offers a simple search and provides topic insights along with the datasets.
    http://apps.who.int/gho/data/node.resources
  • CDC: Use this for US-specific public health data. The CDC maintains WONDER (Wide-ranging Online Data for Epidemiological Research); its data sets are searchable by topic, state, and other factors.
    https://wonder.cdc.gov/Welcome.html

Data repositories specialized by data types:

  • UniProtKB/Swiss-Prot: is a manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB).
    It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions.
    https://www.uniprot.org/uniprot/
  • MMMP: is an open access interactive multidatabase for research on melanoma biology and treatment.
    https://www.mmmp.org/MMMP/
  • KEGG: is a database for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
    https://www.genome.jp/kegg/
  • PDB: Since 1971, the Protein Data Bank archive has served as a repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies. The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community.
    https://www.wwpdb.org/

Data repositories specialized by organism:

  • WormBase: is an international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes. Founded in 2000, the WormBase Consortium is led by Paul Sternberg of CalTech, Paul Kersey of the EBI, Matt Berriman of the Wellcome Trust Sanger Institute, and Lincoln Stein of the Ontario Institute for Cancer Research.
    https://www.wormbase.org
  • FlyBase: is a data repository project carried out by a consortium of Drosophila researchers and computer scientists at: Harvard University, University of Cambridge (UK), Indiana University and the University of New Mexico.
    https://flybase.org/
  • Human Brain NeuroMorpho: is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 100 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. To date, NeuroMorpho.Org is the largest collection of publicly accessible 3D neuronal reconstructions and associated metadata.
    https://neuromorpho.org/

Data Repositories for Scientific Research

  • http://www.chdstudies.org/research/information_for_researchers.php CHDS: The Child Health and Development Studies datasets are intended for research into how disease and health pass down through generations. They support research not just into genomic expression but also into how social, environmental, and cultural factors play into disease and health.
  • http://leo.ugr.es/elvira/DBCRepository/ Kent Ridge Biomedical Datasets: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).
  • https://www.kaggle.com/c/MerckActivity/data Merck Molecular Activity Challenge: Datasets designed to foster the machine learning pursuit of drug discovery by simulating how molecule combinations could interact with each other.
  • https://seer.cancer.gov/explorer/ SEER: Datasets arranged by demographic groups and provided by the US government. You can search based on age, race, and gender.

Image Data sets:

  • CT Medical Images: A small dataset, but specifically cancer-related. It contains labeled images with age, modality, and contrast tags. High-quality labeled images of this kind may help speed breakthroughs.
  • Deep Lesion: One of the largest image sets currently available. CT images released by the NIH to help improve the accuracy of lesion documentation and diagnosis. It includes over 32,000 lesions from 4,000 unique patients.
  • Cancer Instance Segmentation and Classification
    https://www.kaggle.com/andrewmvd/cancer-inst-segmentation-and-classification
  • EchoNet-Dynamic. A Large New Cardiac Motion Video Data Resource for Medical Machine Learning (requires registration) https://echonet.github.io/dynamic/index.html
  • MedPix® is a free open-access online database of medical images, teaching cases, and clinical topics, integrating images and textual metadata including over 12,000 patient case scenarios, 9,000 topics, and nearly 59,000 images. Our primary target audience includes physicians and nurses, allied health professionals, medical students, nursing students and others interested in medical knowledge. https://medpix.nlm.nih.gov/home
  • Functional MRI images for 539 individuals with ASD and 573 typical controls. These 1112 datasets are composed of structural and resting state functional MRI data along with an extensive array of phenotypic information. Requires registration. http://fcon_1000.projects.nitrc.org/indi/abide/
  • Cancer Imaging Archive (with numerous sub-collections): https://www.cancerimagingarchive.net/
  • The DRIVE database is for comparative studies on segmentation of blood vessels in retinal images. It consists of 40 photographs, 7 of which show signs of mild early diabetic retinopathy https://drive.grand-challenge.org/
  • ISIC Archive – Melanoma: This archive contains 23k images of classified skin lesions. It contains both malignant and benign examples. https://www.isic-archive.com/#!/topWithHeader/wideContentTop/main
  • The Sunnybrook Cardiac Data (SCD), also known as the 2009 Cardiac MR Left Ventricle Segmentation Challenge data, consist of 45 cine-MRI images from a mix of patients and pathologies: healthy, hypertrophy, heart failure with infarction and heart failure without infarction. A subset of this data set was first used in the automated myocardium segmentation challenge from short-axis MRI, held by a MICCAI workshop in 2009. The complete data set is now available in the CAP database with a public domain license. http://www.cardiacatlas.org/studies/sunnybrook-cardiac-data/
  • DDSM: The Digital Database for Screening Mammography (DDSM) is a resource for use by the mammographic image analysis research community. http://www.eng.usf.edu/cvprg/Mammography/Database.html
  • The NLM Visible Human Project has created publicly-available complete, anatomically detailed, three-dimensional representations of a human male body and a human female body. Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man data set was publicly released in 1994 and the Visible Woman in 1995. The data sets were designed to serve as (1) a reference for the study of human anatomy, (2) public-domain data for testing medical imaging algorithms, and (3) a test bed and model for the construction of network-accessible image libraries. The VHP data sets have been applied to a wide range of educational, diagnostic, treatment planning, virtual reality, artistic, mathematical, and industrial uses. About 4,000 licensees from 66 countries were authorized to access the datasets. As of 2019, a license is no longer required to access the VHP datasets. https://www.nlm.nih.gov/research/visible/visible_human.html
  • The mini-MIAS database of mammograms http://peipa.essex.ac.uk/info/mias.html
  • Prostate Cancer Data Set: http://i2cvb.github.io/
  • Multiple Data sets: Lesion Segmentation in Multiple Sclerosis, x-rays, ultrasound images of the carotid, Datasets (ucy.ac.cy)
  • Via Group Public Databases: lung CT images in the DICOM format together with documentation of abnormalities by radiologists VIA (cornell.edu)
  • SCR database: Segmentation in Chest Radiographs http://www.isi.uu.nl/Research/Databases/SCR/
  • Histology (CIMA) dataset: The dataset consists of 2D histological microscopy tissue slices, stained with different stains, and landmarks denoting key-points in each slice https://cmp.felk.cvut.cz/~borovji3/?page=dataset
  • The USC-SIPI Image Database: The USC-SIPI image database is a collection of digitized images. It is maintained primarily to support research in image processing, image analysis, and machine vision. The first edition of the USC-SIPI image database was distributed in 1977 and many new images have been added since then. http://sipi.usc.edu/database/
  • CheXpert is a large dataset of chest X-rays and competition for automated chest x-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets. https://stanfordmlgroup.github.io/competitions/chexpert/
  • PadChest: A large chest x-ray image dataset with multi-label annotated reports https://bimcv.cipf.es/bimcv-projects/padchest/
  • Medicare Provider Payment Data https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier
  • Warwick-QU dataset of the MICCAI challenge: https://warwick.ac.uk/fac/cross_fac/tia/data/glascontest/download
  • General dataset aggregator/forum (cf. Kaggle) https://www.reddit.com/r/datasets/

Time-oriented Datasets (longitudinal data and time series)

Broad collection from the U.S. Bureau of Labor Statistics (according to the description, it also covers health aspects, psychological problems, etc.): The National Longitudinal Surveys (NLS) are a set of surveys designed to gather information at multiple points in time on the labor market activities and other significant life events of several groups of men and women. NLS data have served as an important tool for economists, sociologists, and other researchers for more than 50 years.
https://www.bls.gov/nls/home.htm

Same source but more precise: NLSY97 Data Overview
The NLSY97 consists of a nationally representative sample of 8,984 men and women born during the years 1980 through 1984 and living in the United States at the time of the initial survey in 1997. Participants were ages 12 to 16 as of December 31, 1996. Interviews were conducted annually from 1997 to 2011 and biennially since then. The ongoing cohort has been surveyed 18 times to date. Data are available from Round 1 (1997-98) through Round 18 (2017-18).
https://www.bls.gov/nls/nlsy97.htm
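
The round numbering and the age range quoted above follow from simple arithmetic, which the short sketch below reproduces. It assumes that each round is identified by the year in which it starts and that the switch to a two-year interval happens right after the 2011 round; the function name and its parameters are illustrative only.

```python
def nlsy97_round_years(total_rounds=18, first_year=1997, last_annual_year=2011):
    """Return the starting year of each survey round.

    Rounds are one year apart up to and including `last_annual_year`
    and two years apart afterwards (assumed reading of the description).
    """
    years = []
    year = first_year
    while total_rounds > len(years):
        years.append(year)
        # Annual spacing until the last annual round, biennial afterwards.
        year += 1 if last_annual_year > year else 2
    return years


round_years = nlsy97_round_years()

# Age on December 31, 1996 for each birth year 1980 through 1984.
ages_end_of_1996 = {birth_year: 1996 - birth_year for birth_year in range(1980, 1985)}
```

Here `round_years` holds 15 annual rounds (1997 to 2011) followed by 2013, 2015 and 2017, which matches Round 18 (2017-18), and the computed ages span 12 to 16.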

The Longitudinal Studies of Aging (LSOAs) is a collaborative project of National Center for Health Statistics (NCHS) and the National Institute on Aging (NIA). It is a multicohort study of persons 70 years of age and over designed primarily to measure changes in the health, functional status, living arrangements, and health services utilization of two cohorts of Americans as they move into and through the oldest ages.
https://www.cdc.gov/nchs/lsoa/index.htm

A more general but smaller collection of British longitudinal data sets:
https://ukdataservice.ac.uk/get-data/key-data/cohort-and-longitudinal-studies
https://www.ukdataservice.ac.uk/get-data/themes/health.aspx

English Longitudinal Study of Ageing
The English Longitudinal Study of Ageing (ELSA) study is a longitudinal survey of ageing and quality of life among older people. It explores the dynamic relationships between health and functioning, social networks and participation, and economic position as people plan for, move into and progress beyond retirement.
https://beta.ukdataservice.ac.uk/datacatalogue/series/series?id=200011

Health and Retirement Study
This is a longitudinal study that surveys thousands of Americans over the age of 50 every two years. It began in 1992. It looks in-depth at health, health insurance, work, retirement, income, wealth, family characteristics, and inter-generational transfers through extensive interviews with survey participants. Data products are freely available online to registered users.
https://hrsonline.isr.umich.edu/index.php

Duke Center for Study of Aging and Human Development
This dataset includes 11,199 respondents who were interviewed in 2000, and their follow-ups in the waves of 2002, 2005, 2008-09, 2011-12, and 2014. The dataset includes changes in their health conditions, family structures, sociodemographic characteristics, healthcare, and lifestyle in addition to survival status (i.e., died, still alive, and lost to follow-up) at each subsequent wave for each re-visited respondent. The dataset also includes the date of death and nearly 40 other questions (including health status, living arrangement, healthcare expense, etc.) about the period before death, collected from the next-of-kin of deceased respondents who died between two adjacent waves. Only the cross-sectional weight for 2000 is included in the dataset.
https://sites.duke.edu/centerforaging/programs/chinese-longitudinal-healthy-longevity-survey-clhls/cross-sectional-dataset/longitudinal-panel-datasets/

China Health Nutrition Survey
https://www.cpc.unc.edu/projects/china/data/datasets/longitudinal

Some older sources:

Harvard School of Public Health. Longitudinal Studies of Child Health and Development Records, 1918-2015 (inclusive), 1930-1989 (bulk) Dataverse
The Harvard School of Public Health Longitudinal Studies of Child Health and Development records, 1918-2015 (inclusive), 1930-1989 (bulk) is a collection of research data, administrative, and publishing records generated as a product of research by the Department of Maternal and Child Health on the health, physical development, and social functioning of a set of subjects from birth through adulthood.
https://dataverse.harvard.edu/dataverse/HSPH_LSCHD

With a stronger focus on economic issues:

Early Childhood Longitudinal Study Program Data
The ECLS program includes three longitudinal studies that examine child development, school readiness, and early school experiences. Data on children’s status at birth and at various points thereafter; children’s transitions to nonparental care, early education programs, and school; and children’s experiences and growth.
https://nces.ed.gov/ecls/index.asp

The German Socio-Economic Panel (SOEP) is a longitudinal survey of approximately 11,000 private households in the Federal Republic of Germany from 1984 to 2018 and the eastern German Länder from 1990 to 2018 (release February 2020). The database is produced by the Deutsches Institut für Wirtschaftsforschung (DIW), Berlin.
Variables include household composition, employment, occupation, earnings, health and satisfaction indicators.
https://www.eui.eu/Research/Library/ResearchGuides/Economics/Statistics/DataPortal/GSOEP

The Panel Study of Income Dynamics (PSID) is the longest running longitudinal household survey in the world. The study began in 1968 with a nationally representative sample of over 18,000 individuals living in 5,000 families in the United States. Information on these individuals and their descendants has been collected continuously, including data covering employment, income, wealth, expenditures, health, marriage, childbearing, child development, philanthropy, education, and numerous other topics. The PSID is directed by faculty at the University of Michigan, and the data are available on this website without cost to researchers and analysts.
https://psidonline.isr.umich.edu/

Longitudinal and demographic data (Focus on Australia)
https://libguides.library.qut.edu.au/c.php?g=428685&p=2923802

Some special Data Sets

[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:
https://www.1000genomes.org/ftpsearch

1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.

[8] Tiny Images dataset: The data set consists of over 79 million color images. They are stored in a 227 GB binary file. A Matlab toolbox to access the images is provided. Automatic annotation data is available for all images, but manual annotation data is only available for a smaller portion:

A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.
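
The quoted file size is consistent with fixed-size uncompressed records, which also makes random access by index straightforward. The sketch below works through that arithmetic under assumptions that are not stated on this page: 32 x 32 pixel images with three 8-bit color channels, no file header, and an exact count of 79,302,017 images. The helper names are hypothetical, and the pixel ordering inside a record is deliberately left uninterpreted.

```python
import io

IMAGE_SIDE = 32            # assumption: images are 32 x 32 pixels
CHANNELS = 3               # assumption: three 8-bit color channels
BYTES_PER_IMAGE = IMAGE_SIDE * IMAGE_SIDE * CHANNELS  # 3072 bytes per record
NUM_IMAGES = 79_302_017    # assumption: exact image count

# Expected file size in binary gigabytes (2**30 bytes) if records are packed back to back.
total_size_gib = NUM_IMAGES * BYTES_PER_IMAGE / 2**30


def image_offset(index):
    """Byte offset of the record with the given zero-based index."""
    if index not in range(NUM_IMAGES):
        raise IndexError(f"image index out of range: {index}")
    return index * BYTES_PER_IMAGE


def read_image(fileobj, index):
    """Read one raw record from a binary file object opened for reading."""
    fileobj.seek(image_offset(index))
    record = fileobj.read(BYTES_PER_IMAGE)
    if len(record) != BYTES_PER_IMAGE:
        raise EOFError(f"record {index} is incomplete or missing")
    return record


# Demonstration on an in-memory stand-in holding three synthetic records.
demo_file = io.BytesIO(b"".join(bytes([value]) * BYTES_PER_IMAGE for value in (0, 1, 2)))
second_record = read_image(demo_file, 1)
```

Under these assumptions `total_size_gib` comes out just under 227, in line with the size given above.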

[9] To become familiar with the wide range of skin diseases, see this very informative collection of skin images provided by Healthline.com: https://www.healthline.com/health/skin-disorders

[10] Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients of which 12 developed metastases within 5 years. The data contains 24188 gene expression levels. The general goal is to predict metastases in order to improve the therapy strategy:
https://www.stats.uwo.ca/faculty/aim/2015/9850/microarrays/FitMArray/chm/Veer.html

Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536.

[11] UCI Machine Learning Repository: the Center for Machine Learning and Intelligent Systems at the University of California, Irvine maintains 313 data sets as a service to the machine learning community:
https://archive.ics.uci.edu/ml/index.php

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access to data sets, including data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities is provided. The data gives general information about the quality of care in these facilities:
https://data.medicare.gov/

[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:
https://www.re3data.org/

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11).

[14] Time series data, i.e. sequences of data points collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields such as astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering:
https://www.cs.ucr.edu/~eamonn/time_series_data/

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery, 7, (4), 349-371.

[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:
https://yann.lecun.com/exdb/mnist/

Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.
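
The MNIST files are distributed in the simple binary IDX format: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension and then the raw data. Below is a minimal sketch of a reader for the unsigned-byte case, using only the Python standard library. The function name `parse_idx` and the tiny synthetic buffer are illustrative assumptions, not part of any official tooling:

```python
import struct


def parse_idx(buf):
    """Parse an unsigned-byte IDX buffer into (dims, flat data bytes)."""
    # Magic number: two zero bytes, data-type code, number of dimensions.
    zero, dtype, ndim = struct.unpack(">HBB", buf[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # One big-endian unsigned 32-bit size per dimension.
    dims = struct.unpack(">" + "I" * ndim, buf[4:4 + 4 * ndim])
    count = 1
    for d in dims:
        count *= d
    data = buf[4 + 4 * ndim:]
    if len(data) != count:
        raise ValueError("data length does not match header")
    return dims, data


# Synthetic example: two 2x2 "images" (real MNIST images are 28x28).
header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2)
pixels = bytes([0, 255, 255, 0, 10, 20, 30, 40])
dims, data = parse_idx(header + pixels)
print(dims, list(data[:4]))
```

The downloadable files are gzip-compressed, so in practice the bytes would first be read via `gzip.open(path, "rb")` before being passed to such a parser.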

[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. Its more than 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.

Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.

[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.
https://www.kaggle.com/

[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.
https://datahub.io/

[19] The goal of healthdata.gov is to make health data more accessible for research. It currently contains over 1800 datasets.
https://www.healthdata.gov/

[20] Socrata is a cloud software company that also provides open datasets on many different topics.
https://opendata.socrata.com/

General Life Sciences, Healthcare and Medical Datasets

Image Datasets for Life Sciences, Healthcare and Medicine

  • http://www.oasis-brains.org/ OASIS: The Open Access Series of Imaging Studies (OASIS) is a project aimed at making neuroimaging datasets of the brain freely available to the scientific community. They compile and freely distribute neuroimaging datasets, with the hope of aiding future discoveries in basic and clinical neuroscience.
  • https://openfmri.org/ OpenfMRI: Magnetic resonance imaging (MRI) datasets openly available to the research community.
  • http://adni.loni.usc.edu/ ADNI: Alzheimer’s Disease Neuroimaging Initiative (ADNI) researchers collect several types of data from volunteer study participants. The data is available for free to authorized investigators, but requires an application and prior approval.

Genome Datasets

Hospital Datasets

  • https://hcup-us.ahrq.gov/databases.jsp Healthcare Cost and Utilization Project (HCUP): Datasets contain encounter-level information on inpatient stays, emergency department visits, and ambulatory surgery in US hospitals.
  • https://mimic.physionet.org/ MIMIC Critical Care Database: MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with approximately 40,000 critical care patients. The dataset includes demographics, vital signs, laboratory tests, medications, and more.

Cancer Datasets

XAI evaluation datasets

  • Evaluation of post-hoc XAI approaches through synthetic tabular data. Tritscher, Julian; Ring, Markus; Schlör, Daniel; Hettinger, Lena; Hotho, Andreas in International Symposium on Methodologies for Intelligent Systems (2020).
    https://www.informatik.uni-wuerzburg.de/datascience/projects/deepscan/xai-eval-data/
  • KANDINSKYPatterns – An experimental exploration environment for Pattern Analysis and Machine Intelligence. Holzinger, A., Saranti, A. & Mueller, H. (2021).