The future is in Open Data Sets

The idea of “open data” is not new. Many researchers have long held that science is a public enterprise and that certain data should be openly available [1]. More recently, open data has also become a major topic in the biomedical domain [2], [3]; for example, the British Medical Journal (BMJ) started an open data campaign [4]. The goals of the movement are similar to those of open source, open content and open access. With the launch of open government data initiatives the movement gained momentum [5], and the Open Knowledge Foundation argues that open data means better science [6]. Consequently, there are plenty of research challenges on this topic. Cancer research, for example, could benefit dramatically from science without boundaries.

[1]   L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.

[2]  G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.

[3]   A. Hersey, S. Senger, and J. P. Overington, “Open data for drug discovery: learning from the biological community,” Future Medicinal Chemistry, vol. 4, pp. 1865-1867, Oct 2012.

[4]  M. Thompson and C. Heneghan, “BMJ open data campaign: We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.

[5]  N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.

[6]   J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” PLoS Biology, vol. 9, Dec 2011.

Here are some sample data sets:

[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:
https://www.1000genomes.org/ftpsearch

1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.

[8] Tiny Images dataset: The data set consists of over 79 million color images, stored in a single 227 GB binary file. A Matlab toolbox for accessing the images is provided. Automatic annotation data is available for all images, but manual annotation data is available only for a smaller portion.

A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.
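Random access into a file of this size is simple if every image occupies a fixed number of bytes. The sketch below assumes 32 x 32 pixel images with three one-byte color channels, stored back to back without a file header. That layout and all helper names are assumptions made for illustration, not the documented format of the Matlab toolbox.

```python
# Assumed record layout (not taken from the toolbox documentation):
# 32 x 32 pixels, 3 color channels, 1 byte per channel, no file header.
WIDTH, HEIGHT, CHANNELS = 32, 32, 3
BYTES_PER_IMAGE = WIDTH * HEIGHT * CHANNELS  # 3072 bytes per image


def image_offset(index: int) -> int:
    """Byte offset of the zero-based image `index` in the binary file."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return index * BYTES_PER_IMAGE


def read_image(path: str, index: int) -> bytes:
    """Read the raw bytes of one image without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(image_offset(index))
        data = f.read(BYTES_PER_IMAGE)
    if len(data) != BYTES_PER_IMAGE:
        raise IndexError("image index is beyond the end of the file")
    return data


def total_size_gib(n_images: int) -> float:
    """Size in GiB (2**30 bytes) of a file holding `n_images` records."""
    return n_images * BYTES_PER_IMAGE / 2**30
```

Under these assumptions, `total_size_gib(79_000_000)` comes out at roughly 226, which is in line with the file size quoted above.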

[9] For an impression of the wide variety of skin diseases, Healthline.com provides a very informative collection of skin images: https://www.healthline.com/health/skin-disorders

[10]  Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers, of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients, of which 12 developed metastases within 5 years. The data contain 24188 gene expression levels. The general goal is to predict metastases in order to improve the therapy strategy:
https://www.stats.uwo.ca/faculty/aim/2015/9850/microarrays/FitMArray/chm/Veer.html

Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536. 
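The patient counts quoted above already say something about how hard the prediction task is. The following sketch only does the arithmetic on those counts: it computes the share of patients with metastases in the 78 patient group and in the 19 patient group, and the accuracy of a trivial classifier that always predicts the majority class. The function and variable names are made up for illustration.

```python
def majority_baseline(n_total: int, n_positive: int) -> float:
    """Accuracy of always predicting the larger of the two classes."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


# Counts quoted in the data set description.
first_total, first_metastases = 78, 34
second_total, second_metastases = 19, 12

# Share of patients who developed metastases within 5 years.
first_rate = first_metastases / first_total
second_rate = second_metastases / second_total

# In the 78 patient group the majority class is "no metastases" (44 of 78).
# Carrying that constant prediction over to the 19 patient group is correct
# only for the 7 patients without metastases.
carried_over_accuracy = (second_total - second_metastases) / second_total

print(f"metastasis rate: {first_rate:.3f} vs {second_rate:.3f}")
print(f"majority baseline, 78 patients: {majority_baseline(first_total, first_metastases):.3f}")
print(f"same constant prediction, 19 patients: {carried_over_accuracy:.3f}")
```

The two groups have different class proportions: a constant prediction reaches about 56% accuracy on the 78 patients but only about 37% on the 19 patients, so any reported accuracy should be compared against these baselines.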

[11] UCI Machine Learning Repository: The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, maintains 313 data sets as a service to the machine learning community:
https://archive.ics.uci.edu/ml/datasets.html

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access is provided to data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities. The data give general information about the quality of care in these facilities:
https://data.medicare.gov/

[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:
https://www.re3data.org/

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11). 

[14] Time series data, i.e. sequences of data points collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields such as astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research on time series classification and clustering:
https://www.cs.ucr.edu/~eamonn/time_series_data/

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery, 7, (4), 349-371.
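To give a feeling for what time series classification means in practice, here is a minimal sketch of a common baseline: z-normalize each series and assign the label of the nearest training series under Euclidean distance. The toy series and labels are invented for illustration, and nothing here is meant as the method prescribed by the UCR collection.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0.0:
        return [0.0] * n  # constant series carries no shape information
    return [(x - mean) / std for x in series]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_1nn(train: Sequence[Tuple[str, Sequence[float]]],
                 query: Sequence[float]) -> str:
    """Return the label of the training series closest to `query`."""
    if not train:
        raise ValueError("training set is empty")
    q = z_normalize(query)
    best_label, best_dist = train[0][0], math.inf
    for label, series in train:
        dist = euclidean(z_normalize(series), q)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Invented toy data: two labelled series of length 5.
train = [
    ("rising", [1.0, 2.0, 3.0, 4.0, 5.0]),
    ("falling", [5.0, 4.0, 3.0, 2.0, 1.0]),
]
print(classify_1nn(train, [10.0, 20.0, 30.0, 40.0, 50.0]))
```

Because of the z-normalization, the query is matched by its shape rather than by its absolute scale, so a series that rises from 10 to 50 is treated like one that rises from 1 to 5.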

[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:
https://yann.lecun.com/exdb/mnist/

Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.
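The MNIST files are distributed in a simple binary layout usually called IDX. The sketch below parses that layout from a byte string, assuming a header of two zero bytes, a type code (0x08 for unsigned bytes), the number of dimensions, and then one big-endian 32-bit size per dimension. The function names are made up, and the sample bytes are synthetic rather than real MNIST data.

```python
import struct
from typing import List, Tuple


def parse_idx(data: bytes) -> Tuple[Tuple[int, ...], bytes]:
    """Split an unsigned-byte IDX byte string into (dimensions, payload)."""
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(payload) != expected:
        raise ValueError("payload size does not match the header")
    return dims, payload


def image_rows(dims: Tuple[int, ...], payload: bytes, index: int) -> List[List[int]]:
    """Return image `index` as a list of pixel rows (values 0..255)."""
    n_images, rows, cols = dims
    if not 0 <= index < n_images:
        raise IndexError("image index out of range")
    start = index * rows * cols
    return [list(payload[start + r * cols:start + (r + 1) * cols])
            for r in range(rows)]


# Synthetic example: two images of 2 x 3 pixels with pixel values 0..11.
sample = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
dims, payload = parse_idx(sample)
print(dims, image_rows(dims, payload, 1))
```

The downloaded files are gzip-compressed, so in practice the bytes would first be read with something like `gzip.open(path, "rb").read()` before being passed to `parse_idx`.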

[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. Its more than 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.

Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.

[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.
https://www.kaggle.com/

[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.
https://datahub.io/
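CKAN sites expose a JSON "Action API" for searching their catalogues. The sketch below builds a `package_search` request URL and extracts dataset titles from a response body. The host `ckan.example.org` is a placeholder, the helper names are hypothetical, and whether the site linked above still serves this API has not been checked.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, rows: int = 5) -> str:
    """URL for a CKAN package_search call on the portal at `base_url`."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles (or names as a fallback) from a response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return [pkg.get("title") or pkg["name"]
            for pkg in payload["result"]["results"]]


# Placeholder host; replace with the base URL of a real CKAN portal.
print(build_search_url("https://ckan.example.org/", "breast cancer"))
```

The response body would normally come from an HTTP client such as `urllib.request.urlopen`; keeping the parsing separate from the network call makes the helper easy to try on a saved response.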

[19] The goal of healthdata.gov is to make health data more accessible for research. It contains over 1800 datasets at the moment.
https://www.healthdata.gov/

[20] Socrata is a cloud software company that also provides open datasets on many different topics.
https://opendata.socrata.com/