July, 7, 2015 Seminar Metabolomics data types

The potential of metabolomics and its various data types

Lecturer: Natalie BORDAG,  CBmed – Center for Biomarker Research in Medicine Graz

Abstract: Metabolomics is one of the youngest -omics technologies and is primarily concerned with the identification and quantification of small molecules (<1500 Da). The specific advantage of metabolomics in biomarker research lies in the concept that metabolites fall downstream of genetic, transcriptomic, proteomic, microbiomic and environmental variation, thus providing the most integrated and dynamic measure of phenotype and medical condition. Metabolomics can therefore deliver biologically highly valuable results, enabling for example early diagnostic biomarkers, optimization of biotechnological production, deep insights into pathological mechanisms, identification of new therapeutic targets and much more. Metabolomics, especially MS (mass spectrometry) based metabolomics, delivers highly diverse data types along the flow from measurement towards knowledge generation, with most of their potential yet to be exploited. The biological potential for knowledge generation by metabolomics will be shown with a real-life example. The different data types and common data aggregation steps (e.g. peak detection, identification), transformations, statistical analyses and visualizations will be shown, and open potentials will be discussed jointly.

July, 7, 2015 Seminar Feature Based Search

Visual-Interactive Search and Exploration in Complex Data Repositories
– Feature-Based Search, Applications and Research Challenges

Lecturer: Tobias SCHRECK, University of Konstanz and Graz University of Technology

Abstract: Advances in data acquisition and storage technology are leading to the creation of large, complex data sets in many different domains including science, engineering or social media. Often, this data is of non-textual / non-spatial nature. Important user tasks for leveraging large complex data sets include retrieval of relevant information, exploration for patterns and insights, and re-use of data for authoring purposes. User-oriented, effective and scalable approaches are needed to support these tasks. Visual-interactive techniques in combination with automatic data analysis approaches can provide effective user interfaces for handling large, complex data sets, and help users to factor in background knowledge for solving search and analysis tasks. We will discuss approaches for visual-interactive, content-based search and analysis tasks in time-oriented and multivariate data sets, with applications in Digital Data Libraries. We will discuss how sketch- and example-based search interfaces allow users to formulate queries effectively, and how appropriate similarity functions for these data types can be defined and evaluated. We will also discuss approaches for visual-interactive search in 3D model repositories. Furthermore, we will present approaches for the repair of 3D models of deteriorated Cultural Heritage objects, relying on appropriate feature-based 3D similarity functions. We conclude this talk with a discussion of interesting research challenges at the intersection of visual data analysis, novel non-textual data types, and applications in Digital Libraries.


Natalie Bordag and Tobias Schreck as guests at the Holzinger Group

Machine Learning in Nature again

Lecun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521, (7553), 436-444.

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

More information: http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html

Nature Issue 7553 contains a special about computational intelligence!




June, 23, 2015 Seminar Talk Machine Learning

Title: Towards Knowledge Discovery with the human in the machine learning loop: An Ontology-Guided Meta-Classifying Approach for the Biomedical Domain

Lecturer: Dominic GIRARDI, RISC-Software Linz, Austria

Abstract: The process of knowledge discovery in clinical research differs significantly from that in other business domains, for example market research. While in the general definitions of knowledge discovery the domain expert has a rather consulting, supervising or customer-like role, the complex process of (bio-)medical or clinical knowledge discovery requires the medical domain expert to be deeply involved in this process. At the same time, data integration and data pre-processing are known to be major pitfalls of such (bio-)medical data projects, because the (bio-)medical domain confronts us with extremely high complexity and heterogeneity, along with unprecedented amounts of data. This lecture discusses the consequences that arise for the knowledge discovery process when the domain expert is moved to a central position in it, and consequently how advanced machine learning algorithms can be combined with traditional, ontology-centered approaches to advance (bio-)medical research. Examples are given from different medical research projects: clinical benchmarking, cerebral aneurysms, and a biometric study of children and young adults.
The theoretical focus of this talk is on how the elaborate structural meta-information of the domain ontology can be used to parametrize and automate advanced machine learning algorithms and data visualization methods. Two examples will be presented: an ontology-guided dimensionality reduction with a focus on hierarchically structured, multi-select categorical variables, and an ontology-guided meta-classifier.


May, 19, 2015 Seminar Talks Machine Learning

Title: Towards Personalization of Diabetes Therapy Using Computerized Decision Support and Machine Learning

Lecturer: Klaus DONSA and Stephan SPAT

Abstract: Diabetes mellitus (DM) is a growing global disease which strongly affects the individual patient and represents a global health burden with financial impact on national health care systems. The therapeutic options include lifestyle changes, such as a change of diet and an increase in physical activity, but also the administration of oral or injectable antidiabetic drugs. Diabetes therapy, especially with insulin, is complex. Therapy decisions involve various medical and lifestyle-related information. Computerized decision support systems (CDSS) aim to improve the treatment process in the patient's self-management but also in institutional care. The personalization of the patient's diabetes treatment is thus possible at different levels and is also facilitated by new therapy aids such as food and activity recognition systems, lifestyle support tools and pattern recognition for insulin therapy optimization. In this talk we discuss the role of machine learning in this context. Furthermore, we provide insights into different strategies to personalize diabetes therapy and into how CDSS can support the therapy process. During our work we found open problems and challenges for the personalization of diabetes therapy. In a final discussion we will address these open problems with a focus on decision support systems and especially machine learning technology.

Apr, 14, 2015 Seminar Talks Deep Learning

Title:  Using Deep Learning for Discovering Knowledge from Images: Pitfalls and Best Practices

Lecturer: Marcus BLOICE

Abstract: Neural networks have been shown to be adept at image analysis and image classification. Deep layered neural networks especially so. However, deep learning requires two things in order to work proficiently: large amounts of data and lots of processing power. In this talk both aspects are covered, allowing you to maximise the potential of deep learning. Firstly, we will learn how the computational power of GPUs can be used to speed up learning by orders of magnitude, making it possible to learn from very large datasets on commodity hardware. Thanks to software such as Theano, Caffe, and Pylearn2, the GPU can be leveraged without needing to be an expert in parallel programming. This talk will discuss how. Secondly, data preprocessing, data augmentation, and artificial data generation are discussed. These methods allow you to ensure you are making the most of the data you possess, by expanding your dataset and preparing your data properly before analysis. This means discussing best practices in data preparation, using methods such as histogram equalisation, contrast stretching or normalisation, and discussing artificial data generation in detail. The tools you require to do so are described, using multi-platform software that is freely available. Finally, the talk will touch on hyper-parameters and the best practices and pitfalls of hyper-parameter choice when training deep neural networks.
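The preprocessing steps named in the abstract (histogram equalisation, contrast stretching, normalisation) and a very simple form of data augmentation can be sketched as follows. This is a minimal illustration under stated assumptions, not the material from the talk: it uses plain NumPy rather than Theano, Caffe or Pylearn2, assumes 8-bit greyscale images stored as `uint8` arrays, and all function names and the percentile defaults are hypothetical.

```python
import numpy as np


def contrast_stretch(img, low_pct=2.0, high_pct=98.0):
    """Linearly rescale intensities between two percentiles to [0, 1]."""
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Constant image: nothing to stretch.
        return np.zeros(img.shape, dtype=np.float64)
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)


def equalize_histogram(img, levels=256):
    """Histogram equalisation for an 8-bit greyscale image (dtype uint8)."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest intensity present
    denom = cdf[-1] - cdf_min
    if denom == 0:
        # Only one intensity present: equalisation is undefined, return a copy.
        return img.copy()
    # Look-up table mapping each old intensity to its equalised value.
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[img]


def standardize(img):
    """Zero-mean, unit-variance normalisation."""
    img = img.astype(np.float64)
    std = img.std()
    centered = img - img.mean()
    return centered / std if std > 0 else centered


def augment_flips(img):
    """Tiny augmentation example: the original plus its two mirror images."""
    return [img, np.fliplr(img), np.flipud(img)]


if __name__ == "__main__":
    # Synthetic low-contrast image with intensities 100..149 (assumed input).
    demo = np.tile(np.arange(100, 150, dtype=np.uint8), (10, 1))
    print(equalize_histogram(demo).min(), equalize_histogram(demo).max())
    print(contrast_stretch(demo).min(), contrast_stretch(demo).max())
    print(len(augment_flips(demo)))
```

Flips are only one possible augmentation; whether they are appropriate depends on the image domain (for example, mirrored digits or text are usually not valid samples).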

Title: Pitfalls for applying Machine Learning in HCI-KDD: Things to be aware of and how to avoid them

Lecturer: Christof STOCKER

Abstract: When dealing with big and unstructured data sets, we often try to be creative and to experiment with a number of different approaches for the purpose of knowledge discovery. This can lead to new insights and even spark novel ideas. However, the ignorant application of algorithms to unknown data is dangerous and can lead to false conclusions – with high statistical significance. In finite data sets, structure can emerge from sheer randomness. Furthermore, hidden variables can lead to significant correlations that in turn might result in wrong conclusions. Beyond this, data science as a discipline has developed into a complex area in which mistakes can occur with ease and even lead experienced scientists astray. In this talk we will investigate these pitfalls together using simple examples and discuss how we can address these concerns with manageable effort.
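Two of the pitfalls mentioned in the abstract can be reproduced with a few lines of NumPy. The sketch below is an illustration under stated assumptions, not an example from the talk: the sample sizes, seeds and function names are hypothetical, and the significance threshold is the tabulated two-sided 5% critical value of Student's t for 18 degrees of freedom (2.101), so it is only valid for the fixed sample size of 20. The first function shows apparent structure emerging from pure noise when many features are tested; the second shows a correlation induced by a hidden variable.

```python
import numpy as np

N_SAMPLES = 20   # fixed, because the critical value below assumes 18 d.o.f.
T_CRIT = 2.101   # two-sided 5% critical t value for 18 degrees of freedom
R_CRIT = T_CRIT / np.sqrt(T_CRIT**2 + (N_SAMPLES - 2))  # about 0.444


def pearson_columns(X, y):
    """Pearson correlation of every column of X with the vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


def count_spurious(n_features=1000, seed=0):
    """Count 'significant' correlations between pure-noise features and a noise target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N_SAMPLES, n_features))
    y = rng.normal(size=N_SAMPLES)
    r = pearson_columns(X, y)
    return int(np.sum(np.abs(r) > R_CRIT)), n_features


def confounder_demo(n=5000, seed=1):
    """x and y never influence each other, yet both depend on a hidden z."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = z + rng.normal(size=n)
    y = z + rng.normal(size=n)
    raw = np.corrcoef(x, y)[0, 1]
    # Remove the linear effect of z from both variables, then correlate the residuals.
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    partial = np.corrcoef(rx, ry)[0, 1]
    return raw, partial


if __name__ == "__main__":
    hits, total = count_spurious()
    print(f"{hits} of {total} noise features pass the 5% threshold")
    raw, partial = confounder_demo()
    print(f"raw correlation {raw:.2f}, after controlling for z {partial:.2f}")
```

With a 5% threshold, roughly one in twenty noise features is expected to look significant, which is why some correction for multiple testing is needed before interpreting such hits.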



Open PhD machine learning

PhD position in “Biomedical data sciences and machine learning” + 2 open MSc positions
in the context of the new competence center for biomarker discovery cbmed.org located at the Medical University Graz.

… have an MSc related to Information & Computer Science (e.g. Informatics, Software Engineering, Telematics, Mathematics, …)
… are eligible to enroll in the Doctoral School Computer Science at Graz University of Technology
… are interested in working within the hci-kdd.org group embedded in the international research community
… have experience and interest in scientific work in an international context
… have a high interest in the topics data science and machine learning
… like undertaking theoretical, algorithmic, and experimental machine learning studies
… want to understand the problem of knowledge discovery from complex high-dimensional data sets

… are offering a PhD position (30 hours per week, 2100 Euro gross per month, 14 x, FWF salary) available immediately
(no closing date, the position will be filled when the ideal candidate has been found)
… a contract for four years, with opportunities to further develop into a PostDoc position with another four years
… do research in information integration in the life sciences, particularly in the integration of multiple heterogeneous data sources (e.g., -omics data, text data, image data, etc.) constituting the foundation for further machine learning based data analytics for biomarker discovery. Selected topics you have to deal with at the beginning include researching how to integrate and analyse available data sources in the biomedical domain, devising a common representation and information fusion model for heterogeneous data sets, and developing and testing model-based infrastructures for information integration and fusion
… are offering a workplace within the vibrant, beautiful and student friendly city of Graz in charming Austria

If you are interested and motivated, please prepare
… a) your scientific résumé,
… b) a sample paper, and
… c) a research statement about your targeted scientific work within the four years (a PhD proposal)
by using the templates which you find here

and send it in one single pdf file directly to a.holzinger@hci-kdd.org

We are looking forward to welcoming you in our group!

Geometric, Topological and Harmonic Trends to Image Processing: submissions due June 1, 2015

Special Issue on Geometric, Topological and Harmonic Trends to Image Processing

Pattern Recognition Letters

Submission deadline: June 1, 2015

Advanced topological measures from the numerical and algebraic perspective, combined with the geometric representations of physical objects and the sparse decomposition using harmonic transforms are generating novel methods for the study of n-dimensional digital or continuous images. The mutual interdependence between harmonic analysis, geometry and topology supports the thesis that these different sources of mathematical information are necessary to fully characterize the spatially structured clouds of points at any dimension. In this special issue, the focus will be on novel methods of multi-dimensional and multi-variate image analysis and image processing using computational harmonic or geometric-topological techniques and algorithms.

The applications envisaged are in multidisciplinary engineering, paying particular attention to recent trends in the industrial setting and in any image-related topic situated at the interplay between these computational areas.

Main Topics of Interest:

Use of harmonic analysis, topological and/or geometric information in image applications;
Computational harmonic analysis, topology or geometry applied to image processing;
Interactions between computational harmonic analysis, geometry and topology in image context;
Geometric and/or harmonic modeling guided by topological constraints;
Algorithm optimization for image applications, transfer of mathematical tools, parallel computation in image context and hierarchical approaches;
Pattern recognition from a harmonic, topological and/or geometrical viewpoint;
Combinatorial, geometric, topological, fractal or multi-resolution models;
Algebraic-topological and/or geometric invariants and features for n-dimensional images and their computation.

Submission Information:

See detailed Guide for Authors here: http://www.elsevier.com/journals/pattern-recognition-letters/0167-8655/guide-for-authors Papers can have a maximum length of 10 pages in the journal template.

Submit your paper here: http://ees.elsevier.com/patrec. Make sure to select “SI: GeToHa” as the Article type. Submission is possible starting from May 1, 2015. The submission deadline is June 1, 2015.

Papers will be reviewed according to the normal journal standards. Papers will receive at most two rounds of reviews. We will strive to finish the first round of review four to six weeks after submission.

For more information, please contact the Managing Guest Editor.

Pedro Real, Managing Guest Editor
Institute of Mathematics of Seville University (IMUS)
ETS. Ingeniería Informática, University of Seville, Spain


Darian Onchis Moaca, Guest Editor
Eftimie-Murgu University, Romania

Helena Molina-Abril, Guest Editor
The Maimonides Institute for Biomedical Research of Cordoba (IMIBIC), Spain

Mihail Gaianu, Guest Editor
West University of Timisoara, Romania

The future is in Open Data Sets

The idea of “open data” is not new. Many researchers have long held the view that science is a public enterprise and that certain data should be openly available [1], and it has recently also become a big topic in the biomedical domain [2], [3]; e.g., the British Medical Journal (BMJ) started a big open data campaign [4]. The goal of the movement is similar to that of open source, open content or open access. With the launch of open government data initiatives, the open data movement gained momentum [5], and some already speak of an Open Knowledge Foundation [6]. Consequently, there are plenty of research challenges on this topic. Cancer research, for example, could benefit dramatically from science without boundaries.

[1]   L. Rowen, G. K. S. Wong, R. P. Lane, and L. Hood, “Intellectual property – Publication rights in the era of open data release policies,” Science, vol. 289, pp. 1881-1881, Sep 2000.

[2]  G. Boulton, M. Rawlins, P. Vallance, and M. Walport, “Science as a public enterprise: the case for open data,” The Lancet, vol. 377, pp. 1633-1635, 2011.

[3]   A. Hersey, S. Senger, and J. P. Overington, “Open data for drug discovery: learning from the biological community,” Future Medicinal Chemistry, vol. 4, pp. 1865-1867, Oct 2012.

[4]  M. Thompson and C. Heneghan, “BMJ OPEN DATA CAMPAIGN We need to move the debate on open clinical trial data forward,” British Medical Journal, vol. 345, Dec 2012.

[5]  N. Shadbolt, K. O’Hara, T. Berners-Lee, N. Gibbins, H. Glaser, W. Hall, et al., “Open Government Data and the Linked Data Web: Lessons from data.gov.uk,” IEEE Intelligent Systems, pp. 16-24, 2012.

[6]   J. C. Molloy, “The Open Knowledge Foundation: Open Data Means Better Science,” Plos Biology, vol. 9, Dec 2011.

Here are some sample data sets:

[7] 1000 Genomes: A deep catalog of human genetic variation. The project sequenced the genomes of a large number of people in order to provide a comprehensive resource on human genetic variation. It contains about 2,500 samples from 2010 and 2011:

1000 Genomes Project Consortium and others. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), 1061-1073.

[8] Tiny Images dataset: The data set consists of over 79 million color images. They are stored in a 227 GB binary file. A Matlab toolbox to access the images is provided. Automatic annotation data is available for all images, but manual annotation data is only available for a smaller portion:

A. Torralba and R. Fergus and W. T. Freeman. 2008. 80 Million Tiny Images: a Large Database for Non-Parametric Object and Scene Recognition. IEEE PAMI, 30 (11), 1958-1970.

[9] To get familiar with the abundance of different skin diseases, see this very informative collection of skin images provided by Healthline.com: http://www.healthline.com/health/skin-disorders

[10]  Breast Tumor (gene expression) data of Van’t Veer (2002): The training data set consists of 78 primary breast cancers, of which 34 patients developed metastases within 5 years. The test set contains 19 breast cancer patients, of which 12 developed metastases within 5 years. The data contains 24188 gene expression levels. The general goal is to predict metastases in order to improve the therapy strategy:

Van’t Veer, L. J., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J. & Witteveen, A. T. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, (6871), 530-536. 

[11] UCI Machine Learning Repository: The Center for Machine Learning and Intelligent Systems at the University of California, Irvine, maintains 313 data sets as a service to the machine learning community:

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[12] Data.Medicare.gov: Data sets from Medicare.gov for downloading, exploring, and visualizing. Direct access to data sets, including data sets from hospitals, nursing homes, physicians, homes, suppliers and other facilities, is provided. The data gives general information about the quality of care in these facilities.

[13] re3data.org: a Registry of Research Data Repositories. Research data repositories from different academic disciplines are featured here. The project promotes a culture of sharing between researchers. It started in 2012 and is funded by the German Research Foundation:

Pampel H, Vierkant P, Scholze F, Bertelmann R, Kindling M, et al. 2013. Making Research Data Repositories Visible: The re3data.org Registry. PLoS ONE, 8 (11). 

[14] Time series data, i.e. sequences of point sets collected over a time interval, are widely used, e.g. in biomedicine (heart rate, ECG, EEG, etc.), but also in many other fields, e.g. in astronomy or earthquake prediction. The University of California Riverside (UCR) Time Series Classification and Clustering Collection has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering:

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and knowledge discovery, 7, (4), 349-371.

[15] The MNIST database of handwritten digits includes a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image:

Liu, C. L., Nakashima, K., Sako, H. & Fujisawa, H. 2003. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36, (10), 2271-2285.

[16] KONECT (the Koblenz Network Collection) gathers large network datasets of various types. Its more than 200 open datasets are collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau.

Kunegis, Jérôme (2013). KONECT – The Koblenz Network Collection. Proc. Int. Conf. on World Wide Web Companion, pages 1343-1350.

[17] Kaggle offers competitions and thus provides many different kinds of real-world open data for scientists.

[18] CKAN serves as a data management tool used by organizations, research institutions and governments since 2006. It has been developed by the Open Knowledge Foundation.

[19] The goal of healthdata.gov is to make health data more accessible for research. It contains over 1800 datasets at the moment.

[20] Socrata is a cloud software company which also provides open datasets on many different topics.


Machine Learning in Nature

Apart from occasional news entries, computer science rarely makes it into Nature. A quick count in the Web of Science yields 33 articles; the most recent one, from a year ago, is Ekert, A. & Renner, R. 2014. The ultimate physical limits of privacy. Nature, 507, (7493), 443-447, and the most prominent one, with 3,200 citations, is surely Strogatz, S. H. 2001. Exploring complex networks. Nature, 410, (6825), 268-276.

Now, machine learning has made it into Nature: The group of DeepMind Technologies founded by Demis Hassabis in 2011 as a start-up company, and purchased by Google for approx. 400 Million USD in 2014, has published a paper, which appeared today, 26.02.2015:

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature, 518, (7540), 529-533.

Abstract: The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. 
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
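The temporal difference update that the abstract refers to can be illustrated with tabular Q-learning on a toy problem. The sketch below is a deliberately simplified illustration and not the method of the paper: the paper replaces the table with a deep network operating on pixels, whereas here the environment (a five-state chain with a reward at the right end), the hyperparameter values and all names are hypothetical assumptions chosen so that the example runs in plain NumPy.

```python
import numpy as np

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)    # action index 0 = move left, 1 = move right


def step(state, action_idx):
    """Toy chain environment: reward 1.0 only when the last state is reached."""
    nxt = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(len(ACTIONS)))        # explore
            else:
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))                  # exploit, random tie-break
            nxt, reward, done = step(state, action)
            # Temporal difference target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
    return q


if __name__ == "__main__":
    q_table = train()
    print(np.round(q_table, 3))
    print("greedy actions:", q_table[:-1].argmax(axis=1))
```

In a deep Q-network the table lookup `q[state]` is replaced by the output of a neural network, and the same kind of target is used as the regression label for training it.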

Subject terms: Computer Science

The editors' summary: For an artificial agent to be considered truly intelligent it needs to excel at a variety of tasks considered challenging for humans. To date, it has only been possible to create individual algorithms able to master a single discipline – for example, IBM’s Deep Blue beat the human world champion at chess but was not able to do anything else. Now a team working at Google’s DeepMind subsidiary has developed an artificial agent – dubbed a deep Q-network – that learns to play 49 classic Atari 2600 ‘arcade’ games directly from sensory experience, achieving performance on a par with that of an expert human player. By combining reinforcement learning (selecting actions that maximize reward – in this case the game score) with deep learning (multilayered feature extraction from high-dimensional data – in this case the pixels), the game-playing agent takes artificial intelligence a step nearer the goal of systems capable of learning a diversity of challenging tasks from scratch.

More information: http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#tables