Machine Learning for Health Informatics
Machine learning is among the fastest-growing fields in computer science [Jordan, M. I. & Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science, 349, (6245), 255-260], and it is widely accepted that health informatics is among its greatest challenges [LeCun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521, (7553), 436-444].
Machine learning is a broad and rapidly developing subfield of computer science that evolved from artificial intelligence (AI) and is tightly connected with knowledge discovery and data mining (KDD). The ultimate goal of machine learning is to design and develop algorithms that can learn from data. Machine learning systems consequently learn and improve with experience over time, and their trained models can be used to predict outcomes for new inputs on the basis of previously seen data. In fact, learning intelligent behaviour from noisy examples is one of the most pressing questions in AI. The ability to learn from noisy, complex, high-dimensional data sets is highly relevant to many applications in the health informatics domain. This is due to the inherent nature of biomedical data, and health will increasingly be a focus of machine learning research in the near future.
1) Automatic Machine Learning vs. interactive Machine Learning for Health Informatics [slides]
Head of the Holzinger Group HCI-KDD, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz; Lead of machine learning at the K1 competence center CBmed, and Graz University of Technology, Institute of Information Systems and Computer Media; Program co-chair of the 14th IEEE Intl. Conference on Machine Learning and Applications; Assoc. Editor of Springer Knowledge and Information Systems; Founder and lead of the expert network HCI-KDD; and member of IFIP WG 12.9 Computational Intelligence
Abstract: The classic question in Machine Learning is: “How can algorithms be developed that can automatically improve with experience, and what are the fundamental laws that govern all learning processes?”. To this end, most researchers in the Machine Learning community concentrate on automatic Machine Learning (aML) approaches, and enormous strides have been made this way in many domains, such as speech recognition or autonomous vehicles. Interactive Machine Learning (iML*) has its roots in three approaches: Reinforcement Learning (RL), Preference Learning (PL), and Active Learning (AL). However, the term itself is not yet in widespread use; consequently we define iML approaches as algorithms that can interact with both computational agents and human agents **) and can optimize their learning behaviour through these interactions. This is especially important in health informatics, where issues arise not only with big data, but often with small but complex data sets or rare events, where traditional learning algorithms suffer from insufficient training samples. Moreover, such data sets often have very high dimensionality, so models are in danger of fitting artefacts. Here iML approaches can help, for example through a concept known as the “doctor-in-the-loop”, where human expertise assists in solving hard problems. Take, for example, the k-anonymisation of patient data, protein folding, or subspace clustering, where human intelligence can help to reduce an exponential search space through the heuristic selection of samples. What would otherwise be an NP-hard problem is thus greatly reduced in complexity through the input and assistance of a human agent involved in the learning phase and influencing measures such as distance or cost functions. Finally, this talk fosters an integrated approach, i.e.
for the successful application of machine learning algorithms in the health sciences, a comprehensive understanding of the data ecosystem and knowledge discovery pipeline is essential. This means that a multidisciplinary skill set is required, encompassing the following seven specialisations: 1) data science, 2) algorithms, 3) network science, 4) graphs/topology, 5) time/entropy, 6) data visualization, and 7) privacy, data protection, safety and security (see the HCI-KDD approach).
*) Holzinger, A. 2016. Interactive Machine Learning (iML). Informatik Spektrum. DOI: 10.1007/s00287-015-0941-6
and Holzinger, A. 2016. Interactive Machine Learning for Health Informatics: When do we need the human-in-the-loop? Springer Brain Informatics, Vol. 3, Issue 1, in print (preprint available online).
**) In Active Learning such agents are referred to as “oracles” (see for example: Settles, B. 2011. From theories to queries: Active learning in practice. In: Guyon, I., Cawley, G., Dror, G., Lemaire, V. & Statnikov, A. (eds.) Active Learning and Experimental Design Workshop 2010. Sardinia: JMLR Proceedings. 1-18).
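The oracle idea in **) can be made concrete with a small sketch. The following Python example is illustrative only: the data, the learner, and the uncertainty-sampling query strategy are minimal choices of my own, not taken from the talk. It runs an active learning loop in which an oracle (standing in for the human agent) supplies labels for the queried samples:

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=200):
    """Fit a 1-D logistic model p(y=1|x) = sigmoid(w*x + b) by SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def most_uncertain(xs, w, b):
    """Index of the pool sample whose prediction is closest to 0.5."""
    return min(range(len(xs)),
               key=lambda i: abs(1.0 / (1.0 + math.exp(-(w * xs[i] + b))) - 0.5))

# Toy 1-D pool: class 0 clusters near 0, class 1 near 4; the dict plays
# the role of the human oracle answering label queries.
oracle = {0.1: 0, 0.4: 0, 1.9: 0, 2.1: 1, 3.8: 1, 4.2: 1}
labelled_x, labelled_y = [0.1, 4.2], [0, 1]   # two labelled seed samples
pool = [0.4, 1.9, 2.1, 3.8]                   # unlabelled pool

for _ in range(3):                            # three query rounds
    w, b = train_logistic(labelled_x, labelled_y)
    x = pool.pop(most_uncertain(pool, w, b))  # query the most uncertain sample
    labelled_x.append(x)
    labelled_y.append(oracle[x])              # the oracle supplies the label

print(len(labelled_x), "samples labelled after three queries")
```

The point of the sketch is the interaction pattern: the learner chooses what to ask, and the human agent's answers steer subsequent training.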
2) Julia for teaching machine learning: From Linear Regression to Soft-Margin Classifier using Julia
PhD Student at the Holzinger Group HCI-KDD
Abstract: Linear and logistic regression are among the most popular introductory topics in machine learning, and with good reason. These models are particularly useful for gradually introducing and discussing some of the core concepts of the field that we will need in the later talks on deep learning. These concepts include various loss functions, under- and over-fitting, regularization, and gradient-based optimization. The goal of this talk is to give the participants a good intuition for these topics, all of which we will discuss using code examples in Julia. Julia – see julialang.org – is an open-source, high-level, high-performance dynamic programming language for scientific computing. Its syntax is similar to other technical computing environments and will help us explore some of the core concepts of machine learning. Tutorial length: approximately 30 minutes.
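As a taste of the material, the core idea of gradient-based optimization for linear regression can be sketched in a few lines (shown here in Python for brevity; the tutorial itself works through the equivalent code in Julia):

```python
# Fit y ≈ w*x + b by minimising mean squared error with gradient descent.
def fit_linear(xs, ys, lr=0.05, epochs=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the MSE loss with respect to w and b.
        gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        gb = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # noiseless data from y = 2x + 1
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # recovers approximately 2.0 and 1.0
```

Swapping the squared-error loss for the logistic loss, and adding a penalty term to the loss, gives logistic regression and regularization with the same optimization loop, which is exactly the progression the talk follows.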
3) GPU-driven deep learning analysis of melanocytic skin lesion image data using Caffe
PhD Student at the Holzinger Group HCI-KDD
Abstract: This tutorial will present the installation and usage of Caffe – see caffe.berkeleyvision.org – a popular deep learning framework developed by the Berkeley Vision and Learning Center at the University of California, Berkeley. It will first discuss how to obtain, compile, and run the Caffe software under Linux on a GPU-equipped workstation. We will then see how, through the use of a mid-range gaming GPU, training times can be reduced by a factor of 20 compared to a CPU. Lastly, a concrete example of a deep learning task will be presented in the form of a live analysis of a confocal laser scanning microscopy dataset of skin lesion images, by training a model that automatically classifies these images into malignant and benign cases with a high degree of accuracy. Tutorial length: approximately 30 minutes.
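To give an idea of what working with Caffe looks like, a minimal solver definition for GPU training might look like the following sketch (file paths and hyper-parameter values are illustrative placeholders, not taken from the talk):

```
net: "models/lesion_train_val.prototxt"  # network definition (placeholder path)
test_iter: 100
test_interval: 500
base_lr: 0.01          # initial learning rate
lr_policy: "step"
gamma: 0.1
stepsize: 5000
momentum: 0.9
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "snapshots/lesion"
solver_mode: GPU       # switch to CPU to compare training times
```

Training is then started from the command line with `caffe train --solver=solver.prototxt --gpu=0`; toggling `solver_mode` (or dropping the `--gpu` flag) makes the CPU-versus-GPU timing comparison straightforward.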
4) Applied Machine Learning for text document classification in Python
PhD Student at the Holzinger Group and Researcher in Project 1.3 machine learning at the K1 Competence Center CBmed Center for Biomarker Research in Medicine
Abstract: Document classification is a challenging task in machine learning, especially due to the high dimensionality and sparsity of document term vectors. The goal of this tutorial is to present a toolbox of machine learning techniques that can be applied to this problem. We will start with a simple bag-of-words model and a linear classifier, and iteratively improve the performance by introducing various preprocessing steps (e.g., TF-IDF, lemmatization, n-grams, word and document vectors) and various machine learning techniques/models (e.g., Random Forests, SVC, ensemble methods). At the end of this tutorial, the audience will have an intuition of how to approach complex tasks with machine learning technologies.
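The tutorial's starting point, a bag-of-words model reweighted with TF-IDF, can be sketched in plain Python (the documents and terms below are invented placeholders; in practice a library vectorizer would be used):

```python
import math
from collections import Counter

docs = [
    "the patient shows a benign lesion",
    "malignant lesion detected in the patient",
    "the weather is nice",
]

# Bag of words: raw term counts per document.
bows = [Counter(doc.split()) for doc in docs]

# Inverse document frequency: terms occurring in fewer documents get
# a higher weight; terms occurring everywhere get weight zero.
vocab = {w for bow in bows for w in bow}
n = len(docs)
idf = {w: math.log(n / sum(1 for bow in bows if w in bow)) for w in vocab}

# TF-IDF: term frequency within the document times the IDF weight.
tfidf = [{w: c / sum(bow.values()) * idf[w] for w, c in bow.items()}
         for bow in bows]

print(idf["the"])                 # 0.0, since "the" appears in every document
print(tfidf[1]["malignant"] > 0)  # True, a discriminative term keeps weight
```

The resulting sparse vectors are exactly the kind of high-dimensional input the abstract mentions, and they feed directly into the linear classifiers discussed next.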
Tutorial length: approximately 30 minutes.
5) Fast Saddle Point Algorithms for Variable Selection
Visiting Postdoc Researcher at the Holzinger Group HCI-KDD from the Lehrstuhl 8 Artificial Intelligence Group TU Dortmund
Abstract: In this talk we introduce saddle-point (SP) formulations and algorithms for solving variable selection problems. Compared with non-SP alternatives, SP algorithms do not require users to specify a sensitive parameter that can affect solution quality and/or algorithm performance. In particular, we propose a simple primal-dual proximal extragradient algorithm to solve the generalized Dantzig selector (GDS) estimation, based on a new convex-concave saddle-point (CCSP) formulation and a linear gradient extrapolation technique. We demonstrate our algorithm on two variable selection problems: (1) biomarker selection from genetic data, and (2) network selection from integrated cancer data of multiple types from TCGA.
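For orientation, the classic Dantzig selector (Candès & Tao, 2007), of which the GDS is a generalization, estimates a sparse coefficient vector from noisy linear measurements y = Xβ + ε via a linear program:

```latex
% Classic Dantzig selector: minimise sparsity subject to a bound on the
% correlation between the predictors and the residual.
\hat{\beta} \;=\; \arg\min_{\beta}\; \|\beta\|_{1}
\quad \text{subject to} \quad
\bigl\| X^{\top}(y - X\beta) \bigr\|_{\infty} \;\le\; \lambda
```

A constrained problem of this form can be rewritten as a convex-concave min-max problem via Lagrangian duality, which is the general route to saddle-point formulations of the kind the talk develops.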
6) Fingerprinting for data leak detection – trust of the “doctor-in-the-loop”
Visiting Doctoral Researcher at the Holzinger Group HCI-KDD from Secure Business Austria SBA Research
Abstract: Exchanging research data is a key enabler for machine learning, and the combination of different data sets is considered a major driver in data-driven research projects across all kinds of fields, including biomedical research and the technological and sociological sciences. Combining data sets not only allows the verification of results and the exploration of larger collections of data; it often also enables additional perspectives on known results. One of the main obstacles to the efficient sharing of data is the need for data leakage detection. While anonymization and pseudonymization techniques are used to protect private information, this is often not enough: the research data itself frequently constitutes a very valuable asset of the participating research institutions, so data leakage is a serious problem. In this talk we will present approaches for data leakage detection that specifically target the needs of data-driven research. These approaches combine the requirements of anonymization and fingerprinting in a single step, and allow the leaking party to be identified from the revelation of a single record.
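As a deliberately naive toy illustration of the single-record idea (not the scheme presented in the talk; all names and keys are invented), giving each recipient a copy of the data carrying a keyed per-recipient fingerprint already allows the leaking party to be identified from one record:

```python
import hashlib
import hmac

# Each data recipient is assigned a secret fingerprinting key.
RECIPIENT_KEYS = {"clinic_a": b"secret-key-a", "clinic_b": b"secret-key-b"}

def fingerprint(record, key):
    """Keyed fingerprint of a record (truncated HMAC-SHA256)."""
    return hmac.new(key, record.encode(), hashlib.sha256).hexdigest()[:8]

def mark(record, recipient):
    """Attach the recipient-specific fingerprint to a record copy."""
    return record + "|" + fingerprint(record, RECIPIENT_KEYS[recipient])

def identify_leaker(leaked):
    """Given one leaked record, recompute fingerprints to find the source."""
    record, fp = leaked.rsplit("|", 1)
    for recipient, key in RECIPIENT_KEYS.items():
        if hmac.compare_digest(fp, fingerprint(record, key)):
            return recipient
    return None

leaked = mark("patient42;age=57;dx=melanoma", "clinic_b")
print(identify_leaker(leaked))  # clinic_b
```

Real fingerprinting schemes embed the mark imperceptibly in the data values themselves, and must survive the anonymization step; appending a visible token as above only illustrates the detection logic.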
7) Towards open Medical Data: k-Anonymisation as an NP-hard Problem
PhD Student at the Holzinger Group HCI-KDD
Abstract: The amount of patient-related data produced in today’s clinical setting poses many challenges with respect to collection, storage and responsible use. For example, in research and public health care analysis, data must be anonymized before transfer, for which the k-anonymity measure was introduced and successively enhanced by further criteria. As k-anonymisation is an NP-hard problem, modern approaches often make use of heuristic methods. This talk will convey the motivation for anonymization, followed by an outline of its criteria, as well as a general overview of methods and algorithmic approaches to tackle the problem. As the resulting data set will be a tradeoff between utility and individual privacy, we need to optimize those measures to individual (subjective) standards. Moreover, the efficacy of an algorithm strongly depends on the background knowledge of a potential attacker as well as the underlying problem domain. I will therefore conclude the session by contemplating an interactive machine learning (iML) approach, pointing out how domain experts might get involved to improve upon current methods.
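To make the k-anonymity measure itself concrete, the following sketch (with invented example records) computes the k for which a generalised table is k-anonymous: the smallest equivalence-class size over the quasi-identifier columns. The NP-hard part, not shown here, is finding the generalisation that achieves a required k while losing as little utility as possible:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size when records are grouped by their
    quasi-identifier values; the table is k-anonymous iff this is >= k."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy table already generalised: zip codes truncated, ages binned.
table = [
    {"zip": "806*", "age": "50-60", "dx": "melanoma"},
    {"zip": "806*", "age": "50-60", "dx": "nevus"},
    {"zip": "801*", "age": "20-30", "dx": "psoriasis"},
    {"zip": "801*", "age": "20-30", "dx": "eczema"},
]

print(k_anonymity(table, ["zip", "age"]))  # 2, so the table is 2-anonymous
```

Every record shares its quasi-identifier combination with at least one other record, so an attacker linking on zip code and age cannot narrow any individual down to fewer than two rows.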