Machine Learning for Biomedicine
Date: Tuesday, 26th January 2016, Start: 10:00, End: 17:00; Venue: Graz University of Technology,
Institute of Computer Graphics and Knowledge Visualization CGV, hosted by Prof. Tobias SCHRECK
Address: Inffeldgasse 16c, A-8010 Graz <maps and directions>
Machine learning is one of the fastest-growing fields in computer science [Jordan, M. I. & Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science, 349, (6245), 255-260], and it is well accepted that health informatics is amongst its greatest challenges [LeCun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521, (7553), 436-444].
Successful Machine Learning for Health Informatics requires a comprehensive understanding of the data ecosystem and a multi-disciplinary skill set drawn from seven specializations: 1) data science, 2) algorithms, 3) network science, 4) graphs/topology, 5) time/entropy, 6) data visualization and visual analytics, and 7) privacy, data protection, safety and security – as supported by the international expert network HCI-KDD.
10:00-10:10 Welcome, Opening Remarks and Introduction to the Workshop Topics
About: Tobias is the new Professor at the Institute of Computer Graphics and Knowledge Visualization (CGV) at the Faculty of Informatics and Biomedical Engineering of Graz University of Technology and the Deputy Head of the Institute. Tobias is an expert in data visualization, visual analytics and visual search.
10:10-10:40 Visual Parameter Space Analysis
Abstract: Selecting and parameterizing models and algorithms is a widespread challenge across science, engineering, and business. Currently, these processes are largely based on a user’s intuition, leading to tedious trial-and-error approaches. In this talk, I will introduce and advocate a different approach, visual parameter space analysis, which seeks to make these processes more systematic and transparent. To do so, first, a broad set of parameterizations is sampled, and the respective outputs are generated. Through thoroughly designed visual interfaces, the user can then compare and analyze the space of possible outputs, and make more informed choices. Along with a set of case studies, I will introduce a conceptual framework for visual parameter space analysis that can be used to guide the design of such tools.
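As a minimal, hypothetical sketch of the sampling stage described above (the model, parameter names, and value ranges are invented placeholders, not from the talk), one might generate a broad set of parameterizations and collect the corresponding outputs for later visual comparison:

```python
# Sample a grid of parameterizations and record (parameters, output) pairs;
# a visual interface would then let the user compare these outputs.
import itertools

def model(threshold, bandwidth):
    # stand-in for any algorithm whose behaviour depends on its parameters
    return threshold * 2 + bandwidth

thresholds = [0.1, 0.5, 0.9]
bandwidths = [1, 5, 10]

samples = [
    {"threshold": t, "bandwidth": b, "output": model(t, b)}
    for t, b in itertools.product(thresholds, bandwidths)
]

for s in samples:
    print(s)
```

The point of the sketch is only the structure: sampling happens once, up front, so that all subsequent analysis operates on the collected (parameter, output) records rather than on one run at a time.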
About: Michael is an assistant professor at the University of Vienna, Department of Computer Science, Visualization and Data Analysis Group (VDA). He works on data visualization, human-computer interaction, and the theoretical foundations of visualization, and is particularly interested in dimensionality reduction.
10:40 – 11:10 Subspace Clustering with the doctor in the loop
Abstract: Exploring patterns in datasets with a large number of attributes (high-dimensional data) is challenging due to the so-called curse of dimensionality. Noisy, irrelevant, redundant, or contradicting information may lead to distorted analysis results. Furthermore, computational concepts such as distance functions often do not represent a useful similarity measure, as many objects are similar to each other only in a subset of their dimensions. In this talk, I will motivate analyzing data not by considering all available dimensions, but rather by identifying different analysis results in different relevant subsets of dimensions (subspaces). In particular, I will describe subspace clustering and subspace nearest neighbor search as two means to identify patterns in subspaces of high-dimensional data. I will compare both methods to their full-space counterparts, describe their advantages and drawbacks, and show the necessity of visualizations and domain experts to make sense of detected patterns.
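A small, made-up illustration of the core observation above (not taken from the talk): two objects that are nearly identical in a relevant subspace can still appear far apart when the distance is computed over all dimensions, because the remaining dimensions carry noise.

```python
# Compare a full-space Euclidean distance with the same distance restricted
# to a relevant subspace (dimensions 0 and 1). Data are invented.
import math

def dist(a, b, dims):
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

a = [1.0, 2.0, 50.0, -30.0]
b = [1.1, 2.1, -40.0, 70.0]

full = dist(a, b, range(4))   # distance over all four dimensions
sub = dist(a, b, [0, 1])      # distance in the relevant subspace

print(round(sub, 3), round(full, 3))  # small vs. large
```

In the full space the two points look dissimilar, while in the subspace they are near neighbors, which is exactly the kind of pattern subspace clustering and subspace nearest neighbor search aim to recover.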
About: Michael [https://www.vis.uni-konstanz.de/en/members/hund/] is a PhD student at the Chair for Data Analysis and Visualization [https://www.vis.uni-konstanz.de/en] of Prof. Daniel Keim at the University of Konstanz, Germany. His research interests are in data visualization, high-dimensional data analysis, and visual analytics for subspace analysis.
11:10-11:40 Electronic Health Records and the doctor-in-the-loop
Abstract: The development and implementation of information and communication technology (ICT) systems and electronic Health Records (eHRs) in hospitals and Primary Health Care (PHC) in Europe has proven to be a challenging process. At the beginning of this era, the primary objective was to move from paper-based to digital recording systems, to improve health administration and reduce the extensive use of paper. In the last decade, the volume of knowledge and the range of medication treatment options in clinical medicine have grown rapidly. This was a strong stimulus for the further expansion of ICT and eHR use in healthcare. The focus has been placed on the improvement of patient care workflows and healthcare system management, and on the reporting of adverse medication reactions. The recent incentive for this development came with the explosion of technology-driven science, such as genomics and other high-throughput technologies, which emphasises the need to integrate data from molecular biology and other advanced technologies in medicine with information on patients’ demographics and health-related status. This requirement revived the old idea of the potential of eHRs for secondary use: beyond the benefits in clinical care, for performing biomedical research. This idea is further boosted by recent technological advancements in healthcare, such as the use of biosensors and mobile applications, which all offer enormous possibilities, not only for the improvement of patient care, but also for clinical and translational research. On the other hand, there are still many obstacles to retrieving data from eHRs for their exploitation in research. The most challenging issues are the incompleteness of the data and the lack of interoperability for this dual use. The role of a medical expert (a doctor-in-the-loop) in solving this task is of the utmost importance.
Initiatives that are underway for re-purposing data from eHRs to make them appropriate for use in clinical research are, however, promising and are expected to yield benefits in this direction soon.
About: Ljiljana is an MD at the University Hospital Osijek in Croatia, a specialist in family medicine, and Professor in the Department of Family Medicine and Internal Medicine, School of Medicine, University JJ Strossmayer. She has also completed a postgraduate study in clinical immunology and allergology at the School of Medicine, University of Zagreb.
13:00 – 13:30 Machine Learning for Biomedical Informatics: When do we need the doctor-in-the-loop?
Abstract: The goal of ML is to develop algorithms which can learn and improve over time. Most ML researchers concentrate on automatic machine learning (aML), where great advances have been made, for example, in speech recognition, recommender systems, or autonomous vehicles. Automatic approaches greatly benefit from big data with many training sets. In the biomedical domain, however, we are often confronted with a small number of data sets, where aML approaches suffer from insufficient training samples. Here, interactive machine learning (iML) may be of help, having its roots in reinforcement, preference and active learning. However, the term iML is not yet in widespread use, so we define it as “algorithms that can interact with agents and can optimize their learning behaviour through these interactions, where the agents can also be human”. A “doctor-in-the-loop” can be beneficial in solving computationally hard problems, e.g., subspace clustering, protein folding, or k-anonymization of health data, where human expertise can help to reduce an exponential search space through heuristic selection of samples. Therefore, what would otherwise be an NP-hard problem is greatly reduced in complexity through the input and assistance of a human agent involved in the learning phase. However, for the successful application of ML in the biomedical domain, a multidisciplinary skill set is required, encompassing the following seven specializations: 1) data science, 2) algorithms, 3) network science, 4) graphs/topology, 5) time/entropy, 6) data visualization, and 7) privacy, data protection, safety and security, fostered in the HCI-KDD approach.
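To give a purely hypothetical flavour of how expert input can shrink an exponential search space (the attributes and the `expert_says_relevant` filter below are invented stand-ins for human judgement, not from the talk): with n attributes there are 2^n − 1 candidate attribute subsets, and an expert can discard most of them before any costly computation.

```python
# Enumerate all candidate attribute subsets (exponential in n), then let a
# stand-in for the "doctor-in-the-loop" prune implausible candidates.
from itertools import combinations

attributes = ["age", "bmi", "hba1c", "smoker", "zip"]

candidates = [s for r in range(1, len(attributes) + 1)
              for s in combinations(attributes, r)]

def expert_says_relevant(subset):
    # placeholder for human expertise: keep only clinically plausible
    # combinations (here: must include hba1c, must not include zip)
    return "hba1c" in subset and "zip" not in subset

pruned = [s for s in candidates if expert_says_relevant(s)]
print(len(candidates), len(pruned))  # 31 candidates shrink to 8
```

The expensive learning step then runs only on the pruned set, which is the sense in which human heuristics can tame an otherwise exponential search.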
About: Andreas is head of the Holzinger Group HCI-KDD, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz; lead of machine learning at the K1 competence center CBmed; and at Graz University of Technology, Institute of Information Systems and Computer Media; founder and leader of the expert network HCI-KDD.
13:30 – 14:00 Fingerprinting for data leak detection – trust of the “doctor-in-the-loop”
Abstract: Exchanging research data is one of the main aspects of machine learning, and the combination of different data sets is considered a major driver in data-driven research projects, spanning all kinds of research fields including biomedical research and the technological and sociological sciences. Combining data sets not only allows the verification of results or the exploration of larger sets of data; it is also often valuable for enabling additional perspectives on known results. One of the main obstacles that obstruct the efficient sharing of data lies in the need for data leakage detection. While anonymization or pseudonymization techniques are used in order to protect private information, this is often not enough: the research data itself often constitutes a very valuable asset of the participating research institutions, and data leakage is thus a very serious problem. In this talk we will present approaches for data leakage detection, specifically targeting the needs of data-driven research. These approaches combine the requirements of anonymization and fingerprinting in one single step and allow the detection of the leaking party based on the revelation of a single record.
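As a deliberately simplistic toy sketch of the fingerprinting idea (this is not the scheme presented in the talk; recipients, fields, and the marking rule are all invented), each recipient receives a copy of the data carrying a recipient-specific mark, so that one leaked record already identifies the leaking party:

```python
# Embed a recipient-specific mark in a tolerant numeric field, then recover
# the leaking party from a single leaked record by re-marking the original.
RECIPIENTS = ["hospital_a", "hospital_b"]

def mark_record(record, recipient):
    # mark = recipient index + 1, hidden in the least significant digits
    mark = RECIPIENTS.index(recipient) + 1
    marked = dict(record)
    marked["lab_value"] = round(record["lab_value"] + mark / 1000, 3)
    return marked

def identify_leaker(leaked, original):
    for r in RECIPIENTS:
        if mark_record(original, r)["lab_value"] == leaked["lab_value"]:
            return r
    return None

rec = {"id": 42, "lab_value": 7.4}
leaked = mark_record(rec, "hospital_b")
print(identify_leaker(leaked, rec))  # hospital_b
```

Real schemes must of course survive collusion, rounding, and partial leaks, and combine this with anonymization in a single step, which is exactly the combination the talk addresses.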
About: Peter is visiting doctoral researcher at the Holzinger Group HCI-KDD from Secure Business Austria SBA Research. Peter repeatedly organizes the IWSMA and is currently first vice chair of the joint IEEE-chapter CS/SMC in Austria. His main interest is in privacy aware machine learning.
14:00-14:30 Towards open Medical Data: k-Anonymisation as an NP-hard Problem
Abstract: The amount of patient-related data produced in today’s clinical setting poses many challenges with respect to collection, storage and responsible use. For example, in research and public health care analysis, data must be anonymized before transfer, for which the k-anonymity measure was introduced and successively enhanced by further criteria. As k-anonymity is an NP-hard problem, modern approaches often make use of heuristics-based methods. This talk will convey the motivation for anonymization, followed by an outline of its criteria, as well as a general overview of methods and algorithmic approaches to tackle the problem. As the resulting data set will be a tradeoff between utility and individual privacy, we need to optimize those measures to individual (subjective) standards. Moreover, the efficacy of an algorithm strongly depends on the background knowledge of a potential attacker as well as the underlying problem domain. I will therefore conclude the session by contemplating an interactive machine learning (iML) approach, pointing out how domain experts might get involved to improve upon current methods.
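For readers unfamiliar with the criterion itself, a minimal sketch (with invented records and quasi-identifiers): a table is k-anonymous when every combination of quasi-identifier values occurs at least k times, so no individual can be singled out within a group smaller than k.

```python
# Check the k-anonymity criterion over a set of quasi-identifiers:
# every quasi-identifier combination must occur at least k times.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"age": "30-40", "zip": "801*", "diagnosis": "flu"},
    {"age": "30-40", "zip": "801*", "diagnosis": "asthma"},
    {"age": "40-50", "zip": "802*", "diagnosis": "flu"},
]

# the third record forms a group of size 1, so 2-anonymity fails
print(is_k_anonymous(records, ["age", "zip"], 2))  # False
```

Checking the property is cheap; the NP-hard part is finding a generalization of the data (e.g., coarser age ranges or zip prefixes) that achieves k-anonymity while losing as little utility as possible.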
About: Bernd is a PhD Student at the Holzinger Group HCI-KDD and is currently working on the topic of the production of open data sets.
14:30-15:00 GPU-driven deep learning analysis of melanocytic skin lesion image data using Caffe
About: PhD Student at the Holzinger Group HCI-KDD
Abstract: This tutorial will present the installation and usage of Caffe – see caffe.berkeleyvision.org – a popular deep learning framework developed by the Berkeley Vision and Learning Center, University of California. It will first discuss how to obtain, compile, and run the Caffe software under Linux on a GPU-equipped workstation. We will then see how, through the use of a mid-range gaming GPU, training times can be reduced by a factor of 20 when compared to a CPU. Lastly, a concrete example of a deep learning task will be presented in the form of a live analysis of a confocal laser scanning microscopy dataset of skin lesion images, by training a model that automatically classifies these images into malignant and benign cases with a high degree of accuracy. Tutorial length: approximately 30 minutes.
15:00-15:30 Towards Deep Learning on Text
About: PhD Student at the Holzinger Group HCI-KDD
Abstract: The task of automatically assigning categories/labels (e.g., diseases) to text documents is called document classification. Due to the massive amount of narrative text documents produced in the clinical setting, there exist various interesting applications for document classification in this area. The goal of this tutorial is to present a toolbox of machine learning techniques which can be applied to this problem. From the perspective of machine learning, document classification is a challenging task, especially due to the high dimensionality and sparsity of document-term vectors. Apart from the traditional bag-of-words representations, we will focus on word vector feature representations and illustrate, using various machine learning models, how to classify documents into categories. A special issue one has to face when using word vectors is the variable size of input documents. Therefore, we will also cover sequential machine learning models, such as recurrent neural networks (RNNs), to deal with this issue. At the end of this tutorial, the audience will have an intuition of how to approach document classification with machine learning technologies.
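As a tiny illustration of the bag-of-words representation mentioned above (the example documents are invented), each document is mapped to a term-count vector over a shared vocabulary, so documents of different lengths all yield vectors of the same size:

```python
# Build a shared vocabulary and map each document to a term-count vector.
docs = [
    "patient reports chest pain",
    "chest x ray shows pneumonia",
]

vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [bow(d) for d in docs]
print(vocab)
print(vectors)
```

With realistic vocabularies of tens of thousands of terms, these vectors become very high-dimensional and mostly zero, which is precisely the dimensionality and sparsity challenge the tutorial addresses, and which word vector representations and RNNs sidestep.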
16:00-16:30 Fast Saddle Point Algorithms for Variable Selection
About: Visiting Postdoc Researcher at the Holzinger Group HCI-KDD from the Lehrstuhl 8 Artificial Intelligence Group TU Dortmund
Abstract: In this talk we introduce saddle-point (SP) formulations and algorithms to solve variable selection problems. Compared to the non-SP alternatives, SP algorithms do not require users to specify a sensitive parameter that can affect solution quality and/or algorithm performance. In particular, we propose a simple primal-dual proximal extragradient algorithm to solve the generalized Dantzig selector (GDS) estimation, based on a new convex-concave saddle-point (CCSP) formulation and a linear gradient extrapolation technique. We demonstrate our algorithm on variable selection problems: (1) biomarker selection from genetic data, and (2) network selection from integrated cancer data of multiple types from TCGA.
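For orientation only (the exact CCSP formulation used in the talk may differ), the Dantzig selector can be stated as a constrained ℓ1 problem, and one standard route to a convex-concave saddle-point form uses the convex conjugate of the λ-scaled ℓ1-norm:

```latex
% Constrained form of the Dantzig selector, with design matrix A,
% response y, and regularization level \lambda:
\min_{x} \; \|x\|_1
\quad \text{s.t.} \quad \|A^{\top}(Ax - y)\|_{\infty} \le \lambda

% Since \sup_{u} \langle u, z \rangle - \lambda \|u\|_1 equals 0 when
% \|z\|_{\infty} \le \lambda and +\infty otherwise, the constraint can be
% absorbed into a convex-concave saddle-point problem:
\min_{x} \; \max_{u} \; \|x\|_1
  + \langle u,\, A^{\top}(Ax - y) \rangle - \lambda \|u\|_1
```

The objective is convex in x and concave in u, so primal-dual extragradient methods of the kind described in the talk apply directly, without an extra step-size-like parameter tied to the constraint.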
16:30-17:30 Tutorial: Julia’s potential for research and education on the example of Machine Learning
About: PhD Student at the Holzinger Group HCI-KDD
Abstract: More and more researchers and companies embrace the spirit of Open Source. This is especially true for Machine Learning. When it comes to scientific computing, the community is split between various free and commercial environments. One such environment is known as Julia (see julialang.org), which is an open source, high-level, high-performance dynamic programming language for scientific computing. In this talk, we will introduce and motivate the Julia language by discussing its strengths and weaknesses. We will use Machine Learning as a common thread throughout this talk to help us explore Julia’s potential for research and education. For example, we will discuss the two-language problem and how Julia could be an answer to this problem, which is arguably associated with many scientific computing environments. Tutorial length: approximately 60 minutes.