Google releases its Syntactic Parser as Open Source

Google researchers spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. On May 12, 2016, Slav Petrov, based in New York and leading the machine learning for natural language group at Google, announced the release of SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a new foundation for Natural Language Understanding (NLU). The release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that the Googlers have trained and that can be used to analyze English text. Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language and that can explain the functional role of each word in a given sentence.
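To make the idea of "explaining the functional role of each word" concrete, here is a small sketch of the kind of per-word analysis a dependency parser produces. The parse below is illustrative, hand-written output (not an actual run of Parsey McParseface): each word gets a part-of-speech tag, a head word, and a dependency label.

```python
# Illustrative dependency-parse output: (word, POS tag, head word, relation).
# Exactly one word is the ROOT of the sentence; every other word points to
# a head, forming a tree over the sentence.
sentence = "Bob brought the pizza to Alice"
parse = [
    ("Bob",     "NNP", "brought", "nsubj"),  # Bob is the subject of "brought"
    ("brought", "VBD", None,      "ROOT"),   # the main verb is the root
    ("the",     "DT",  "pizza",   "det"),
    ("pizza",   "NN",  "brought", "dobj"),   # pizza is the direct object
    ("to",      "IN",  "brought", "prep"),
    ("Alice",   "NNP", "to",      "pobj"),
]

for word, tag, head, label in parse:
    print(f"{word:8s} {tag:4s} head={head or '-':8s} {label}")
```

The tag and relation names follow common dependency-annotation conventions; the exact label inventory used by a given parser may differ.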

Read more:
https://googleresearch.blogspot.co.at/2016/05/announcing-syntaxnet-worlds-most.html

Literature:

Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S. & Collins, M. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Petrov, S., McDonald, R. & Hall, K. 2016. Multi-source transfer of delexicalized dependency parsers. US Patent 9,305,544.

Weiss, D., Alberti, C., Collins, M. & Petrov, S. 2015. Structured Training for Neural Network Transition-Based Parsing. arXiv:1506.06158.

Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I. & Hinton, G. 2015. Grammar as a foreign language. Advances in Neural Information Processing Systems (NIPS), 2755-2763.

Deep Learning Playground openly available

The TensorFlow team – part of the Google Brain project – has recently open-sourced on GitHub a nice playground for testing and learning the behaviour of deep learning networks, which can also be used under the Apache License:

https://playground.tensorflow.org

Background: TensorFlow is an open-source software library for machine learning. There is a nice video, “Large Scale Deep Learning”, by Jeffrey Dean. TensorFlow is an interface for expressing machine learning algorithms along with an implementation for executing such algorithms on a variety of heterogeneous systems, ranging from smartphones to high-end computer clusters and grids of thousands of computational devices (e.g. GPUs). The system has been used for research in various areas of computer science (e.g. speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, computational drug discovery). The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license on November 9, 2015, and are available at www.tensorflow.org
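The core idea behind "expressing machine learning algorithms as an interface" is that computations are described as dataflow graphs, which are then executed on whatever hardware is available. As a toy, pure-Python sketch of that separation between graph construction and execution (this is not the TensorFlow API, just the underlying idea):

```python
# Toy dataflow graph: each node holds an operation and its input nodes.
# Building the graph and evaluating it are separate steps, mirroring
# (in miniature) how TensorFlow first constructs a computation graph
# and then runs it on some device.
class Node:
    def __init__(self, op, inputs=()):
        self.op, self.inputs = op, inputs

    def eval(self):
        # recursively evaluate inputs, then apply this node's operation
        return self.op(*(n.eval() for n in self.inputs))

def const(v):
    return Node(lambda: v)

def add(a, b):
    return Node(lambda x, y: x + y, (a, b))

def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))

# Build the graph for y = 2*x + 1 with x = 3, then execute it.
y = add(mul(const(2), const(3)), const(1))
print(y.eval())  # 7
```

In TensorFlow itself, the nodes are tensor operations and the execution step can be dispatched to CPUs, GPUs, or distributed clusters without changing the graph definition.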

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J. & Devin, M. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467.

It is also discussed on episode 24 of Talking Machines.

Human-in-the-Loop

Interactive machine learning for health informatics: when do we need the human-in-the-loop?

Machine learning (ML) is the fastest growing field in computer science, and health informatics is among the greatest challenges. The goal of ML is to develop algorithms which can learn and improve over time and can be used for predictions. Most ML researchers concentrate on automatic machine learning (aML), where great advances have been made, for example, in speech recognition, recommender systems, or autonomous vehicles. Automatic approaches greatly benefit from big data with many training sets. However, in the health domain we are sometimes confronted with small numbers of data sets or rare events, where aML approaches suffer from insufficient training samples. Here interactive machine learning (iML) may be of help, having its roots in reinforcement learning, preference learning, and active learning. The term iML is not yet well established, so we define it as “algorithms that can interact with agents and can optimize their learning behavior through these interactions, where the agents can also be human.” This “human-in-the-loop” can be beneficial in solving computationally hard problems, e.g., subspace clustering, protein folding, or k-anonymization of health data, where human expertise can help to reduce an exponential search space through heuristic selection of samples. Therefore, what would otherwise be an NP-hard problem reduces greatly in complexity through the input and the assistance of a human agent involved in the learning phase. Most of all, the human-in-the-loop can bring in conceptual knowledge, “intuition”, expertise and explicit knowledge which current AI is completely lacking!

We define iML-approaches as algorithms that can interact with both computational agents and human agents *) and can optimize their learning behavior through these interactions.

*) In active learning such agents are referred to as “oracles”
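The oracle notion from active learning can be sketched in a few lines. In the minimal sketch below the oracle is just a labelling function; in an iML setting it would be a human expert, queried only about the samples the current model is least certain about, which is one way such an agent can prune a large search space. All names and the uncertainty measure here are illustrative assumptions, not from the article.

```python
# Minimal active-learning query step (illustrative sketch).
# scores[s] is the model's probability that sample s is positive;
# scores near 0.5 indicate the model is unsure about that sample.
def uncertainty(score):
    return abs(score - 0.5)  # smaller value = more uncertain

def query_oracle(unlabeled, scores, oracle, budget=2):
    # rank the unlabeled samples by uncertainty and ask the oracle
    # (a human expert in iML) about only the `budget` most uncertain ones
    ranked = sorted(unlabeled, key=lambda s: uncertainty(scores[s]))
    return {s: oracle(s) for s in ranked[:budget]}

scores = {"a": 0.51, "b": 0.95, "c": 0.05, "d": 0.48}
labels = query_oracle(["a", "b", "c", "d"], scores,
                      oracle=lambda s: s in ("a", "b"))
print(sorted(labels))  # ['a', 'd'] – the two most uncertain samples
```

The point of the budget is exactly the iML argument above: the expensive resource (human expertise) is spent only where it reduces the search space most.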

From black-box to glass-box: where is the human-in-the-loop?

The first question we have to answer is: “What is the difference between the iML approach and the aML approach, i.e., unsupervised, supervised, or semi-supervised learning?”

Scenario D – see slide below – shows the iML-approach, where the human expert is seen as an agent directly involved in the actual learning phase, step-by-step influencing measures such as distance, cost functions, etc.

Obvious concerns emerge immediately, and one can argue: what about the robustness of this approach, the subjectivity, the transfer of the (human) agents? Many questions remain open and are subject to future research, particularly in evaluation, replicability, robustness, etc.

Human-in-the-loop - Interactive Machine Learning

The iML-approach

Read full article here:
https://link.springer.com/article/10.1007/s40708-016-0042-6/fulltext.html
https://www.mendeley.com/catalog/interactive-machine-learning-health-informatics-we-need-humanintheloop

Papers due April 30, 2016: Privacy Aware Machine Learning (PAML) for Health Data Science

We are organizing a special session on Privacy Aware Machine Learning for Health Data Science at the 11th International Conference on Availability, Reliability and Security (ARES and CD-ARES), Salzburg, Austria, August 29 – September 2, 2016,

supported by the International Federation for Information Processing (IFIP), TC 5 and WG 8.4 and WG 8.9
https://cd-ares-conference.eu
https://www.ares-conference.eu

Keynote Talk by Bernhard SCHÖLKOPF, Max Planck Institute for Intelligent Systems, Empirical Inference Department

Bernhard Schölkopf as Keynote Speaker at the ARES/CD-ARES conference in Salzburg

We are proud to welcome Bernhard Schölkopf as Keynote Speaker to the ARES/CD-ARES conference in Salzburg

Machine learning is the fastest growing field in computer science [Jordan, M. I. & Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science, 349, (6245), 255-260], and it is well accepted that health informatics is amongst the greatest challenges [LeCun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521, (7553), 436-444], e.g. large-scale aggregate analyses of anonymized data can yield valuable insights addressing public health challenges and provide new avenues for scientific discovery [Horvitz, E. & Mulligan, D. 2015. Data, privacy, and the greater good. Science, 349, (6245), 253-255]. Privacy is becoming a major concern for machine learning tasks, which often operate on personal and sensitive data. Consequently, privacy, data protection, safety, information security and fair use of data are of utmost importance for health data science.

The amount of patient-related data produced in today’s clinical setting poses many challenges with respect to collection, storage and responsible use. For example, in research and public health care analysis, data must be anonymized before transfer, for which the k-anonymity measure was introduced and successively enhanced by further criteria. As optimal k-anonymization is an NP-hard problem that cannot be solved exactly by automatic machine learning (aML) approaches, we must often make use of approximations and heuristics. As data security is not guaranteed at a given k-anonymity degree, additional measures have been introduced in order to refine results (l-diversity, t-closeness, delta-presence). This motivates methods, methodologies and algorithmic machine learning approaches to tackle the problem. As the resulting data set will be a tradeoff between utility, usability and individual privacy and security, we need to optimize those measures to individual (subjective) standards. Moreover, the efficacy of an algorithm strongly depends on the background knowledge of a potential attacker as well as on the underlying problem domain. One possible solution is to make use of interactive machine learning (iML) approaches and put a human-in-the-loop, where the central question remains open: “could human intelligence lead to general heuristics we can use to improve heuristics?”
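For readers unfamiliar with k-anonymity: a table is k-anonymous with respect to a set of quasi-identifier columns if every combination of quasi-identifier values occurs in at least k records. Checking this property is cheap; the NP-hard part is finding an optimal generalization that achieves it with minimal information loss. A minimal sketch of the check (illustrative data and column names, not from the article):

```python
from collections import Counter

# Check whether a table satisfies k-anonymity with respect to a set of
# quasi-identifier columns: every quasi-identifier combination must be
# shared by at least k records. This is only the verification step, not
# an anonymization algorithm.
def is_k_anonymous(rows, quasi_ids, k):
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

# Already-generalized records: ages binned, ZIP codes masked.
records = [
    {"age": "30-40", "zip": "80**", "disease": "flu"},
    {"age": "30-40", "zip": "80**", "disease": "asthma"},
    {"age": "20-30", "zip": "81**", "disease": "flu"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # False: one group of size 1
```

The third record forms a quasi-identifier group of size 1, so 2-anonymity fails; an anonymizer would have to generalize the age or ZIP values further, trading utility for privacy exactly as discussed above.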

Research topics covered by this special session include but are not limited to the following topics:

– Production of Open Data Sets
– Synthetic data sets for learning algorithm testing
– Privacy preserving machine learning, data mining and knowledge discovery
– Data leak detection
– Data citation
– Differential privacy
– Anonymization and pseudonymization
– Securing expert-in-the-loop machine learning systems
– Evaluation and benchmarking

This special session will bring together scientists with diverse backgrounds, interested in both the underlying theoretical principles and the application of such methods for practical use in the biomedical, life sciences and health care domains. The cross-domain integration and appraisal of different fields will provide an atmosphere that fosters different perspectives and opinions; it will offer a platform for novel, crazy ideas and a fresh look at the methodologies to put these ideas into business.

Accepted papers will be published in a Springer Lecture Notes in Computer Science (LNCS) volume.

Schedule:

I) Deadline for submissions: April 30, 2016
Paper submission via:
https://cd-ares-conference.eu/?page_id=43

II) Camera-ready deadline: July 4, 2016

III) Special Session: August 30, 2016
> Conference Venue
> Travel Information Salzburg
> Lonely Planet Salzburg

The International Scientific Committee – consisting of experts from the international expert network HCI-KDD dealing with area (7), privacy, data protection, safety and security, together with additionally invited international experts – will ensure the highest possible scientific quality; each paper will be reviewed by at least three reviewers (the paper acceptance rate of the last special session was 35%).

Yahoo Labs released largest-ever anonymized machine learning data set for researchers

In January 2016, Yahoo announced the public release of the largest-ever machine learning data set to the international research community. The data set stands at a massive ~110B events (13.5 TB uncompressed) of anonymized user–news item interaction data, collected by recording the interactions of about 20M users from February 2015 to May 2015.

see: https://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning

 

January 27, 2016: Major breakthrough in AI research …

Mastering the game of Go with deep neural networks and tree search – a very recent paper in Nature:

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. & Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529, (7587), 484-489.

https://www.nature.com/nature/journal/v529/n7587/full/nature16961.html

Go (in Chinese: 圍棋, in Japanese: 囲碁) is a two-player board strategy game (EXPTIME-complete, resp. PSPACE-complete) in which the players aim to surround more territory than the opponent; despite its simple rules, the number of possible move sequences is enormous (approximately 10^761 on a 19 x 19 board, compared to approximately 10^120 in chess on an 8 x 8 board).

According to the new article by Silver et al. (2016), Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. The authors introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. The authors introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, the program AlphaGo (see: https://deepmind.com/alpha-go.html) achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
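To give a flavour of the Monte Carlo tree search component mentioned above, here is the classic UCT selection rule that vanilla MCTS uses to pick which child node to explore next. This is a generic sketch, not AlphaGo's actual algorithm: AlphaGo augments selection with policy-network priors and replaces pure rollout statistics with value-network evaluations.

```python
import math

# UCT (Upper Confidence bounds applied to Trees) selection step: pick the
# child that balances exploitation (average value so far) against
# exploration (how rarely the child has been visited).
def uct_score(child_value, child_visits, parent_visits, c=1.4):
    if child_visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = child_value / child_visits           # average simulation value
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits):
    # children: list of (move, total_value, visit_count) tuples
    return max(children,
               key=lambda ch: uct_score(ch[1], ch[2], parent_visits))

children = [("move_a", 5.0, 10), ("move_b", 9.0, 10), ("move_c", 0.0, 0)]
best = select_child(children, parent_visits=20)
print(best[0])  # move_c – unvisited, so it is explored first
```

After every child has been visited, the exploration bonus shrinks with visit count, and the search gradually concentrates simulations on the strongest moves.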

There is also a news report on BBC:

https://www.bbc.com/news/technology-35420579

Congrats to the Google DeepMind people!

 


January 26, 2016: Workshop “Machine Learning for Biomedicine”, TU Graz

Date: Tuesday, 26th January 2016, Start: 10:00, End: 17:00; Venue: Graz University of Technology,
Institute of Computer Graphics and Knowledge Visualization CGV, hosted by Prof. Tobias SCHRECK
Address: Inffeldgasse 16c, A-8010 Graz <maps and directions>

Machine learning is the fastest growing field in computer science [Jordan, M. I. & Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science, 349, (6245), 255-260], and it is well accepted that health informatics is amongst the greatest challenges [LeCun, Y., Bengio, Y. & Hinton, G. 2015. Deep learning. Nature, 521, (7553), 436-444].

Successful Machine Learning for Health Informatics requires a comprehensive understanding of the data ecosystem and a multi-disciplinary skill set, drawing on seven specializations: 1) data science, 2) algorithms, 3) network science, 4) graphs/topology, 5) time/entropy, 6) data visualization and visual analytics, and 7) privacy, data protection, safety and security – as supported by the international expert network HCI-KDD.

Program see: https://human-centered.ai/machine-learning-for-biomedicine-tugraz/

Happy Scientific 2016

We wish you a prosperous scientific 2016 with a lot of crazy ideas and successful breakthrough discoveries!

Happy New 2016

Happy New Year from the Holzinger Group HCI-KDD

Science Magazine Vol.350, Issue 6266

A proof of the importance of the human-in-the-loop

Again machine learning made it to the title page of Science: a nice further proof of the importance of the human-in-the-loop, in a paper by

Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science, 350, (6266), 1332-1338.

Whilst humans can often learn new concepts from very few examples, automated machine learning (aML) methods usually need many examples (often called: big data) to perform with similar accuracy (and with the danger of modelling artefacts, e.g. through overfitting). The authors present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. Very interesting is the fact that on a challenging one-shot classification task, this model achieves human-level performance and outperforms recent deep learning approaches!
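To make the one-shot classification setting concrete: the learner sees exactly one labelled example per class and must classify new inputs. A trivial nearest-exemplar baseline illustrates the task setup (this is emphatically not the authors' Bayesian program induction model, just the simplest possible approach to the same task, with made-up feature vectors):

```python
# One-shot classification baseline: assign a new example to the class of
# its nearest single training exemplar. Lake et al.'s model goes far
# beyond this by inferring generative programs, but the task interface
# (one labelled example per class) is identical.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def one_shot_classify(example, exemplars, distance=euclidean):
    # exemplars: {class_label: single feature vector for that class}
    return min(exemplars, key=lambda c: distance(example, exemplars[c]))

exemplars = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
print(one_shot_classify((1.0, 2.0), exemplars))  # A
```

The gap between such baselines and human performance on this task is precisely what makes the paper's human-level result notable.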

The authors also present several “visual Turing tests” probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior – a must read at: https://www.sciencemag.org/content/350/6266/1332.full