There are many different machine learning algorithms for a certain problem, but which one to chose for solving a practical problem? The comparison of learning algorithms is very difficult and is highly dependent of the quality of the data (which is most often ignored!). Sometimes it is much better to have higher quality data, instead of having more sophisticated algorithms. In a recent article in CACM [1], Tim Roughgarden [2] brings it to the point: “For almost any pair of algorithms and measure of algorithm performance such as running time or solution quality, each algorithm will perform better than the other – depending on the input data on some inputs. Note: In some cases a problem admits an instance-optimal algorithm, which is as good as every other algorithm on every input, up to a constant factor However for most problems, there is no instance-optimal algorithm, and there is no escape of the incomparability of different algorithms [2].

This brings us back to fundamental questions of machine learning and explainability (in German Erklärbarkeit als Eigenschaft eines Algorithmus) and causability (in German: Kausalisierbarkeit als Eigenschaft eines Menschen in Abgrenzung zur Kausalität) [3]: According to Roughgarden, ” …. the unreasonable effectiveness of statistical machine learning algorithms has thrown down the gauntlet to algorithms researchers, and there is perhaps no other problem domain with a more urgent need for the beyond worst-case approach.” [1], p.92

As it is well known to everybody in a typical supervised learning problem we present the learning algorithm a dataset of object-label pairs and the goal is to produce a classifier that predicts the label of as-yet-unseen objects (e.g. whether or not an image contains a cat). Here the family of deep learning algorithms achieved impressive results due to two important conditions: 1) the algorithm is trained with large amounts of data sets ad 2) we have sufficient computational power available. However, the empirical success are questionable for two important issues: 1) most neural network training algorithms use first-order methods, i.e. variants of gradient descent (the look simply for the local minima!) in order to solve nonconvex (highly non-linear and non-stationary) optimization problems that had been written off as computationally intractable. The main question is: Why do these algorithms so often converge so quickly to a local optimum, or even to a global optimum? 2) most neural network algorithms are over-parameterized, meaning that the number of free parameters (weights and biases) is considerably larger than the size of the training data sets. However, over-parameterized models are extremely vulnerable to large generalization errors (I emphasize this always in all my lectures!) but despite this facts these algorithms generalize quite well on many different problem tasks. How can we explain this? The answer likely hinges on special properties of both real-world datasets and the optimization algorithms used for neural network training (principally stochastic gradient descent).

These effects call for urgent fundamental research. For setting up such experiments artifical data generators are necessary, which can simulate worst-case behaviour!

[1] Tim Roughgarden 2019. Beyond Worst-Case Analysis. Communications of the ACM (CACM), 62, (3), 88-96, doi:10.1145/3232535.

[2] Tim Roughgarden is a Professor of Computer Science and member of the Data Science Institute at Columbia University

[3] The notion of Causability is differentiated from explainability in that Causability is a property of a human (natural intelligence), while explainability is a property of a technologial system (artificial intelligence). Das ist wichtig für die Studierenden der Holzinger-Group: Causability ist eine Eigenschaft einer Person (natürliche Intelligenz), während die Erklärbarkeit eine Eigenschaft eines technischen Systems darstellt (künstliche Intelligenz), siehe: Andreas Holzinger et al. 2019. Causability and Explainability of AI in Medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, doi:10.1002/widm.1312.

see our recent post: