Applied statistics as a pillar of data science

Data science is now growing fast in many places, but scholars at Leiden University have been developing data science techniques for decades. Drawing on their broad expertise, Leiden statisticians are now combining the achievements of classical statistics with the latest methods of statistical and machine learning.

‘It may seem as though data science is something new, but in applied statistics we’ve actually been developing data science techniques for many years,’ says Jacqueline Meulman, Professor of Applied Statistics in the Mathematical Institute. ‘Here in Leiden, we’ve been visualising links and analysing large, complex datasets for at least 35 years.’

SPSS

For example, one of Meulman’s earlier research groups made its first contribution to the well-known statistical data analysis package SPSS, now part of IBM, back in the 1990s. This package is used worldwide by researchers, students and the private sector, and the Leiden statisticians still keep their ‘Categories’ module updated by incorporating the latest technical developments. The royalties paid by IBM are re-invested in research and teaching-related activities.

Complex data analysis

One problem in analysing big datasets is the quality of the data. ‘We’re often trying to detect a signal in the midst of a lot of noise,’ says Meulman. As an example, she mentions a metabolomics study of identical twins that asks: are the metabolic systems of twins more similar to each other than can be explained by chance? ‘Analyses of blood and urine produce vast quantities of data, but these data are always complex,’ explains Meulman. ‘For example, one of the twins may have had breakfast that morning, and the other not.’ Such datasets also contain many variables that are completely irrelevant. Meulman and her colleagues use the latest techniques to filter out these noise variables and so detect the similarities.
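The question of whether twins are ‘more similar than can be explained by chance’ can be illustrated with a permutation test. The sketch below is not the method used by Meulman's group, which the article does not describe. It is a minimal illustration on synthetic data, in which a few variables are shared within a twin pair and the rest are pure noise. All names and numbers (`n_pairs`, `n_signal`, `n_noise`, the noise levels) are assumptions made up for the example.

```python
import random


def mean_pair_distance(first, second):
    """Average Euclidean distance between paired profiles."""
    total = 0.0
    for a, b in zip(first, second):
        total += sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return total / len(first)


def permutation_p_value(twin_a, twin_b, n_perm=2000, seed=0):
    """Share of random re-pairings that look at least as similar as the real twins."""
    rng = random.Random(seed)
    observed = mean_pair_distance(twin_a, twin_b)
    shuffled = list(twin_b)
    at_least_as_close = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the true twin pairing
        if mean_pair_distance(twin_a, shuffled) <= observed:
            at_least_as_close += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_close + 1) / (n_perm + 1)


# Synthetic data (purely illustrative): 10 variables carry a signal shared
# within a twin pair, 40 variables are irrelevant noise.
rng = random.Random(42)
n_pairs, n_signal, n_noise = 30, 10, 40
twin_a, twin_b = [], []
for _ in range(n_pairs):
    shared = [rng.gauss(0, 1) for _ in range(n_signal)]
    twin_a.append([s + rng.gauss(0, 0.3) for s in shared]
                  + [rng.gauss(0, 1) for _ in range(n_noise)])
    twin_b.append([s + rng.gauss(0, 0.3) for s in shared]
                  + [rng.gauss(0, 1) for _ in range(n_noise)])

p_value = permutation_p_value(twin_a, twin_b)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value here means that randomly re-paired individuals are rarely as close as the true twins. The noise variables inflate every distance, which is why removing them makes a real similarity easier to detect.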

Statistical learning

Peter Grünwald, Professor of Statistical Learning, conducts research at the interface between statistics, machine learning and information theory. Briefly put, he develops methods that allow computers to analyse data in a statistically sound way. He gives an example of why this matters. A couple of years ago Google was in the news: the company had predicted a flu epidemic by analysing the geographic locations where people were searching heavily for words such as ‘fever’ and ‘cough’. ‘It worked once or twice, but then no more,’ says Grünwald. ‘If a program detects a pattern, you have to demonstrate that it isn’t chance. For that, you need real statistics.’

Reproducibility crisis

Building on statistical learning, the Leiden statisticians are also looking for ways to improve classical statistics with techniques devised in machine learning, an area within computer science. ‘I’m currently working on the reproducibility crisis: the fact that when research is repeated, it often doesn’t produce the same results,’ says Grünwald. ‘This may be because a researcher conducted extra experiments if their first findings weren’t significant enough to permit a firm conclusion. This creates a distorted picture: a purely chance result can suddenly appear significant. There are statistical methods to correct for this, but they’re very complicated. I’m now trying to improve those methods by applying ideas from machine learning and information theory.’
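The distortion Grünwald describes, where extra experiments are run only when the first result is not significant, can be made concrete with a small simulation. The sketch below is an illustration only and does not reproduce his methods. It assumes data with no true effect at all (standard normal values with known unit variance) and a simple two-sided z-test. The function names and the settings (`batch`, `extra_rounds`, `trials`) are hypothetical choices for the example.

```python
import math
import random


def two_sided_p(values):
    """Two-sided z-test p-value for 'mean is 0', assuming known unit variance."""
    z = sum(values) / math.sqrt(len(values))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(extra_rounds, trials=10000, batch=20, alpha=0.05, seed=1):
    """Fraction of null experiments declared significant.

    After each non-significant test the researcher collects another batch
    and tests again, up to `extra_rounds` additional times.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        data = [rng.gauss(0, 1) for _ in range(batch)]  # no real effect
        significant = two_sided_p(data) < alpha
        rounds = 0
        while not significant and rounds < extra_rounds:
            data.extend(rng.gauss(0, 1) for _ in range(batch))
            significant = two_sided_p(data) < alpha
            rounds += 1
        hits += significant
    return hits / trials


fixed = false_positive_rate(extra_rounds=0)    # one test, sample size fixed in advance
peeking = false_positive_rate(extra_rounds=4)  # up to five looks at the data
print(f"fixed design:   {fixed:.3f}")
print(f"with extra runs: {peeking:.3f}")
```

Under these assumptions the fixed design should land close to the nominal 5% false-positive rate, while the design with repeated looks should come out well above it, even though there is nothing to find in the data.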

Mathematical Institute (Statistical Science group)