Big data cannot do without statistics
Data science and statistics are closely linked, says Spinoza laureate and Stochastics professor Aad van der Vaart. We talk with him about the big data hype, genome research and collaboration with other disciplines. ‘Statistics helps wherever data is not perfect.’
Some people say: data science is not really new, it's mainly just statistics. Do you agree with that?
‘Yes, as a statistician I certainly do. There has always been data, and statistics has always been needed to analyze it. Ever more data is available, but that is not a development of just the past few years. Fifteen years ago, we were already speaking of a data explosion. And ten years ago, there were people who said: if we perform a data analysis on all the data collected by insurance companies, we will no longer need doctors.’
But it's not that simple?
‘No, and it amazes me how easily people speak about data analysis in the big data hype, mainly outside science. Some seem to think that if we have a lot of data, it's just a matter of finding patterns. But it is more complicated than that, of course. We must not only find correlations, but also causality. In order to find new medical treatments, for example, it's not enough to see a pattern in the data. You should also ensure that the people you are comparing do not accidentally differ in many other ways.’
And for that, statistics is needed?
‘Exactly. Statistics helps wherever data is not perfect, and that's almost always the case. Take medical research, for example, where you want to compare the brains of people using a PET scanner. There is a lot of noise in such scans. First of all, in the physical process itself: radioactivity, scattering of particles. But you also have to deal with variation between people, in their genes, but also in the things they’ve done the night before, or in the small movements they make while in the scanner. You want to filter out all that noise in order to find the signal that really makes the difference. That is what statistics is used for.’
Does big data call for new methods in statistics?
‘Yes. Not only is there more and more data available, there are also more variables, which leads to lots of new correlations. So it only gets more complicated; everything is connected to everything else. In statistics, we are trying to develop new methods to still find patterns in all this complex data.’
You do a lot of research on Bayesian statistics. What is that exactly?
‘It is one of the two paradigms of statistics. Bayesian statistics is a very nice way to draw conclusions in terms of uncertainties. You formulate the uncertainty you have before collecting the data in a so-called a priori probability distribution, and then update it with data. In this form of statistics we describe the world in probabilities, even before the data has been collected, and whenever new data comes in, we update those probabilities.’
Could you give an example?
‘Imagine that you are tested for a particular disease and that the result indicates that you have the disease. How big is the chance that you are actually ill? In order to know this, we consider the likelihood of a false positive (the probability that the test indicates that you are ill, when this is not actually the case). But what we also take into consideration in Bayesian statistics, is the a-priori information: how many people in the total population suffer from the disease. If a disease is very rare, the chance that someone has it is very small, even if the test result suggests something else. Doctors also work according to this principle: they do not immediately sound the alarm if they see a symptom that could point to a rare disease, since it is unlikely that the patient really has the disease. Bayesians weigh probabilities in this way in every analysis, even when they are dealing with large amounts of complex data.’
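The disease-test reasoning above can be written out directly with Bayes' rule. A minimal sketch; the prevalence, sensitivity and false positive rate below are illustrative assumptions, not figures from the interview:

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' rule.

    prevalence          -- a priori probability of having the disease
    sensitivity         -- P(positive test | disease)
    false_positive_rate -- P(positive test | no disease)
    """
    # Total probability of testing positive: true positives plus false positives
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_positive

# A rare disease: 1 in 1000 people have it; the test detects 99% of cases
# but also flags 5% of healthy people (assumed numbers).
p = posterior_given_positive(prevalence=0.001, sensitivity=0.99,
                             false_positive_rate=0.05)
print(round(p, 3))  # about 0.019: despite the positive test, illness is unlikely
```

The rarity of the disease dominates: even with a sensitive test, most positives come from the much larger healthy group, which is exactly the weighing of a priori information described above.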
In what kind of research is this applied?
‘In genome research, for example. In the past fifteen years, researchers have become able to measure an entire genome, that is, all of someone's genes. At the moment, a lot of studies focus on how all the genes cooperate in networks and which genes are active in a given situation, for example in a given disease. Researchers are looking for links between genes, but the sheer number of genes makes this very complex. Using Bayesian statistics, we can, for instance, add as a priori information that, in a certain situation, only a small number of all the genes are important. And, more importantly, it makes it easier to include information from previous research, which is often available in databases.’
Do you also collaborate with researchers from a completely different area sometimes?
‘Yes. For example, I did a research project with historical demographers. We investigated the life expectancy of people in the 17th century. Church records provided a lot of data on births, marriages and deaths, and based on that the demographers wanted to estimate how long people lived at that time. But the data was messy: not everyone's date of death could be found. So, in order to estimate life expectancy as accurately as possible, statistical correction was necessary.’
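The interview does not name the correction that was used; one standard statistical tool for records where some death dates are missing (right-censored observations) is the Kaplan–Meier estimator. A minimal sketch on made-up toy data:

```python
def kaplan_meier(times, observed):
    """Return (time, survival probability) pairs from possibly censored data.

    observed[i] is False when the death date is unknown (the record is
    censored); such records still count as 'at risk' up to their last
    known time, which is what corrects the naive average.
    """
    pairs = sorted(zip(times, observed))
    n_at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = sum(1 for tt, obs in pairs if tt == t and obs)
        removed = sum(1 for tt, _ in pairs if tt == t)
        if deaths:
            surv *= 1 - deaths / n_at_risk       # product-limit update
            curve.append((t, surv))
        n_at_risk -= removed
        i += removed
    return curve

# Toy ages (years); False marks people whose death was never recorded
ages = [45, 50, 50, 60, 70, 80]
known = [True, True, False, True, False, True]
print(kaplan_meier(ages, known))
```

Simply averaging only the known death ages would bias the estimate; the product-limit form lets the censored records contribute to the denominator for as long as they were observed.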
Do you believe there is enough collaboration between statisticians and researchers from other disciplines?
‘There’s already a lot of collaboration. It is something many statisticians do; statistics can be used in so many areas of science. But I think it's not enough yet. From my perspective, I’d say there should be more statisticians who spend part of their time on fundamental research and the other part working with researchers from other scientific disciplines. I also want to use part of the funds from my Spinoza Prize for that: appointing more statisticians and forging links with other disciplines.’
How do you expect your field to develop in the future?
‘It seems that we’ve arrived at a point where computers can really add something important. There is so much computing power now that new things can really happen. Statistics will still be needed, perhaps even more than ever. Think of the self-driving car: such a car must have all kinds of sensors and has to process lots of data. In all this data, of course, there will be a lot of variation and noise. Statistics will be necessary in order to keep control of that.’
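The interview names no particular technique for taming sensor noise; as an illustration of separating signal from noise, here is a minimal sketch using an exponentially weighted moving average, one of the simplest statistical smoothers (all readings below are made up):

```python
def ewma(readings, alpha=0.3):
    """Smooth a stream of noisy sensor readings.

    alpha in (0, 1] controls how strongly each new reading overrides
    the running estimate: small alpha = heavy smoothing.
    """
    estimate = readings[0]
    smoothed = [estimate]
    for x in readings[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
        smoothed.append(estimate)
    return smoothed

# Toy distance readings (meters) with one spurious spike at index 5
noisy = [10.0, 10.4, 9.7, 10.2, 9.9, 30.0, 10.1]
print(ewma(noisy))
```

A real vehicle would use far more sophisticated filters (a Kalman filter is the classic choice), but the principle is the same: each estimate weighs new evidence against what was already believed, rather than trusting any single noisy reading.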
Aad van der Vaart obtained his PhD at Leiden University in 1987. He then worked in College Station (Texas), Paris and Berkeley. He was also a visiting professor at Harvard and in Seattle. For many years, Van der Vaart was a professor at VU Amsterdam. Since 2012 he has been professor of Stochastics at Leiden University, where he became Scientific Director of the Mathematical Institute in 2015. Van der Vaart won the C. J. Kok award in 1988 and the Van Dantzig Prize in 2000. In 2015, he received the Spinoza Prize for his pioneering research in statistics.
This article is part of a series of interviews with researchers from the Leiden Centre of Data Science (LCDS). LCDS is a network of researchers from different scientific disciplines, who use innovative methods to deal with large amounts of data. Collaboration between these researchers leads to new solutions to problems in science and society.