Universiteit Leiden



Causal Discovery from High-Dimensional Data in the Large-Sample Limit

Developing robust algorithms and theory for establishing cause-effect relationships from observational data that scale up to large data sets

Contact Aad van der Vaart
Funding NWO TOP grant

A joint project of Prof. Tom Heskes at Radboud University Nijmegen and Prof. Aad van der Vaart at Leiden University.


Discovering causal relations from data lies at the heart of most scientific research today. Controlled experimentation, the standard and most popular method for causal discovery, is in many cases practically impossible, ethically undesirable, or too costly. About twenty years ago, scientists realized that there is an alternative: under appropriate assumptions, causal knowledge can also be derived from purely observational data. In the 'big data' era, such observational data is abundant, and being able to actually derive causal relationships from very large data sets would open up a wealth of opportunities for improving business, science, government, and healthcare.
Sadly, existing algorithms for causal discovery from observational data are not very well suited to big data: small changes in the data or in the algorithmic details can lead to significantly different causal conclusions, in particular for data sets containing many different variables and even in the limit of a large number of samples. In this project we aim to tackle these issues through a much better mathematical understanding of the appropriate asymptotic statistics. Effective causal discovery, in one way or another, hinges upon the ability to accurately and efficiently infer sparse models. We will therefore translate and extend existing work in mathematical statistics on sparse model estimation to the domain of causal inference. Improved mathematical understanding guides the development of novel, more reliable algorithms for causal discovery from big data that can control the false causal discovery rate. We demonstrate the usefulness of these algorithms on real-world problems in genomics and ecology.