Universiteit Leiden


Research programme

Data Science

The Leiden Centre of Data Science Research Programme brings data science together with all other academic domains. It makes the university's unique data collections available.

Wessel Kraaij

Today, more data than ever before are gathered in all scientific fields. If opened up, these data could provide researchers with tremendous amounts of useful information.

Joint initiative of seven faculties

The Data Science Research Programme is a joint effort of all seven faculties of Leiden University. PhD students play key roles within the programme. In total, at least 21 PhD candidates will be appointed. They will work on the development and application of data science in various fields of research at the university.

Combine expertise

The PhD candidates will each have two supervisors or co-supervisors, one a specialist in data science, and one from the faculty's research field. The aim is for the programme to have an open structure: besides the PhD candidates, other researchers are very welcome to join in the research and students can work on projects.

DSRP PhD students

At the moment, the Data Science Research Programme consists of the following projects.


Archaeology

Automating archaeological object detection in remotely sensed data

Environmental data feature a variety of spatial, temporal, and spectral dimensions that potentially carry relevant information. Analysing such complex data, e.g., for site detection, calls for new tools tailored to archaeological requirements. To this end, this project aims at a generic approach to (semi-)automated archaeological prospection that allows a wide variety of archaeological traces to be detected across different data sources.

Big data in archaeology: harnessing the hidden knowledge in the “graveyard” of Malta reports

This project will investigate the analysis and indexing of the full corpus of archaeological reports produced over the last 20 years of Malta research, which numbers more than 50,000 and is growing quickly. The goal is to establish a visual search and querying service that lets researchers quickly retrieve the most valuable digital resources, so that they can integrate and synthesise the results into a coherent narrative of the past.

Governance and Global Affairs

Automated text analysis of policy-related documentation

In this project we aim to further develop the coding of text in policy-related documentation, linking to existing efforts to include text analysis in the area of governance and global affairs. Since policy-related text is often well formulated, we will follow a syntactic approach. By developing new machine learning algorithms and applying existing ones, this project aims to shed more light on ideas, positions and cleavages in current political and policy debates.

Improving citizens’ participation in public service delivery: the possibilities of data dashboards

In order to use big data in research on policy(-making), an automated ‘coupling’ assistant is needed that recognises and connects different data streams so that they can be analysed without time-consuming manual procedures. It would be of immense value to identify the limits and opportunities for social science research to make use of data dashboards. This project aims at exploring these questions and dilemmas, requiring knowledge of both data science and the social sciences.


Humanities

African sign languages

The goal of this project is to innovate some of the world's most widely used tools for the analysis of signed languages. This will include expanding the functionalities of SignBank, a lexical database for sign language corpora, with the purpose of enabling cross-corpus compatibility. The project will also explore ways in which automated image analysis (including 2D and 3D images) can be used for semi-automated lemma generation and the encoding of basic phonological features.

Detecting cross-linguistic syntactic differences automatically

The main goal of comparative syntactic research is to discover the syntactic principles that all natural languages have in common, but so far it has been impossible to compare large sets of syntactic constructions in large sets of languages systematically and automatically. The online availability of parallel text corpora and software tools to align, enrich, search and analyse them has the potential to make automatic massive systematic cross-linguistic syntactic comparison possible for the first time.


Law

Measuring relevance and relations of Dutch legal publications

Legal scholars and professionals are confronted with a rapidly increasing volume of legal publications. Only some of these publications are relevant enough to be cited. This project aims to determine which documents these are, and whether alternative metrics are a reliable way to predict whether documents will be cited, so that the most relevant publications can be presented to the user first.

The international tax system as a complex system

In this project we aim to apply the research perspective of complexity science to the international tax system in order to investigate whether and how we can provide a (mathematical) understanding of its behaviour. For example, network patterns and/or gaming behaviour might provide fresh insights into, e.g., avoidance patterns, the rise and fall of tax havens, and business investment strategies over time.

Medicine / Leiden University Medical Center

Obesity-related diseases and mortality

The primary aim of the project is to analyse the vast Netherlands Epidemiology of Obesity (NEO) database, a unique, large, and valuable data source for studying the many pathways that may lead to obesity-related diseases. It includes data from many participants, many different sources of clinical information, and a large number of clinical endpoints. The secondary goal is to link the data to external databases to unravel the pathophysiology of obesity-related diseases.

HyperImage: Visual analytics techniques for biomarker discovery in massive 3D-omics datasets

“Omics imaging” techniques enable detailed profiling of entire tissue sections, producing images where each pixel contains, e.g., a mass spectrum with tens to thousands of values. The high dimensionality, sheer volume and non-linear structure of such image data pose considerable challenges for analysis and interpretation. The visual analytics technologies developed in this project will open up the full biomarker discovery potential of 3D “omics imaging” techniques.


Science

Deciphering the pharmacometabolome of statin therapy to enable precision medicine

The aim of this project is to discover novel biomarkers that can predict variability in statin treatment response and shed light on the different contributors to that variation. To this end, we will perform an integrated analysis of the *omics datasets in the Rotterdam Study. We will use statistical learning algorithms and increase the statistical power and biological relevance of the analyses by constraining the molecular profiling datasets to biochemical pathways.

A new era for nature conservation using hyperspectral and lidar data; Oostvaardersplassen as a case study

This project aims to develop advanced data analysis methods for monitoring biodiversity dynamics in nature reserves such as the Oostvaardersplassen and for increasing our understanding of them. Earth observation methodologies have improved greatly over the past decade. As a result, applications to nature management are coming within reach, but these demand new ecoinformatics tools for nature conservation, e.g., for tracking animals based on hyperspectral data, and for linking spatial and temporal patterns of animal movement to vegetation characteristics.

Social and Behavioural Sciences

Understanding scientific progress by analysing the context of scholarly citations

The objective of this project is to fundamentally improve our understanding of the ways in which science progresses. Empirical studies have used bibliographic metadata to provide relevant insights, but these studies have failed to tell us how science progresses. Supported by computational advances and improved data access, we propose a large-scale data-driven approach in which scientific progress is studied based on the full text of scientific documents.

Stacked Domain Learning for multi-domain data: a new ensemble method

This project aims to develop statistical methods for the analysis of multi-domain data that can deal with differences in data quality. For the early diagnosis of Alzheimer's disease, for example, questionnaire data, structural and functional MRI data, EEG data, and genetic data can be collected. These types of data differ not only in size, but also in quality. To obtain an accurate early diagnosis, it is important not only to identify the relevant features but also to look for cross-domain interactions.

The Data Science Research Programme is organised by Roos van de Voordt of the Faculty of Science.

Visiting address

Sylvius Building, room 1.5.14
Sylviusweg 72
2333 BE Leiden

Tel. +31 71 527 4806
