Lecture | Seminar
Data Science meets Humanities
- Date: Friday 12 April 2019
- Followed by drinks and snacks at the Faculty Club
- Location: P.J. Veth, Nonnensteeg 1-3, 2311 VJ Leiden
- Room: 1.01
12.00-14.00 | Additional program upon registration | Guided tour (1 hour) through the famous 17th-century library Thysiana. Time slot will be communicated with the confirmation.
14.30 | Doors open | Coffee, tea
15.00 | Wessel Kraaij | Opening
15.10 | Martin Kroon | Towards automatic detection of syntactic differences
15.30 | Jan Odijk | Data Science for generating linguistic knowledge
16.00 | Break | Drinks and snacks
16.15 | Manolis Fragkiadakis & Victoria Nyst | Creating a tool supporting annotation and analysis of sign language corpora
16.45 | Paul Vierthaler | Extracting and sourcing quotes in a large Chinese corpus
17.15 | Sjef Barbiers | Wrap up
17.20 | Faculty Club: drinks & snacks
Register for free via this link.

Programme
The Data Science Research Programme's travelling seminar series across all faculties is visiting the Faculty of Humanities! In this era of globalisation, the humanities are more important than ever. Migration, integration, trade and technology are blurring the borders between countries and cultures. To be able to cooperate and live together, it is crucial that we understand each other. Research and education range from languages, cultures and area studies to history, philosophy, the arts and religious studies. We are driven by passion and curiosity about the world around us.
Centre for Digital Humanities
The Centre for Digital Humanities focuses on the role of the humanities in a digital age, using computational research approaches. It promotes the informed and critical use of digital technology and computational approaches in art, literature, history, area studies, linguistics, philosophy, religion, and other disciplines of the humanities. Today, cultural artifacts are increasingly available in digitised or born-digital form, and computing power, which has grown exponentially over the last several decades, is more accessible than ever.
The centre aims to bring together students and faculty from across the Leiden community to explore the crossroads of the humanities and computing. Its inclusive approach envisions the digital humanities as an umbrella under which researchers and students adapt computational and computer-aided methods to access, analyse, sequence, and present cultural artifacts in new ways.
Data Science within Humanities
At this seminar, the various applications of Data Science within the Humanities will be demonstrated in a series of short presentations by PhD candidates who are part of the Data Science Research Programme and their supervisors from the Faculty of Humanities. The topics range from new methods for comparing sign language corpora to the possibilities for automatic detection of cross-linguistic syntactic differences.
The afternoon will close with drinks and snacks at the Faculty Club in the Academy Building.
Martin Kroon: Towards automatic detection of syntactic differences
“The field of comparative syntax aims at developing a theoretical model of the syntactic properties all languages have in common and of the range and limits of syntactic variation. Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance the development of such a model. Eventually we aim for an algorithm that automatically extracts syntactic differences from parallel corpora. Currently we investigate the application of pattern mining algorithms based on the minimum description length principle to this task. Can they be used to automatically extract syntactic differences? In this talk I will argue that they can, showing some promising results.”
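The abstract stays at the level of the research goal. As a much-simplified stand-in for the MDL-based pattern mining it refers to (and not the speaker's actual method), the sketch below merely compares part-of-speech bigram frequencies across the two sides of a toy parallel corpus to surface candidate word-order differences; all data and tag labels are invented for illustration.

```python
# Toy illustration of surfacing word-order differences from aligned sentence pairs
# by comparing part-of-speech bigram frequencies on the two sides of a parallel
# corpus. This is NOT the MDL-based pattern mining described in the talk, just a
# minimal stand-in to show the general shape of the task. All data is invented.
from collections import Counter

# Each aligned pair: (POS tags of a source sentence, POS tags of its translation).
aligned_pairs = [
    (["PRON", "VERB", "DET", "NOUN"],        ["PRON", "VERB", "DET", "NOUN"]),
    (["PRON", "AUX", "DET", "NOUN", "VERB"], ["PRON", "AUX", "DET", "NOUN", "VERB"]),
    (["PRON", "VERB", "ADV"],                ["PRON", "ADV", "VERB"]),
]

def bigram_counts(tag_sequences):
    """Count POS bigrams over a collection of tag sequences."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(zip(tags, tags[1:]))
    return counts

src_counts = bigram_counts(src for src, _ in aligned_pairs)
tgt_counts = bigram_counts(tgt for _, tgt in aligned_pairs)

# Bigrams whose relative frequency differs most between the two sides are
# candidate syntactic differences (e.g. verb placement).
total_src, total_tgt = sum(src_counts.values()), sum(tgt_counts.values())
diffs = {
    bigram: src_counts[bigram] / total_src - tgt_counts[bigram] / total_tgt
    for bigram in set(src_counts) | set(tgt_counts)
}
for bigram, delta in sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5]:
    print(bigram, round(delta, 3))
```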
Jan Odijk: Data Science for Generating Linguistic Knowledge
“As a linguist, I am interested in obtaining linguistic knowledge, i.e. knowledge about the nature of language in general and about specific languages in particular. I investigate whether Data Science can contribute to the generation of linguistic knowledge. I will first sketch the system that we are currently working on, which will generate all kinds of statistics on linguistic properties of any construction of the Dutch language. This system is an extension of the GrETEL 4 treebank application. Using a concrete example, I will illustrate for which kinds of properties this system will be able to generate statistics.
Then I will sketch three domains of research for which I hope to be able to use this system: (1) for research into properties of the Dutch language, not only for testing hypotheses and theories, but also for suggesting undiscovered relations between properties of a construction; (2) for research into the study of (first) language acquisition by young children, possibly even for setting up a simulation of language acquisition; and (3) for languages for which no parsing technology exists but just a parallel corpus with a language for which parsing technology does exist.”
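As I understand it, GrETEL lets linguists query syntactically annotated corpora, which at the back end comes down to running XPath queries over parse trees and aggregating the hits. The sketch below illustrates that kind of treebank counting only; the directory layout and the LASSY/Alpino attribute names (cat, rel, pt, lemma) are assumptions on my part, and this is not the GrETEL 4 extension the talk describes.

```python
# Rough sketch of treebank counting of the kind GrETEL builds on: run an XPath
# query over Alpino/LASSY-style dependency trees and tally the hits. The
# directory layout and attribute names (cat, rel, pt, lemma) follow the LASSY
# XML conventions as I understand them and should be checked against the actual
# treebank; this is not GrETEL's own implementation.
from collections import Counter
from pathlib import Path
from lxml import etree

TREEBANK_DIR = Path("treebank/")  # assumed: one Alpino XML file per sentence

# Example query: adjectives heading an adjectival phrase, so that we can study
# which degree modifiers ("heel", "erg", ...) combine with which adjectives.
QUERY = '//node[@cat="ap"]/node[@rel="hd" and @pt="adj"]'

modifier_counts = Counter()
for xml_file in TREEBANK_DIR.glob("*.xml"):
    tree = etree.parse(str(xml_file))
    for adj in tree.xpath(QUERY):
        # Look for sibling modifier nodes and record each (modifier, adjective) pair.
        for mod in adj.getparent().xpath('node[@rel="mod"]'):
            modifier_counts[(mod.get("lemma"), adj.get("lemma"))] += 1

for (modifier, adjective), count in modifier_counts.most_common(10):
    print(f"{modifier} + {adjective}: {count}")
```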
Manolis Fragkiadakis & Victoria Nyst: Creating a tool supporting annotation and analysis of sign language corpora
“Over the last few years, various corpus projects documenting sign languages have started all over the world. Between 2007 and 2014, four large video corpora of West African sign languages were compiled at Leiden University. Due to the lack of an efficient and user-friendly orthography, sign language corpora typically consist of hundreds of hours of time-aligned video annotated in ELAN using glosses from a spoken language. The annotation process for creating large, searchable corpora is still extremely labor intensive. Machine learning offers exciting opportunities for facilitating the annotation and analysis, both quantitative and qualitative, of sign language corpora.”
“The aim of our current project is to develop a tool that supports the annotation and analysis of sign language corpora, including the automatic identification of the presence of a sign and of its fundamental formal properties (i.e. handshape, location/movement, and number of hands). A challenge for this type of tool is the diverse nature of sign language videos, in terms of background, number of signers, skin color of signers, and quality of the visual image, among others. The tool currently being developed should be able to cope with as much of this diversity as possible, using a relatively limited set of training material. So far, the following functionalities have been developed: 1) recognition of the exact time frame in which a sign occurs, 2) removal of redundant information from the raw video using a pose estimation framework (i.e. OpenPose), and 3) use of the extracted hand locations to train and test four different classifiers. The result of this process so far is a tool that uses XGBoost to accurately predict the span of a sign and automatically create a time slot for annotation.”
Paul Vierthaler: Extracting and Sourcing Quotes in a Large Chinese Corpus
“In this talk I will discuss my recent research into developing algorithms that facilitate the study of large collections of Chinese documents. I will introduce my approach for identifying intertextuality (that is, when one text copies from another) and discuss some of my recent experiments in developing machine learning algorithms to predict the origin of the detected quotes. This process of automatically detecting source materials is aimed at tracing lineages of information in late imperial Chinese literature, a central concern of my current monograph project.”
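Intertextuality detection of this kind is commonly bootstrapped by looking for character n-grams shared between a candidate text and possible source texts. The toy sketch below shows only that basic idea; the source labels are placeholders, and the speaker's actual pipeline, including the machine-learning step that attributes quotes to sources, is of course far more involved.

```python
# Basic illustration of n-gram-based intertextuality detection: find character
# n-grams shared between a candidate passage and a corpus of possible sources,
# and rank the sources by overlap. This is a toy, not the speaker's pipeline;
# the source labels are placeholders and the snippets are well-known classical
# lines used only as stand-ins.
from collections import Counter

def char_ngrams(text, n=4):
    """Set of overlapping character n-grams (Chinese text needs no tokenisation)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

sources = {
    "source_A": "学而时习之不亦说乎有朋自远方来不亦乐乎",
    "source_B": "道可道非常道名可名非常名",
}
candidate = "其书引学而时习之不亦说乎以证其说"

candidate_grams = char_ngrams(candidate)
scores = Counter()
for title, text in sources.items():
    scores[title] = len(candidate_grams & char_ngrams(text))

# The source sharing the most n-grams is the most likely origin of the quote.
for title, overlap in scores.most_common():
    print(title, overlap)
```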

Additional program
As an optional part of the program, a guided tour through the famous 17th-century library Thysiana is offered. The tour takes place prior to the main program, lasts one hour between 12.00 and 14.00, and is only possible upon registration.
The Data Science Research Programme is associated with LIACS and MI.