Universiteit Leiden

Leiden University Centre for Digital Humanities

Nyst Project

Exploring new methods in comparing sign language corpora: analysing cross-linguistic variation in the lexicon

Victoria Nyst, LUCL

Aims of the project:

1 Develop new tools for facilitating the compilation and harmonization of a cross-linguistic ID gloss database for sign languages of deaf communities (henceforth: SLs).

2 Explore the application of methods designed for measuring and modelling variation in spoken languages to SL corpus data.

Leiden Sign Language Corpora

Between 2007 and 2014, four large video corpora of West African SLs were compiled at Leiden University under the guidance of the applicant. From 2007 to 2012, two large projects documented local SL use at various places in Mali and Ghana, leading to three digital video corpora. The first Malian Sign Language corpus contains recordings of SL use by deaf signers in Bamako and Mopti, consisting of over 27 hours of recorded discourse and featuring 65 signers (Nyst 2008; 2010). The second Malian Sign Language corpus contains the results of an SL survey in the Dogon area of Mali, notably in Bandiagara, Douentza and surrounding villages, and in Berbey, close to Hombori. This corpus contains 32 hours of signing by 68 signers and includes signed conversations, interviews and lexical items in at least three independent SL varieties (Nyst, Sylla, Magassouba, forthcoming). The third SL corpus contains around 30 hours of discourse data from 15 signers of Adamorobe Sign Language, which emerged spontaneously in response to the high incidence of hereditary deafness in the village of Adamorobe, Ghana. In 2014, the fourth extensively annotated corpus was archived, containing a representative sample (around 30 hours from 9 signers) of the emerging SL of the village of Bouakako in Côte d’Ivoire.

All corpora are annotated in ELAN (Crasborn & Sloetjes 2008). The glosses are in French, except for the Adamorobe Sign Language corpus, which is glossed in English and Akan. Lexical databases with phonological coding are available for all corpora except the second Malian SL corpus. All corpora are stored at the Endangered Languages Archive in London, the DoBeS archive in Nijmegen, or both.


Comparison of these corpora would open a unique window on the type, distribution and degree of lexical variation found within and across SLs. Analysis of cross-linguistic variation of SL lexica as documented in corpora is crucial for historical-comparative studies of SLs as well as for understanding contemporary patterns of variation. Indeed, our understanding of how SLs cluster in language families is still mainly based on historical information and urgently needs to be informed by language-internal evidence. The West African SLs in the corpora provide a unique opportunity for studying how related and unrelated SLs compare lexically. Most SLs studied so far are used by communities that have been in contact with each other, and are either related or have a shared history of contact. The set of corpora, by contrast, includes various village SLs that have evolved in communities with a high incidence of hereditary deafness. The absence of a shared history between these village SLs enables us to contrast variation between related and unrelated SLs that is not influenced by contact.

Over the past decade, corpus projects documenting SLs have started in various countries. The cross-linguistic comparison of these corpora is complicated for various reasons, one of them being the lack of a shared orthography for SLs. Instead of an annotation system with components representing the main formal components of a sign, ID glosses are typically used. An ID gloss is a uniquely identifying spoken language word (written in capitals) that by definition refers to a particular sign form, e.g. the ID gloss ENTER for the sign in figure 1. ENTER in fact has multiple meanings, including ‘go in’ and ‘put in’.

Figure 1 ENTER (Adamorobe Sign Language, Ghana [Nyst, 2007])

Currently, measuring variation across corpora is complicated by the incompatibility of the glosses used. ID gloss databases are set up separately for each SL corpus. The glosses draw from different spoken languages, and there is no one-to-one relation between glosses and the meaning(s) of the signs. New methods for measuring variation in and across SL corpora need to be explored.
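To make the harmonization problem concrete, the following minimal Python sketch shows one way a shared ID gloss table could link corpus-specific glosses in different spoken languages to a single shared gloss with several meanings. It is an illustration under assumptions, not the project's design: the class names, the corpus labels ("AdaSL", "LSM-I") and the French gloss ENTRER are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SharedGloss:
    """One entry in a hypothetical cross-corpus ID gloss table."""
    shared_id: str                                   # harmonized ID gloss
    meanings: List[str] = field(default_factory=list)
    # corpus label -> gloss used in that corpus's own annotation language
    local_glosses: Dict[str, str] = field(default_factory=dict)


class GlossTable:
    """Looks up the shared ID gloss for a corpus-specific gloss."""

    def __init__(self) -> None:
        self._by_shared: Dict[str, SharedGloss] = {}
        self._by_local: Dict[Tuple[str, str], str] = {}

    def add(self, entry: SharedGloss) -> None:
        self._by_shared[entry.shared_id] = entry
        for corpus, gloss in entry.local_glosses.items():
            self._by_local[(corpus, gloss)] = entry.shared_id

    def harmonize(self, corpus: str, local_gloss: str) -> Optional[str]:
        """Return the shared ID gloss, or None if no mapping exists yet."""
        return self._by_local.get((corpus, local_gloss))

    def meanings(self, shared_id: str) -> List[str]:
        return list(self._by_shared[shared_id].meanings)


# Invented example data: corpus labels and the French gloss are assumptions.
table = GlossTable()
table.add(SharedGloss(
    shared_id="ENTER",
    meanings=["go in", "put in"],
    local_glosses={"AdaSL": "ENTER", "LSM-I": "ENTRER"},
))

print(table.harmonize("LSM-I", "ENTRER"))
print(table.meanings("ENTER"))
```

Keeping the meanings on the shared entry rather than on the local glosses reflects the point that one gloss may cover several meanings; a gloss without a mapping simply returns None and can be queued for manual review.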


To enable cross-linguistic comparison of SL corpus data, new methods and tools need to be developed. A project in which a PhD student in Data Science collaborates with the PI and a deaf research assistant would be ideal for this purpose. The set of Leiden SL corpora can be used for the purpose of developing and testing the tools and methods explored in this project.

PhD project:

1 Set up a database with shared ID glosses. The project will probably use SignBank to store and manage a unified set of ID glosses for the four corpora (Cormier et al. 2012). This format is gaining in popularity and is currently used for Australian SL, British SL and Dutch SL.

2 Automated entry generation & phonological coding. Ensuring consistent use of ID glosses is extremely labour-intensive, but it is a precondition for reliable quantitative analyses of SL corpora. To achieve this consistency, a lexical database is needed to keep track of signs, their variants, and their ID glosses. Each entry consists of a video clip, a gloss, and phonological coding of the form of the sign. Partial automation of this process will reduce the exorbitant time investment required for the creation of ID gloss repertoires. Various components of the process may be automated, including the generation of entries and the generation of suggestions for matching candidate glosses. A promising improvement that will be explored is automated phonological coding on the basis of sign input captured with a 3D camera (such as Microsoft® Kinect). Kinect can automatically detect the position and orientation of 25 joints (including the thumb) as well as facial expressions, and a small number of studies have experimented with the use of this device for automated SL recognition (Halim & Abbas 2015).

3 Automated analysis & visualization of variation. Once a (pilot) set of glosses has been harmonized across the corpora, various tools for measuring and modelling variation will be explored. This will include testing the usefulness and applicability of GabMap, an online tool for measuring and mapping dialectal and other variation in spoken languages (Nerbonne et al. 2011).
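The automated suggestion of matching candidate glosses mentioned under step 2 could, for instance, compare the phonological coding of a new sign with the entries already in the lexical database. The Python sketch below is one possible approach under assumptions: the four parameters (handshape, location, movement, orientation), their values, the example entries and the threshold are all invented for illustration and do not reflect the coding scheme of the Leiden corpora.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """A lexical database entry with a simplified, hypothetical coding."""
    gloss: str
    handshape: str
    location: str
    movement: str
    orientation: str


PARAMS = ("handshape", "location", "movement", "orientation")


def similarity(a: Entry, b: Entry) -> float:
    """Proportion of phonological parameters with identical values."""
    same = sum(getattr(a, p) == getattr(b, p) for p in PARAMS)
    return same / len(PARAMS)


def suggest(new: Entry, lexicon: List[Entry], threshold: float = 0.75) -> List[str]:
    """Return glosses of existing entries at or above the threshold,
    most similar first (ties broken alphabetically)."""
    scored = [(similarity(new, e), e.gloss) for e in lexicon]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gloss for score, gloss in scored if score >= threshold]


# Invented example entries; the parameter values are placeholders.
lexicon = [
    Entry("ENTER", "B", "neutral", "forward", "palm-down"),
    Entry("GIVE", "flat-O", "neutral", "forward", "palm-up"),
    Entry("EAT", "flat-O", "mouth", "repeated", "palm-in"),
]
unglossed = Entry("?", "B", "neutral", "forward", "palm-side")

print(suggest(unglossed, lexicon))
```

A human annotator would still confirm or reject each suggestion; the sketch only narrows down the candidates that need to be inspected.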


The area of SL corpus linguistics is growing rapidly. The outcomes of this project will open up new possibilities for analysing variation in and across these corpora. The tools and methods explored and developed in the proposed project will provide a major step forward in the historical-comparative analysis of SLs, an area in which SL linguistics lags strikingly behind spoken language linguistics.

Ideally, the results of this research project will be made available to the SL linguistics community in the form of user-friendly tools for assessing variation and carrying out historical-comparative analyses of SLs. One important output is a tool that facilitates the compilation of a database of ID glosses for SLs for which no collection of ID glosses exists prior to the compilation of a discourse corpus. Another is an add-on that allows SL researchers to apply user-friendly tools developed for measuring spoken language variation (like GabMap) to SL data.
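As a rough illustration of what measuring lexical variation across SL varieties could involve, the Python sketch below computes an aggregate distance between varieties from phonologically coded signs for shared concepts. It is not GabMap's own method or interface: the per-sign measure (proportion of differing parameters), the variety names and all coded values are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, Tuple

Sign = Tuple[str, str, str]          # (handshape, location, movement)
Variety = Dict[str, Sign]            # concept -> coded sign

# Invented example data for three hypothetical varieties.
data: Dict[str, Variety] = {
    "VarietyA": {"ENTER": ("B", "neutral", "forward"),
                 "EAT": ("flat-O", "mouth", "repeated")},
    "VarietyB": {"ENTER": ("B", "neutral", "forward"),
                 "EAT": ("5", "mouth", "repeated")},
    "VarietyC": {"ENTER": ("1", "chest", "circular"),
                 "EAT": ("flat-O", "mouth", "repeated")},
}


def sign_distance(a: Sign, b: Sign) -> float:
    """Proportion of parameters whose values differ (0 = identical form)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def variety_distance(va: Variety, vb: Variety) -> float:
    """Mean sign distance over the concepts attested in both varieties."""
    shared = va.keys() & vb.keys()
    if not shared:
        raise ValueError("no shared concepts to compare")
    return sum(sign_distance(va[c], vb[c]) for c in shared) / len(shared)


def distance_matrix(varieties: Dict[str, Variety]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances, keyed by alphabetically ordered variety pairs."""
    return {
        (a, b): variety_distance(varieties[a], varieties[b])
        for a, b in combinations(sorted(varieties), 2)
    }


matrix = distance_matrix(data)
for pair, dist in matrix.items():
    print(pair, round(dist, 3))
```

A pairwise matrix of this kind is the sort of input that clustering or mapping tools typically work from; whether it can be passed to GabMap directly is one of the questions the project sets out to explore.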

Workplan & Supervision

Year 1

  • Acquire a basic knowledge of SL structure (phonology, morphology, and lexical semantics).
  • Become familiar with ELAN (the software used for annotation of the Leiden SL corpora).
  • Inventory the possibilities and limitations of lexical databases used with SL corpora, including SignBank.
  • Write paper I: identifying challenges and opportunities for applying new methods and tools to lexical databases of SLs.

Year 2

  • Design software (lexical ontology/relational database) for the storage and management of standardized glosses across the corpora (Standardized Glosses Tool).
  • Test the tool with a representative group of users.
  • Improve the Standardized Glosses Tool on the basis of feedback of the test group.
  • Write paper II on the development of the Standardized Glosses Tool and the feedback study.

Year 3

  • Explore the possibilities for automated entry generation plus automated phonological encoding in the Standardized Glosses Tool with a Kinect sensor.
  • If time allows, explore the possibilities for automated entry generation on the basis of still images and video clips.
  • Design and test tools for automated entry generation.
  • Write paper III on automated entry generation for lexical databases.

Year 4

  • Write paper IV on automated phonological encoding.
  • Explore the possibilities of using an existing, user-friendly tool for the automated analysis & visualization of variation (e.g. GabMap [Nerbonne et al. 2011]) on the data in the Standardized Glosses Tool.
  • Write paper V on analysing and visualizing geographic and other variation in lexical signs in and across SLs.
  • Merge papers into a PhD thesis. 

The student will meet his/her supervisor biweekly to discuss progress. The student will be encouraged to spend a semester abroad to learn new skills and enlarge his/her network. The results of this project will be reported regularly at scientific meetings, including the SL workshops of the LREC conferences and TISLR, the largest SL conference.


Cormier, K., Fenlon, J., Johnston, T., Rentelis, R., Schembri, A., Rowley, K., ... & Woll, B. (2012, May). From corpus to lexical database to online dictionary: Issues in annotation of the BSL Corpus and the development of BSL SignBank. In 5th Workshop on the Representation of Sign Languages: Interactions between Corpus and Lexicon [workshop part of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey]. Paris: ELRA. pp. 7–12.

Crasborn, O., & Sloetjes, H. (2008). Enhanced ELAN functionality for sign language corpora. In Proceedings of LREC 2008, Sixth International Conference on Language Resources and Evaluation.

Halim, Z., & Abbas, G. (2015). A Kinect-based sign language hand gesture recognition system for hearing- and speech-impaired: a pilot study of Pakistani sign language. Assistive Technology, 27(1), 34-43.

Nerbonne, J., Colen, R., Gooskens, C., Kleiweg, P., & Leinonen, T. (2011). Gabmap: a web application for dialectology. Dialectologia: revista electrònica, 65-89.

Nyst V., Magassouba M.M. & Sylla K. (2011) Un Corpus de Reference de la Langue des Signes Malienne I. A digital, annotated video corpus of the local sign language used in Bamako and Mopti, Mali. Leiden University Centre for Linguistics, Universiteit Leiden.

Nyst V., Magassouba M.M. & Sylla K. (2012) Un Corpus de reference de la Langue des Signes Malienne II. A digital, annotated video corpus of local sign language use in the Dogon area of Mali. Leiden University Centre for Linguistics, Universiteit Leiden.

Nyst V. (2012) A Reference Corpus of Adamorobe Sign Language. A digital, annotated video corpus of the sign language used in the village of Adamorobe, Ghana. Leiden University Centre for Linguistics, Universiteit Leiden.

Tano, A. (2014) Un corpus de référence de la Langue des Signes de Bouakako (LaSiBo). Leiden University Centre for Linguistics, Universiteit Leiden.