Universiteit Leiden

Lecture | LUCL Colloquium

How extensive is a grammar? Explorations in measuring grammatical descriptions

Date
Tuesday 14 October 2025
Time
Series
LUCL Colloquium - Series '25 - '26
Location
Lipsius
Cleveringaplaats 1
2311 BD Leiden
Room
2.27

Abstract

Of the 7 000 languages on our planet, some (such as Finnish or English) are described in great detail in extensive grammatical descriptions, some in a smaller number of shorter publications (e.g., Betoi, an extinct language of Venezuela), and yet others hardly at all (e.g., Mor, a minority language of Papua, Indonesia). At the same time, the languages of the world are endangered to various degrees (see, e.g., Moseley 2010). To prioritize language documentation well, knowledge of the extant documentation for every single language is crucial (Hauk & Heaton 2018).

Perhaps surprisingly, no full survey has existed until now, but thanks to Glottolog (glottolog.org) there is now a sufficiently comprehensive bibliography for lesser-known languages. One limitation remains: the bibliography relies on conventional classes of description (grammar, dictionary, etc.), which only weakly reflect the full continuum of language description.


A naive method of improvement, which is still "better than nothing", is simply to count the number of pages of the corresponding publication. One of several important drawbacks is that page counts are not additive: the combined descriptive content of two different but similar books is less than the sum of their page counts, since they overlap to some extent. In a large project on digital language descriptions we have access to a full-text database of some 30 000 publications spanning over 6 000 languages, written in over 50 different (meta-)languages. Thanks to the full-text access we can improve on the page-count estimate by counting the terms that relate to language description (e.g., suffix, imperative, plural) in a way that implements additivity. Furthermore, cited references can be semi-automatically extracted and used as a measure of how well integrated the analysis is with the relevant literature available at the time. In this presentation we provide empirical results on which automated measurements most resemble human judgements of the "extent" of language description of the same documents.
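The contrast between page counts and term-based counting can be illustrated with a small sketch. The abstract does not say how additivity is implemented, so this is only one possible reading, under the assumption that each distinct descriptive term is counted once across a set of publications. The term list, the function names, and the two example sentences are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical vocabulary: the abstract names only these three example terms.
DESCRIPTIVE_TERMS = {"suffix", "imperative", "plural"}


def descriptive_terms(text, vocabulary=DESCRIPTIVE_TERMS):
    """Return the set of vocabulary terms that occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens & vocabulary


def extent(documents, vocabulary=DESCRIPTIVE_TERMS):
    """Count the distinct descriptive terms covered by a collection of documents.

    Assumption: overlapping material is counted once, via a set union,
    so the score of two similar books is not simply the sum of their scores.
    """
    covered = set()
    for text in documents:
        covered |= descriptive_terms(text, vocabulary)
    return len(covered)


# Two invented snippets that overlap on the term "suffix".
book_a = "The plural suffix -t attaches to nouns."
book_b = "The imperative has no suffix."

print(extent([book_a]), extent([book_b]), extent([book_a, book_b]))
```

In this toy case each snippet covers two terms, but together they cover three, because the shared term is not counted twice. A page count would instead add the two lengths regardless of overlap.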

References

  • Hauk, Bryn and Raina Heaton. 2018. Triage: Setting Priorities for Endangered Language Research. In Lyle Campbell and Anna Belew (eds.), Cataloguing the World's Endangered Languages, 259-304. London: Routledge.
  • Moseley, Christopher. 2010. Atlas of the World's Languages in Danger. 3rd edn. Paris: UNESCO Publishing.