Lecture | Sociolinguistics series
MacBERTh & GysBERT meet socio-linguistics: using machine learning to automate annotation and analysis in historical corpora
- Friday 14 October 2022
- LUCL Sociolinguistics Series 2022/2023
2311 BD Leiden
In this talk, I will demonstrate how contextualized embeddings, a type of compressed, token-based semantic vector, can be used as annotation and research tools. More specifically, I will focus on the Bidirectional Encoder Representations from Transformers model, better known as ‘BERT’ (Devlin et al. 2019).
Originally, BERT was set up for Present-day English, having been pre-trained on approximately 3.3 billion words of Present-day English Wikipedia and BooksCorpus data. Yet researchers who interpret and analyse historical textual material are well aware that textual and linguistic material from the past should not be approached from a present-day point of view. Hence, NLP models pre-trained on present-day language data are less than ideal candidates for the job. For the case studies presented in this talk, we use two variants of BERT: MacBERTh (Manjavacas and Fonteyn, 2021, 2022), which has been pre-trained on approximately 3.9B (tokenized) words of historical English (time span: 1450-1950), and GysBERT, which has been pre-trained on 7.1B (tokenized) words of historical Dutch (time span: 1500-1950).
These models will be put into action in two different but thematically related case studies in historical socio-linguistics, both concerned with individual-level language variation. The first case study, which focusses on variation and change in the use of English ing-forms by Early Modern English individuals, demonstrates how the models can be used to automate grammatical annotation. The second case study demonstrates how contextualized embeddings can be integrated into lexical diversity measures, so that we can consider not only the ‘vocabulary richness’ but also the ‘semantic richness’ of texts produced in different genres and by different authors.