SAILS Lunch Time Seminar
- Monday 4 April 2022
- Online only
Can BERT Dig It? - Named Entity Recognition for Information Retrieval in the Archaeology Domain
The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (~658 Million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts, play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities.
In this talk, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF).
We find that ArcheoBERTje outperforms both the multilingual and Dutch model significantly with a smaller standard deviation between runs, reaching an average F1 score of 0.735.
Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model’s quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary. At the end of the presentation, a short demonstration of the entity search system is given.
The SAILS Lunch Time Seminar is an online event, but it is not publicly accessible in real-time. If you would like to join this seminar, please send an email to firstname.lastname@example.org to receive a link.