Converting cultural heritage into usable data
How can we make the information in handwritten historical research reports accessible and searchable? Data scientists at Leiden University are collaborating with other universities on a method for improving access to cultural heritage.
Between 1820 and 1850, eighteen explorers from the Natural Sciences Commission for the Dutch East Indies travelled through the Indonesian archipelago. During their expeditions, they studied the exotic flora and fauna. Their reports, totalling around 17,000 richly illustrated pages, are held by the Naturalis Biodiversity Centre. The collection gives a magnificent picture of the biodiversity in that region at the beginning of the 19th century.
The pages of the reports have now been scanned and are digitally accessible, but simply googling them by place name or animal species is not yet an option. The research project ‘Making Sense of Illustrated Handwritten Archives’ aims to change this. Once the archives have been converted into searchable and analysable data, other researchers will be able to cast new light on a wide range of historical and biological questions. Alongside Leiden University, the participants in this project are Naturalis Biodiversity Centre, the University of Twente, the University of Groningen and the publisher Brill.
Data patterns in a jumble of images
The main task of the researchers in Leiden, Twente and Groningen is to train the computer to distinguish between the different kinds of information in the historical documents. Human beings can see at a glance the difference between an illustration and a handwritten sentence. For an untrained computer, on the other hand, a photograph of a logbook page is just one big jumble of images.
The researchers in the project are using the Monk handwriting recognition programme, which was developed in Groningen, but handwriting recognition alone is not enough.
BioSemantics data scientist Katy Wolstencroft and her colleagues are working on an algorithm that can identify the different parts of a layout on a scanned page: what is the table of contents, where is the name of an animal species, and where is its description? Once this programme can recognise these semantics, it will be possible to obtain interlinked data from the report: an illustration of a bat can then be combined with, for example, its name, the location where it was found and the description of its external appearance.
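To make the idea of interlinked data concrete, here is a minimal sketch in Python. It is not the project's own software: the class names, the region labels, the file name and the sample values are all invented for illustration, and it makes the simplifying assumption that each page describes a single species.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Region labels a layout-recognition step might assign (hypothetical names).
LINKABLE_KINDS = {"species_name", "location", "description", "illustration"}


@dataclass
class Region:
    """One labelled area of a scanned page."""
    page: int
    kind: str      # e.g. "species_name" or "illustration"
    content: str   # transcribed text, or a reference to an image crop


@dataclass
class SpeciesRecord:
    """The linked information about one species on one page."""
    page: int
    species_name: Optional[str] = None
    location: Optional[str] = None
    description: Optional[str] = None
    illustration: Optional[str] = None


def link_regions(regions: List[Region]) -> Dict[int, SpeciesRecord]:
    """Group labelled regions by page into one record per page.

    Assumes one species per page; regions with other labels are skipped.
    """
    records: Dict[int, SpeciesRecord] = {}
    for region in regions:
        if region.kind not in LINKABLE_KINDS:
            continue
        record = records.setdefault(region.page, SpeciesRecord(page=region.page))
        setattr(record, region.kind, region.content)
    return records


# Invented sample data, not taken from the actual reports.
regions = [
    Region(page=12, kind="illustration", content="page_0012_fig1.png"),
    Region(page=12, kind="species_name", content="Pteropus sp."),
    Region(page=12, kind="location", content="Java"),
    Region(page=12, kind="description", content="Large bat with dark wings."),
    Region(page=12, kind="page_number", content="12"),
]
records = link_regions(regions)
print(records[12])
```

With records like these, a search for a place name such as Java would return the illustration, the species name and the description together, rather than a loose page image.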
This wealth of data will enable biologists to research the different species of bats that lived on Java in the 19th century, and to compare them with present-day bat species. The comparison will give them insight into how the species have evolved, and may even result in the discovery of new species.
Before that stage is reached, however, all kinds of problems need to be solved. ‘The data are of an extremely heterogeneous nature,’ explains Wolstencroft. ‘The reports contain words in different languages: German, Latin, Greek, Dutch, French and Malay. Place names change throughout history, and sometimes new authors added information to a report later.’ It’s not easy to develop a programme that understands such nuances, and leaves them intact.
The content of the expedition reports will ultimately be linked to the species archives of Naturalis, which will undoubtedly lead to new and valuable insights for historians and biologists. But that is not the only aim of the project. ‘We’re developing a generic method for processing historical documents,’ says Wolstencroft. ‘It can also be applied to other collections. In the end, it’s all about being able to share data.’