Leiden hosts first knowledge café on text and data mining
Text and data mining involves searching and analysing large quantities of texts, images or other data. On 29 February, during a Knowledge Café in the Academy Building, visitors discussed why this method isn't being widely applied in Europe.
The Knowledge Café was organised by FutureTDM, a project on text and data mining financed by the European Union. This was the first in a series of meetings to be held throughout this year. The Leiden Centre of Data Science (LCDS) co-organised this first event, which was attended by some 20 scientists and representatives of academic libraries.
LCDS Director Jaap van den Herik opened the discussion. He started by explaining that text and data mining is interesting for scientists from all kinds of disciplines, but it is not widely understood. As an example of what the technology involves, he mentioned text analysis. 'That can be at the level of words and sentences, but you can also have a computer search for themes.' An example is the Biblical theme of the flood. 'Non-Christian books from the same period as the Bible also mention the flood, but the remarkable thing is that if you search in other texts for the birth of Christ, you don't find any mention of it anywhere other than in the Bible. Text mining helps you explore the context of different phenomena.'
Huge amounts of data
Text mining is ideally suited for studying even larger amounts of data than this example, Van den Herik explained. 'Its strength lies above all in the combination of different types of data.'
Biodiversity and Ebola
At the LCDS, mathematicians and information technologists work on techniques that make this kind of research possible. Physicists and astronomers are already conducting big-data research and there are many collaborations with scientists from different disciplines, such as the joint project with Naturalis on biodiversity protection. And Leiden scientists also helped chart the spread of Ebola in 2014 on the basis of big data, including data submitted via mobile telephones.
Europe lagging behind
Text and data mining is currently huge, according to Susan Reilly, chairman of the organisation of European academic libraries LIBER and one of the initiators of FutureTDM. 'It has really taken off in the United States and Japan, but Europe is lagging behind. We want to find out why that is and propose some solutions to bring Europe up to date.'
One of the obstacles is clear, Reilly explained: copyright laws are much stricter in Europe than in the US. Scientists often can't publish the results of their data-mining research because the sources on which they are based are protected by copyright. Reilly: 'The European Commission has spoken out in favour of text and data mining and there is the possibility of a legal exception to the copyright laws in order to facilitate its use.' An important question that still has to be answered is who would be permitted to make use of this exception: only scientists, or also commercial parties such as pharmaceutical companies?
These were some of the issues discussed by those attending the Knowledge Café. Van den Herik also pointed out a series of obstacles at the University that get in the way of the use of this technology. 'Researchers don't understand it well enough; they need to learn more about it. We don't have enough high-performance computing equipment at the University. And many collections are not publicly available; the raw data first has to be made accessible.' But, he added, this first FutureTDM meeting did not take place in Leiden by chance. 'The Netherlands is ahead of the field in big data research within Europe.'