Making everything we know computer-readable
Data and information should be stored in a way that computers can understand, says Barend Mons, professor of Biosemantics at the Leiden University Medical Center and Chair of the High Level Expert Group for the European Open Science Cloud. We speak with him about FAIR data, knowlets and nanopublications.
Barend Mons started his career as a molecular biologist. After years of research on malaria, he switched to bioinformatics some fifteen years ago. His new mission: mapping all the knowledge and data we have, and storing it in a computer-friendly format. This way, he argues, computers will be able to discover links and patterns by themselves. Achieving this requires not only smart software and technology, but also a different mindset within academia.
Computers do not understand text well. Why is that?
‘Language is a nightmare for computers. In science - just like everywhere else - we use a variety of terms to describe the same concept. There are many different languages, plus we like to use synonyms and figures of speech. The human brain can deal with all these different terms; we can, for instance, understand that malaria and paludism are the same disease. But the computer sees no connection between the two, without instructions. When I realised that years ago, I started working in the field of biosemantics: a discipline that tries to map all the terms and concepts we use in biology, and how they are related, in a way that is understandable to computers. Because once we have a system in which all knowledge is stored that way, a Semantic Web, science can start using computers in much more advanced ways.’
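The synonym problem described above can be illustrated with a minimal Python sketch. It assumes a small hand-made lookup table; the identifiers such as `concept:0001` and the function names are hypothetical and do not come from any real vocabulary or from Mons's own system. The idea is only that different terms resolve to one shared concept identifier, which is the instruction the computer otherwise lacks.

```python
# Hypothetical concept identifiers; a real system would rely on a curated vocabulary.
TERM_TO_CONCEPT = {
    "malaria": "concept:0001",
    "paludism": "concept:0001",  # synonym: same identifier as "malaria"
    "mosquito": "concept:0002",
}


def concept_for(term):
    """Return the concept identifier for a term, or None if the term is unknown."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


def same_concept(term_a, term_b):
    """True only if both terms are known and map to the same concept."""
    a = concept_for(term_a)
    b = concept_for(term_b)
    return a is not None and a == b


print(same_concept("Malaria", "paludism"))
print(same_concept("malaria", "mosquito"))
```

Without the table, the two strings "malaria" and "paludism" have nothing in common as far as the program is concerned; with it, the link is explicit.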
What does it take to make knowledge accessible to computers?
‘The most important thing is that we learn to deal with data in a different way. You can first publish an article and then convert it to a format that the computer can work with, but it's much more efficient to publish in a computer-readable format from the start. Also, it is very important that data can be reused by others. To that end, we have developed the FAIR principles: all scientific data should be Findable, Accessible, Interoperable and Reusable by people and computers. The essence of FAIR is: never refer, not even in a single cell of your database, to something that the computer cannot understand.’
You and your research group have played an important role in the development of nanopublications. What are those?
‘A nanopublication is the smallest unit of publishable information: a simple, computer-readable assertion in the form of 'a does something to b'. For example, the phrase ‘malaria is transmitted by mosquitoes’. Such statements can be treated as full publications if we simply add information about their provenance: author, date, peer reviewed or not, and so on.’
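As a rough illustration of that structure, the following Python sketch models a nanopublication as an assertion plus provenance. The class and field names are assumptions made for this example, and the author and date values are placeholders; real nanopublications are typically expressed in RDF rather than as Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    """A single statement of the form 'a does something to b'."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopublication:
    """An assertion together with the provenance that makes it citable."""
    assertion: Assertion
    author: str
    date: str  # ISO 8601 date string
    peer_reviewed: bool
    source: Optional[str] = None  # reference to the underlying article, if any

    def as_sentence(self) -> str:
        a = self.assertion
        return f"{a.subject} {a.predicate} {a.obj}"


nanopub = Nanopublication(
    assertion=Assertion("malaria", "is transmitted by", "mosquitoes"),
    author="Example Author",  # placeholder, not a real attribution
    date="2016-01-01",        # placeholder date
    peer_reviewed=False,
)
print(nanopub.as_sentence())
```

Keeping the assertion separate from its provenance means the same statement can be recorded many times, each with its own author and date.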
And what can we do with these nanopublications?
‘Imagine that the entire Explicitome - that is, all the knowledge that is stored in databases and scientific articles - could be converted into nanopublications. Since the computer can read these, it will be able to organize them into a system of knowlets: concept clouds that contain all the assertions that have been made about a certain concept, with an added ‘weight’. We have already developed such a system for the life sciences. Every concept we talk about - every gene, every illness, every molecule - has its own knowlet. Therefore, we can see exactly how all these different concepts are related. With large amounts of data and knowledge stored in the system, we are starting to see connections that we did not know existed. Or, even better, the computer will see connections, and it will notify us. It can let us know, for example, that a particular gene and a certain disease seem to have something to do with each other. Then we can do further research on that.’
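A toy version of such a concept cloud can be sketched in Python. This sketch assumes that the 'weight' of a link is simply the number of assertions connecting two concepts; the interview does not say how the weight is actually computed, and "gene X" is a made-up placeholder.

```python
from collections import Counter, defaultdict


def build_knowlets(assertions):
    """Group (subject, predicate, object) assertions into one cloud per concept.

    Assumption: the weight of a link is the number of assertions
    that mention both concepts, regardless of direction.
    """
    knowlets = defaultdict(Counter)
    for subject, _predicate, obj in assertions:
        knowlets[subject][obj] += 1
        knowlets[obj][subject] += 1
    return knowlets


assertions = [
    ("malaria", "is transmitted by", "mosquitoes"),
    ("mosquitoes", "transmit", "malaria"),
    ("gene X", "is associated with", "malaria"),  # hypothetical gene name
]

knowlets = build_knowlets(assertions)
strongest = knowlets["malaria"].most_common(1)[0]
print(strongest)  # ('mosquitoes', 2)
```

With many more assertions, ranking the links in each cloud by weight is one simple way a program could surface a gene and a disease that keep being mentioned together.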
But capturing all the knowledge we have in such a system - that must be a lot of work?
‘Yes, for the life sciences we are currently talking about some three million knowlets and 10^14 nanopublications. We are not there yet, but we have solved the main fundamental problems by now. For example, we’ve developed a system that allows us to convert scientific articles into nanopublications. Using text mining, we let the computer scan articles and distill nanopublications from them. That way we extract knowledge from scientific publications, and we put all new information in a database that is accessible to everyone. We don’t even need Open Access for this process.’
Is it all completely legal?
‘Yes, because technically, a nanopublication is nothing but a citation. You read an article and quote a sentence from it, referring back to the original publication. The only difference is that we now store these citations in a computer-readable format. Of course it is frowned upon a little by the academic publishers, who are afraid to lose their firewall. But in the end, publishers also benefit from it: a nanopublication is a great advertisement for the underlying article. By now, publishers are starting to realise that, too.’
But what about the researchers themselves - are they ready to share their data?
‘If you ask researchers whether they are, 98% will say yes. But if you ask if they actually do it, only 60% will raise their hands. There are a few major hurdles for sharing data. Firstly, there is a lack of expertise: formatting data in a FAIR way is not very complicated, but you have to know how to do it. In the coming years we need to educate people to assist researchers with this. We also organize BYOD (Bring Your Own Data) workshops, where we help researchers make their own dataset FAIR. But perhaps the main obstacle to data sharing is that there is simply too little incentive for researchers.’
Because they are still judged on the articles that they publish?
‘Exactly. We still evaluate scientists on criteria from the nineteenth century - which really makes no sense anymore. Articles are no longer the centre of scientific communication: the extent to which a researcher produces and shares data is much more important nowadays. But some scientists want things to stay the way they were. In my view, researchers who still think they can stay ahead by keeping their data to themselves will be sidelined within a few years. They will no longer achieve any breakthroughs; all the low-hanging fruit has already been picked. Now, after fifteen years of fighting the establishment, I can see things are finally starting to change. In many countries, data standards are now based on the principles of FAIR.’
Will this also have an impact outside academia?
‘Yes. For example, some hospitals already have FAIR data stations, in which they use large amounts of data to improve their own care, or to analyse the effects of drugs. Together with a number of universities and academic hospitals, we are now translating this to the level of smartphones: the Personal Health Train. Within this system, your personal health data will be stored in a safe on your smartphone, in the form of FAIR data. If someone knocks and wants to look at your data - that could be a supermarket chain, but also your doctor, or a patient association - it is up to you to decide who gets access and who does not. This is a safer way to store data than the way it is currently done. It is also a great solution for the dreadful electronic patient record (EPD) we currently have in the Netherlands, for which hospitals use a variety of systems that cannot communicate with one another.’
What are the possibilities of this Personal Health Train?
‘For instance, your pharmacist can intervene, if your doctor has prescribed you a drug that does not combine well with a particular gene you have. Or a patient association can investigate the effects of a particular drug or diet, based on large amounts of data. The possibilities are endless. The supercomputer of the future is not a cathedral full of computers; it is seven billion smartphones.’
Barend Mons started his career as a molecular biologist and received his PhD from Leiden University. After years of research on malaria, he switched to science management. He worked at the Research Directorate of the European Commission and the Netherlands Organisation for Scientific Research. Afterwards, Mons returned to science; he worked at the Department of Medical Informatics at the Erasmus Medical Center, and has been professor of Biosemantics at Leiden University Medical Center since 2013. He is also Head of Node for ELIXIR-NL at the Dutch Techcentre for Life Sciences, and Integrator Life Sciences at the Netherlands eScience Center. Mons is the initiator of FAIR and was appointed Chair of the EU High Level Expert Group ‘European Open Science Cloud’ in 2015.
(JvdB)
This article is part of a series of interviews with researchers from the Leiden Centre of Data Science (LCDS). LCDS is a network of researchers from different scientific disciplines, who use innovative methods to deal with large amounts of data. Collaboration between these researchers leads to new solutions to problems in science and society.