How biased information on the internet can influence research

06 March 2019

The internet is a good place to collect large amounts of data quickly. But if you are not careful, the data you collect will be heavily biased and one-sided. That was the warning from the speakers at a symposium on transparency and responsible data science on 5 March.

Chilean computer scientist Ricardo Baeza-Yates explained the different types of bias that can influence research results. A person’s culture, region of the world, level of education and age can make a big difference, said Baeza-Yates, a professor at Northeastern University in the United States. One example: more than half of all websites are in English. If you only select data from English-language websites, you immediately exclude a large number of languages and cultures.

Speaker Ricardo Baeza-Yates. The symposium was held in the PLNT innovation centre in Leiden.

Small group provides information

The same applies when collecting data from tweets, he said. The majority of Twitter users are highly educated men in Western countries. What is more, a small group of users tends to generate over half of the content. For Twitter, this is about 2% of users and for Facebook about 7%. Online reviews and Wikipedia entries are also mainly written by white Western men who have time to do this, said Baeza-Yates: ‘Women: I know you’re busy, but please write more on Wikipedia.’
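For researchers who want to check how concentrated their own dataset is, the idea can be sketched in a few lines of Python. This is a minimal illustration, not a method presented at the symposium: the function name `share_of_users_for_half` and the example counts are hypothetical, and the input is assumed to be a mapping from user to number of posts.

```python
def share_of_users_for_half(posts_by_user):
    """Return the smallest fraction of users who together wrote
    at least half of all posts (hypothetical helper)."""
    if not posts_by_user:
        raise ValueError("posts_by_user must not be empty")

    # Most active users first, so we reach half the content with as few users as possible.
    counts = sorted(posts_by_user.values(), reverse=True)
    total = sum(counts)

    running = 0
    for n_users, count in enumerate(counts, start=1):
        running += count
        if running * 2 >= total:
            return n_users / len(counts)
    return 1.0


# Made-up example data: one very active user and four occasional ones.
example = {"user_a": 50, "user_b": 20, "user_c": 10, "user_d": 10, "user_e": 10}
print(share_of_users_for_half(example))
```

A low value means that a small group dominates the data, which is a signal to look at who those users are before drawing conclusions about everyone else.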

The symposium attracted lots of students and researchers.

Filter bubble

If you look something up on the internet, regardless of whether you are a researcher or a consumer, you should therefore take a very critical look at your sources. The internet is teeming with fake news and fake reviews. Not to mention the filter bubble: search engines sort information based on your previous click behaviour, so you only get to see a selection of the available information. Baeza-Yates also referred to the built-in limitations of automatic suggestions such as tags. ‘People are lazy and choose the suggested tags. This works like a self-fulfilling prophecy and reinforces the filter bubble.’

Be aware of your own bias

Baeza-Yates ended with a word of caution to the students and researchers: make sure your research material is diverse. Don’t just follow the quick and easy path on the internet, but dare to leave the beaten track and explore less obvious sources too. ‘And above all, be aware of your own bias. As a researcher, you begin by definition with a bias because your work is based on assumptions.’

Machine learning

Mireille Hildebrandt, Professor of Interfacing Law and Technology at Vrije Universiteit Brussel, warned about the legal implications of machine learning. Researchers are making increasing use of self-learning models that can identify patterns in big data. These patterns are used to develop a predictive algorithm, a mathematical model, and this algorithm can learn and discover new patterns in new data. You end up with a mountain of research data that could be used for many different things. From a legal perspective, however, it often can’t be, for instance because that would not be fair to respondents who did not consent to all forms of research. Hildebrandt warned against such ‘sloppy’ uses of research data.

Report changes

From the point of view of the law, research data can only be used for the primary research aim. Researchers who use these self-learning models should register this research method and report any changes to the research. ‘Always take into account that changes to the research can have both technical and legal consequences. Report any changes. That creates more integrity in the research,’ said Hildebrandt.

Blood donors

During the break, the participants reflected on their own research. Marieke Vinkenoog, a PhD candidate in the Data Science Research Programme, is working with data from the Sanquin blood bank. She is aware that her research population is not completely representative of the population as a whole: blood donors are usually highly educated white women. Manon Wintgens, a PhD candidate in Tax Law, emphasised the importance of a solid research plan and clear agreements with respondents. ‘Otherwise, you discover too late that you can’t publish certain data at all.’

Photos: Monique Shaw