Universiteit Leiden


Student maps Chinese language variation

When Daan van Esch, master’s student in Chinese Studies, travelled through China last summer, he noticed that he often did not understand what the inhabitants of the different villages and cities were talking about. There turned out to be huge differences within the language. He decided to map this linguistic diversity.

Twitter Corpus

‘I had of course read that they speak differently in every little village, but it took this trip for me to notice how true that is,’ says Van Esch. ‘In the US there is a Twitter Corpus which records dialect words and their places of origin. So I thought: I can do that too!’ The result is the Leiden Weibo Corpus. Weibo is the Chinese variant of Twitter; 300 million Chinese have a Weibo account. A corpus is a large collection of linguistic material which is structured in such a way that users can easily search it.

Map of the Weibo Corpus. The dots are places where tweets containing the word 'key' were sent.


Five million messages

Van Esch would have liked to map the differences between the Chinese villages, but such a project would be time-consuming and would also require an enormous amount of fieldwork. Instead he downloaded more than five million messages from Weibo. Because each message also carried information on the place from which it was posted, it is possible to see which words appear in which parts of China.
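The idea of linking words to places can be illustrated with a minimal Python sketch. It is not Van Esch's own software: the function name, the sample messages and the region labels are all invented for illustration, and plain substring matching stands in for proper Chinese text analysis.

```python
from collections import Counter

def count_word_by_region(messages, word):
    """Count, per region, how many messages contain `word`.

    `messages` is an iterable of (region, text) pairs. This layout is an
    assumption made for the sketch, not the format of the Weibo data.
    """
    counts = Counter()
    for region, text in messages:
        # Simplification: substring matching, no word segmentation.
        if word in text:
            counts[region] += 1
    return counts

# Invented sample data: two different words for 'key'.
sample = [
    ("Beijing", "我的钥匙丢了"),
    ("Guangdong", "锁匙在哪里"),
    ("Beijing", "今天下雪了"),
    ("Guangdong", "找不到锁匙"),
]

print(count_word_by_region(sample, "钥匙"))
print(count_word_by_region(sample, "锁匙"))
```

With real data, the per-region counts would need to be weighed against the total number of messages from each region before being plotted as dots on a map.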


Van Esch wrote the software used to analyse the messages himself. He taught himself how to do this. Once the software was ready, analysing the millions of messages took only twelve hours. This was possible because Van Esch could rent a super-fast server through the cloud, a way of using other computers over the internet. The investment cost him a mere 8 dollars and 21 cents, but saved him a lot of time. The software recognises grammatical patterns in Chinese and plots on a map the places from which the messages were posted.

Snow, love and insomnia

Van Esch finds it remarkable that the media focus so much on political messages on social networks such as Twitter and Weibo, while the messages are actually primarily about daily life. Many Chinese people talk about the snow (the messages were collected in January, when there was a big snowstorm), love and the fact that they cannot sleep.

Two thousand visitors

His corpus is already being used by other students and researchers in an NWO research project being conducted by the Leiden University Centre for Linguistics. After an e-mail was sent to various media, the website had two thousand visitors within the first month.

And now? On to a PhD?

Van Esch does not yet know what he wants to do after his Master’s. He only recently submitted his MA thesis and is still waiting for the assessment. ‘I would like to continue with this work and do a PhD, but I don’t know yet where and how. First a holiday!’

(6 June 2012 / Nelleke Groot, student Journalism and New Media)
