Wikipedia is arguably the number one source of information online for the speakers of many languages. But the different language editions are not equally developed. The English edition is by far the largest and the most complete, and the other 280 language editions have far fewer articles.
The coverage of different language editions also doesn’t follow a standard template. Some language editions are heavier on politics, for instance, and some have more articles on science-related topics, so that even the topics that prove controversial differ from one language to another. Why does the coverage of different editions vary so much?
You might think it’s to do with the emphasis different cultures place on different subjects, or the ease of explaining a topic in a certain language. But new research has found a surprising pattern among the different editions of Wikipedia. It suggests the shape of the site’s growth is much more complex and is tied to the different communities of editors who build each edition.
A recent study, published in the journal Royal Society Open Science, analysed the patterns of some 15,000 article topics that have been covered in at least 26 language editions. The researchers looked at the chronological sequence of languages in which each article appeared and tried to mine patterns in the trajectory that an article follows from one language to another.
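To make the idea of a trajectory concrete, here is a minimal Python sketch. It is not the study's actual method, only an illustration under stated assumptions: the `first_appearance` data, the language codes and the helper names `language_sequence` and `precedence_counts` are all hypothetical. It orders the editions of each topic by the date the article first appeared, then counts how often one language precedes another across topics.

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Hypothetical data (invented for illustration): for each article topic,
# the date it first appeared in each language edition.
first_appearance = {
    "Topic A": {"en": date(2002, 3, 1), "de": date(2003, 6, 12), "fa": date(2007, 1, 5)},
    "Topic B": {"de": date(2004, 2, 9), "en": date(2004, 8, 30), "fa": date(2009, 11, 2)},
}


def language_sequence(dates_by_language):
    """Return language codes ordered by the date the article first appeared."""
    ordered = sorted(dates_by_language.items(), key=lambda item: item[1])
    return [language for language, _ in ordered]


def precedence_counts(sequences):
    """Count, over all topics, how often one language appears before another."""
    counts = Counter()
    for sequence in sequences:
        # combinations() preserves input order, so each pair is (earlier, later).
        for earlier, later in combinations(sequence, 2):
            counts[(earlier, later)] += 1
    return counts


sequences = {topic: language_sequence(d) for topic, d in first_appearance.items()}
counts = precedence_counts(sequences.values())
```

A table of pairwise counts like this is one simple way to summarise trajectories before looking for groups of languages that behave alike. The researchers may well have used a different representation.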
Using different computational techniques, they managed to cluster languages into groups that share similar coverage patterns. Among the 26 languages that the authors analysed, English, German, and Persian stand out and do not mix with any other groups of languages. But there are three more groups that remain mostly robust even when the authors change the algorithm they use for clustering.
Italian, Finnish, Portuguese, Russian, Norwegian, Mandarin and Danish stick together. Polish, Dutch, Spanish, Japanese, French, and Swedish cluster together. And finally, Indonesian, Turkish, Hungarian, Korean, Ukrainian, Czech, Arabic, Romanian, Bulgarian and Serbian show similar patterns.
What is surprising is that these groupings can’t simply be explained by language families, geographical closeness, or cultural similarities. The underlying factor seems to be more closely related to the characteristics of the community of editors of each language edition.
To test this systematically, the authors considered six factors for each language edition. These included the number of pages, the number of edits, the number of administrators and a measure of the content quality. The other two factors were the total number of active speakers of the language and the level of access they had to the Internet, measured using the international Digital Access Index ranking for the country in which the language is primarily spoken.
These six parameters partially explain the differences between different clusters, but the authors suggest that the clustering of the languages is driven by a more complex combination of socio-economic variables that can capture features such as the average Internet literacy in a country or the general attitude towards the importance of knowledge and education.
The results of this paper become more interesting when compared with earlier work that looked at the times of day at which most edits are made in each language edition. While Wikipedia is generally edited during the afternoon and early evening, some language editions are edited more in the morning and some later in the evening.
When you look at these groups of languages, there seem to be similar patterns. Unfortunately, the sets of languages studied in the two works are not the same, so a direct comparison is not possible.
What this research does is remind us how little we know about how information spreads on the Internet, what the patterns of the online information landscape are and, more importantly, what factors determine these patterns. The role of the Internet and the information resources it provides in the formation of our opinions, and in the decisions we make at the individual and societal level, is undeniable. Answering these questions might help us to achieve a more democratic and unbiased global information repository.