The curious patterns of Wikipedia growth

Wikipedia is arguably the number one source of information online for the speakers of many languages. But not all the different language editions are developed equally. The English edition is by far the largest and the most complete one, and the other 280 language editions have many fewer articles.

The coverage of different language editions also doesn’t follow a standard template. Some language editions are heavier on politics, for instance, and some have more articles on science related topics, leading to even different populations of controversial topics in different languages. Why does the coverage of different editions vary so much?

You might think it’s to do with the emphasis different cultures place on different subjects, or the ease of explaining a topic in a certain language. But new research has found a surprising pattern among the different editions of Wikipedia. It suggests the shape of the site’s growth is much more complex and tied to the different community of editors who build each edition.

Screen Shot 2017-10-19 at 13.34.01

A recent study, published in the journal Royal Society Open Science, analysed the patterns of some 15,000 article topics that have been covered in at least 26 language editions. The researchers looked at the sequence of languages that each article has appeared on chronically and tried to mine patterns in the trajectory that the article navigates through from one language to another.

Using different computational techniques, they managed to cluster languages into groups that mimic similar coverage patterns. Among the 26 languages that the authors analysed, English, German, and Persian stand out and do not mix with any other groups of languages. But there are three more groups that are mostly robust even when the authors change the algorithm they used for clustering.

Italian, Finish, Portuguese, Russian, Norwegian, Mandarin and Danish stick together. Polish Dutch, Spanish, Japanese, French, and Swedish cluster together. And finally, Indonesian, Turkish, Hungarian, Korean, Ukrainian, Czech, Arabic, Romanian, Bulgarian and Serbian show similar patterns.

What is surprising is that these grouping can’t simply be explained by language families, geographical closeness, or cultural similarities. What seems to be the underlying factor is more related to the characteristics of the community of editors of each language edition.

To test this systematically, the authors considered six factors for each language edition. These included the number of pages, the number of edits, the number of administrators and a measure of the content quality. The other two factors were the total number of active speakers of the language and the level of access they had to the Internet using the international Digital Access Index ranking for the country in which the language is primarily spoken.

These six parameters partially explain the differences between different clusters, but the authors suggest that the clustering of the languages is driven by a more complex combination of socio-economic variables that can capture features such as the average Internet literacy in a country or the general attitude towards the importance of knowledge and education.

The results of this paper become more interesting when compared to an earlier work that looked at the time of the day that edits are mostly committed in each language edition. While generally Wikipedia is edited during the afternoon and early evening, some language editions are being edited more in the morning and some later in the evening.

When you look at these groups of languages, there seem to be similar patterns. Unfortunately the set of languages studied in the two works are not the same and so a direct comparison is not possible.

What this research does is remind us how little we know about how information is being spread on the Internet, what the patterns of the online information landscape are and more importantly, what factors determine these patterns. The role of the Internet and the information resources it provides, in formation of our opinions and decisions that we make at the individual and societal level is undeniable. Answering these questions might help us to achieve a more democratic and unbiased global information repository.


Online movements spread explosively rather than diffusively

I’m very happy that a favourite!! paper of mine is finally published in EPJ Data Science. The paper that is titled “Rapid rise and decay in petition signing” tries to analyse and model the dynamics of popularity of online petitions.

Traditionally, collective action is known to follow a chain-reaction type of dynamics with a critical mass and a tipping point that could be all described with an S-shaped curve (schematically shown in Figure below), however, we spent about 3 years to only fail at finding any type of Sigmoid function that can fit our data!

Screen Shot 2017-10-02 at 19.17.22

The S-curve of success that is not relevant anymore!

Instead, we tried to a fit a multiplicative model with a strong decay modification. That was a much better fit to the data. It grows exponentially at the beginning, but then comes a very rapid decay in the novelty of the movement. Remember, our attention span is very short in the digital age!

Apart from the mathematical details of this fitting exercise, there are important consequences emerging from this observation:

  1. Online collective actions have very different dynamics to what we know from traditional offline movements.
  2. Online movements are explosive and much less predictable.
  3. The typical time-scale of such movements is in the range of hours and few days at longest, not weeks or years!
  4. This fast dynamic is independent of the extent of the success and prevalence of the movement.
  5. Instead of reaching a critical mass in later stages of a movement, one has to try to have a large initial momentum in order to success.

There is more to this obviously and if you’re interested, please have a look at the paper here.

The abstract of the paper reads:

Contemporary collective action, much of which involves social media and other Internet-based platforms, leaves a digital imprint which may be harvested to better understand the dynamics of mobilization. Petition signing is an example of collective action which has gained in popularity with rising use of social media and provides such data for the whole population of petition signatories for a given platform. This paper tracks the growth curves of all 20,000 petitions to the UK government petitions website ( and 1,800 petitions to the US White House site (, analyzing the rate of growth and outreach mechanism. Previous research has suggested the importance of the first day to the ultimate success of a petition, but has not examined early growth within that day, made possible here through hourly resolution in the data. The analysis shows that the vast majority of petitions do not achieve any measure of success; over 99 percent fail to get the 10,000 signatures required for an official response and only 0.1 percent attain the 100,000 required for a parliamentary debate (0.7 percent in the US). We analyze the data through a multiplicative process model framework to explain the heterogeneous growth of signatures at the population level. We define and measure an average outreach factor for petitions and show that it decays very fast (reducing to 0.1% after 10 hours in the UK and 30 hours in the US). After a day or two, a petition’s fate is virtually set. The findings challenge conventional analyses of collective action from economics and political science, where the production function has been assumed to follow an S-shaped curve.



Semantic Network Analysis of Chinese Social Connection (“Guanxi”) on Twitter

About two months ago, a paper of ours with the above title appeared on Frontiers in Digital Humanities (Big Data).

This paper has emerged from my former MSc student at the Oxford Internet Institute, Pu Yan, who is currently working on her PhD in our department.

In this paper we combined a network analysis tool with computational linguistic methods to understand the differences in the ways that Guanxi is conceptualized in two different Chinese cultures (Mainland vs Taiwan, Hong Kong, and Macau).

What I like about this paper is the discussion of the results rather than anything else. Pu, with her great domain knowledge, interprets the results in a very insightful way.

The paper is available here and the abstract says:

Guanxi, roughly translated as “social connection,” is a term commonly used in the Chinese language. In this study, we employed a linguistic approach to explore popular discourses on guanxi. Although sharing the same Confucian roots, Chinese communities inside and outside Mainland China have undergone different historical trajectories. Hence, we took a comparative approach to examine guanxi in Mainland China and in Taiwan, Hong Kong, and Macau (TW-HK-M). Comparing guanxi discourses in two Chinese societies aim at revealing the divergence of guanxi culture. The data for this research were collected on Twitter over a three-week period by searching tweets containing guanxi written in simplified Chinese characters (关系) and in traditional Chinese characters (關係). After building, visualizing, and conducting community detection on both semantic networks, two guanxi discourses were then compared in terms of their major concept sub-communities. This study aims at addressing two questions: Has the meaning of guanxi transformed in contemporary Chinese societies? And how do different socio-economic configurations affect the practice of guanxi? Results suggest that guanxi in interpersonal relationships has adapted to a new family structure in both Chinese societies. In addition, the practice of guanxi in business varies in Mainland China and in TW-HK-M. Furthermore, an extended domain was identified where guanxi is used in a macro-level discussion of state relations. Network representations of the guanxi discourses enabled reification of the concept and shed lights on the understanding of social connections and social orders in contemporary China.

Screen Shot 2017-08-22 at 19.09.02


What’s the state of the art in understanding Human-Machine Networks?

About a month ago, we finished our 2-year long EC-Horizon2020 project on Human-Machine Networks (HUMANE). The first task of this project was to perform a systematic literature review to see what the state of the art in understanding such systems is.

The short answer is that we do not know much! And what we know is not very cohesive. In other words, design, development, and exploration of human-machine systems have been done mostly through trial and error and there has not been much theory or systematic thinking involved.

We wrote a review paper to report on our systematic exploration of the literature. It took us nearly 18 months to finally get the paper published, but it was worth every second waiting as we managed to get it out at the ACM Computing Survey, which has the highest impact factor among all the journals in Computer Science.

Here you can read the paper.

And the abstract says:

In the current hyperconnected era, modern Information and Communication Technology (ICT) systems form sophisticated networks where not only do people interact with other people, but also machines take an increasingly visible and participatory role. Such Human-Machine Networks (HMNs) are embedded in the daily lives of people, both for personal and professional use. They can have a significant impact by producing synergy and innovations. The challenge in designing successful HMNs is that they cannot be developed and implemented in the same manner as networks of machines nodes alone, or following a wholly human-centric view of the network. The problem requires an interdisciplinary approach. Here, we review current research of relevance to HMNs across many disciplines. Extending the previous theoretical concepts of socio-technical systems, actor-network theory, cyber-physical-social systems, and social machines, we concentrate on the interactions among humans and between humans and machines. We identify eight types of HMNs: public-resource computing, crowdsourcing, web search engines, crowdsensing, online markets, social media, multiplayer online games and virtual worlds, and mass collaboration. We systematically select literature on each of these types and review it with a focus on implications for designing HMNs. Moreover, we discuss risks associated with HMNs and identify emerging design and development trends.

Screen Shot 2017-08-22 at 18.47.14.png


Collective Memory in the Digital Age

We finished our project on Collective Memory in the Digital Age: Understanding “Forgetting” on the Internet last summer, but our last paper just came out on Science Advances last week.

The paper, titled “The memory remains: Understanding collective memory in the digital age” presents the results of our study on collective memory patterns based on Wikipedia viewership data of articles related to aviation accidents and incidents.

Combined with our previous paper on Dynamics and biases of online attention, published last year, we mainly claim two things:

Our short-term collective memory is really short; shorter than a week, and it’s biased, and our long-term memory is pretty long, about 45 years, also biased, nevertheless modellable!  And the Internet plays important roles in both observations and also helps us to quantify and study these patterns.

Of course, we have reported few other facts and observations related to our collective memory, but the main message was that.

We report that the most important factor in memory triggering patterns is the original impact of the past event measured by its average daily page views before the recent event occurred. That means that some past events are intrinsically more memorable and our memory of them are more easily triggered. Examples of such events are the crashes related to the 9/11 terrorist attacks.

Time separation between the two events also plays an important role. The closer in time the two events are, the stronger coupling between them; and when the time separation exceeds 45 years, it becomes very unlikely that the recent event triggers any memory of the past event.

The similarity between the two events has turned out to be another important factor; This happens in the case of the Iran Air flight 655 shot down by a US navy guided missile in 1988, which was not generally well remembered but far more attention was paid to it when the Malaysia Airlines 17 flight was hit by a missile over Ukraine in 2014.

3 - press_fig-1

Page-view statistics of three recent flights (2015) and their effects on the page-views of past events from 2014, and events from 1995 to 2000. The recent events cause an increase in the viewership of some of the past events. 

Read the article here, the abstract says:

Recently developed information communication technologies, particularly the Internet, have affected how we, both as individuals and as a society, create, store, and recall information. The Internet also provides us with a great opportunity to study memory using transactional large-scale data in a quantitative framework similar to the practice in natural sciences. We make use of online data by analyzing viewership statistics of Wikipedia articles on aircraft crashes. We study the relation between recent events and past events and particularly focus on understanding memory-triggering patterns. We devise a quantitative model that explains the flow of viewership from a current event to past events based on similarity in time, geography, topic, and the hyperlink structure of Wikipedia articles. We show that, on average, the secondary flow of attention to past events generated by these remembering processes is larger than the primary attention flow to the current event. We report these previously unknown cascading effects.


The interplay between extremism and communication in a collaborative project

Collaboration is among the most fundamental social behaviours.  The Internet and particularly the Web have been originally developed to foster large scale collaboration among scientists and technicians. The more recent emergence of Web 2.0 and ubiquity of user-generated content on social web, has provided us with even more potentials and capacities for large scale collaborative projects. Projects such as Wikipedia, Zooniverse, Foldit, etc are only few examples of such collective actions for public good.

Despite the central role of collaboration in development of our societies, data-driven studies and computational approaches to understand mechanisms and to test policies are rare.

In a recent paper titled “Understanding and coping with extremism in an online collaborative environment: A data-driven modeling” that is published in PLoS ONE, we use an agent-based modelling  framework to study opinion dynamics and collaboration in Wikipedia.

Our model is very simple and minimalistic and therefore the results can be generalized to other examples of large scale collaboration rather easily.

We particularly focus on the role of extreme opinions, direct communication between agents, and punishing policies that can be implemented in order to facilitate a faster consensus.

The results are rather surprising! In the abstract of the paper we say:

… Using a model of common value production, we show that the consensus can only be reached if groups with extreme views can actively take part in the discussion and if their views are also represented in the common outcome, at least temporarily. We show that banning problematic editors mostly hinders the consensus as it delays discussion and thus the whole consensus building process. We also consider the role of direct communication between editors both in the model and in Wikipedia data (by analyzing the Wikipedia talk pages). While the model suggests that in certain conditions there is an optimal rate of “talking” vs “editing”, it correctly predicts that in the current settings of Wikipedia, more activity in talk pages is associated with more controversy.

Read the whole paper here!


This diagram shows the time to reach consensus (colour-coded) as a function of relative size of the extreme opinion groups (RoE) and the rate of direct communication between agents (r) in four different scenarios. 


Using Twitter data to study politics? Fine, but be careful!

The role of social media in shaping the new politics is undeniable. Therefore the volume of research on this topic, relying on the data that are produced by the same technologies, is ever increasing. And let’s be honest, when we say “social media” data, almost always we mean Twitter data!

Twitter is arguably the most studied and used source of data in the new field of Computational Political Science, even though in many countries Twitter is not the main player. But we all know why we use Twitter data in our studies and not for instance data mined from Facebook: Twitter data are (almost) publicly available whereas it’s (almost) impossible to collect any useful data from Facebook.

That is understandable. However, there are numerous issues with studies that are entirely relying on Twitter data.

In a mini-review paper titled “A Biased Review of Biases in Twitter Studies on Political Collective Action“, we discussed some of these issues. Only some of them and not all, and that’s why we called our paper a “biased review”.

The reason that I’m reminding you of the paper now is mostly the new surge of research on “politics and Twitter” in relation to the recent events in the UK, US, and the forthcoming elections in European countries this summer.

Here is the abstract:

In recent years researchers have gravitated to Twitter and other social media platforms as fertile ground for empirical analysis of social phenomena. Social media provides researchers access to trace data of interactions and discourse that once went unrecorded in the offline world. Researchers have sought to use these data to explain social phenomena both particular to social media and applicable to the broader social world. This paper offers a minireview of Twitter-based research on political crowd behavior. This literature offers insight into particular social phenomena on Twitter, but often fails to use standardized methods that permit interpretation beyond individual studies. Read more….


Social Media: an illustration of overestimating the relevance of social media to social events from XKCD. Available online at