Even good bots fight and a typology of Internet bots

Our new paper titled “Even good bots fight: The case of Wikipedia” has finally appeared in PLOS ONE.

There are two things about this work that I find particularly worth highlighting. First, this is the first time that anyone has looked at an ecosystem of Internet bots at scale using hard data and tried to come up with a typology of Internet bots (see the figure). Second, our team is a good example of multidisciplinary research in action: Milena Tsvetkova, the lead author, is a sociologist by training; Ruth Garcia is a computer engineer; Luciano Floridi is a professor of philosophy; and I have a PhD in physics.

If you find the paper too long, have a look at the University of Oxford press release, or the one by the Alan Turing Institute, where both Luciano and I are Faculty Fellows.

Among the many media reports on our work, I think the one in The Guardian is the closest to ideal.


A first typology of the Internet bots. See the source.


New Paper: Personal Clashes and Status in Wikipedia Edit Wars


Originally posted on HUMANE blog by Milena Tsvetkova.

Our study on disagreement in Wikipedia was just published in Scientific Reports (impact factor 5.2). In this study, we find that disagreement and conflict in Wikipedia follow specific patterns. We use complex network methods to identify three kinds of typical negative interactions: an editor confronts another editor repeatedly, an editor confronts back an equally experienced attacker, and less experienced editors confront someone else’s attacker.

Disagreement and conflict are a fact of social life but we do not like to disclose publicly whom we dislike. This poses a challenge for scientists, as we rarely have records of negative social interactions.

To circumvent this problem, we investigate when and with whom Wikipedia users edit articles. We analyze more than 4.6 million edits in 13 different language editions of Wikipedia in the period 2001-2011. We identify when an editor undoes the contribution of another editor and create a network of these “reverts”.
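To make the idea of a revert network concrete, here is a minimal sketch. It assumes a simple definition that is not spelled out in this post: a revision counts as a revert when its content hash equals that of an earlier revision of the same article, and every editor who contributed in between is treated as reverted. The function name, the tuple layout, and the toy history are all hypothetical.

```python
from collections import Counter


def revert_edges(revisions):
    """Return a Counter mapping (reverter, reverted) to a number of reverts.

    `revisions` is a chronologically ordered list of (editor, content_hash)
    tuples for a single article. Assumed definition: a revision is a revert
    if it restores the exact content of an earlier revision.
    """
    edges = Counter()
    last_seen = {}  # content hash -> index of the latest revision with it
    for i, (editor, content_hash) in enumerate(revisions):
        j = last_seen.get(content_hash)
        if j is not None and j < i - 1:
            # Everyone who edited between the restored version and now
            # has had their contribution undone (count each editor once).
            undone = {e for e, _ in revisions[j + 1:i]}
            for reverted in undone:
                if reverted != editor:
                    edges[(editor, reverted)] += 1
        last_seen[content_hash] = i
    return edges


# Toy history (made up): alice and bob keep restoring their own versions.
history = [
    ("alice", "h1"),
    ("bob", "h2"),
    ("alice", "h1"),  # restores h1, so alice reverts bob
    ("bob", "h2"),    # restores h2, so bob reverts alice
]
network = revert_edges(history)
```

The resulting counter is a weighted, directed edge list: each key is an ordered pair of editors and each value is how many times the first undid the second.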

A revert may be intended to improve the content in the article but may also indicate a negative social interaction among the editors involved. To see if the latter is the case, we analyze how often and how fast pairs of reverts occur compared to a null model. The null model removes any individual patterns of activity but preserves important characteristics of the community. It preserves the community structure centered around articles and topics and the natural irregularity of activity due to editors being in the same time zone or due to the occurrence of news-worthy events.
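The comparison against a null model can be sketched roughly as follows. This is only one simple shuffling scheme in the spirit of the description above, not necessarily the one used in the paper: within each article the timestamps are kept fixed (preserving bursts of activity and the article-centred community), while the (reverter, reverted) pairs are randomly reassigned to those timestamps. The tuple layout, the function names, and the time window are assumptions made for illustration.

```python
import random


def count_revert_backs(reverts, window):
    """Count cases where B reverts A within `window` after A reverted B.

    `reverts` is a list of (article, time, reverter, reverted) tuples.
    """
    count = 0
    for art1, t1, a, b in reverts:
        for art2, t2, c, d in reverts:
            if art1 == art2 and c == b and d == a and 0 < t2 - t1 <= window:
                count += 1
    return count


def shuffle_within_articles(reverts, rng):
    """Keep each article's timestamps, shuffle who reverted whom."""
    by_article = {}
    for art, t, a, b in reverts:
        by_article.setdefault(art, []).append((t, a, b))
    shuffled = []
    for art, rows in by_article.items():
        pairs = [(a, b) for _, a, b in rows]
        rng.shuffle(pairs)
        for (t, _, _), (a, b) in zip(rows, pairs):
            shuffled.append((art, t, a, b))
    return shuffled


def compare_to_null(reverts, window, n_samples=200, seed=0):
    """Return (observed count, null mean, null standard deviation)."""
    rng = random.Random(seed)
    observed = count_revert_backs(reverts, window)
    null = [
        count_revert_backs(shuffle_within_articles(reverts, rng), window)
        for _ in range(n_samples)
    ]
    mean = sum(null) / len(null)
    std = (sum((x - mean) ** 2 for x in null) / len(null)) ** 0.5
    return observed, mean, std


# Toy data (made up): one quick revert-back on article "X".
toy = [("X", 1, "a", "b"), ("X", 2, "b", "a"), ("X", 50, "c", "d")]
observed, null_mean, null_std = compare_to_null(toy, window=5)
```

If the observed count sits well above the null mean, measured in units of the null standard deviation, the pattern occurs more often than the shuffled data would suggest.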

Using this method, we discover that certain interactions occur more often and during shorter time intervals than one would expect from the null model. We find that Wikipedia editors systematically revert the same person, revert back their reverter, and come to defend a reverted editor beyond what would be needed just to improve and maintain the encyclopedia objectively. In addition, we analyze the editors’ status and seniority as measured by the number of article edits they have completed. This reveals that editors with equal status are more likely to respond to reverts and lower-status editors are more likely to revert someone else’s reverter, presumably to make friends and gain some social capital.

We conclude that the discovered interactions demonstrate that social processes interfere with how knowledge is negotiated. Large-scale collaboration by volunteers online provides much of the information we obtain and the software products we use today. The repeated interactions of these volunteers give rise to communities with shared identity and practice. But the social interactions in these communities can in turn affect knowledge production. Such interferences may induce biases and subjectivities into the information we rely on.

Understanding voters’ information seeking behaviour

Jonathan and I recently published a paper titled “Wikipedia traffic data and electoral prediction: towards theoretically informed models” in EPJ Data Science.

In this article we examine the possibility of predicting election results by analysing Wikipedia traffic going to different articles related to the parties involved in the election.

Unlike similar work in which socially generated online data is used in an automated learning system to predict the electoral results, without much understanding of mechanisms, here we try to provide a theoretical understanding of voters’ information seeking behaviour around election time and use that understanding to make predictions.


The left panel shows the normalized daily views of the article on the European Parliament Election, 2009 in different language editions of Wikipedia. The right panel shows the relative change between 2009 and 2014 election turnout in each country vs the relative change in the page view counts of the election article in the corresponding Wikipedia language edition. Germany and Czech Republic are marked as outliers from the general trend.
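The quantities compared in the right panel can be computed along these lines. This is a minimal sketch: the country labels and all numbers are made up for illustration and are not the data from the paper, and the Pearson correlation is just one possible summary of the trend.

```python
def relative_change(old, new):
    """Relative change from `old` to `new`, e.g. 0.1 for a 10% increase."""
    return (new - old) / old


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical (2009, 2014) values per country: turnout in percent and
# page views of the election article in the matching language edition.
turnout = {"A": (40.0, 44.0), "B": (50.0, 45.0), "C": (30.0, 33.0)}
views = {"A": (1000, 1300), "B": (2000, 1600), "C": (500, 600)}

countries = sorted(turnout)
turnout_change = [relative_change(*turnout[c]) for c in countries]
views_change = [relative_change(*views[c]) for c in countries]
r = pearson(turnout_change, views_change)
```

Each country contributes one point (change in views, change in turnout); outliers such as those marked in the figure would show up as points far from the general trend.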

We test our model on a variety of countries in the 2009 and 2014 European Parliament elections. We show that Wikipedia offers good information about changes in overall turnout at elections and also about changes in vote share for parties. It gives a particularly strong signal for new parties which are emerging to prominence.

We use these results to enhance existing theories about the drivers of aggregate patterns in online information seeking, by suggesting that:

voters are cognitive misers who seek information only when considering changing their vote.

This shows the importance of informal online information in forming the opinions of swing voters, and emphasizes the need for serious consideration of the potentials of systems like Wikipedia by parties, campaign organizers, and institutions which regulate elections.

Read more here.

Elections and Social Media Presence of the Candidates

Some have called the forthcoming UK general election a Social Media Election. It might be a bit of an exaggeration, but there is no doubt that both candidates and voters are very active on social media these days and take it seriously. The Wikipedia-Shapps story of last week is a good example of how important online presence is for candidates, journalists, and of course voters. We don’t know how important this presence is in terms of shaping the votes, but at least we can look into the data and gauge the presence of the candidates and the activity of the supporters. In this post and some others we present statistics on the online activity of parties, candidates, and of course voters. For an example, see the previous post on the searching behaviour of citizens around the debate times.

Who is on Twitter?

Candidates and parties are very much debated by supporters on social media, particularly Facebook and Twitter. But how active are candidates themselves on these platforms? In this post we show simply how many candidates from each party and in which constituencies have a Twitter account. Some of them might be more active than others and some might tweet very rarely, and we will analyse this activity in the next posts. Here we count only who has any kind of publicly known account.


Geographical distribution of candidates who have a Twitter account.

The figure above shows the geographical distributions of candidates for each party and whether they have a Twitter account. There are some interesting results in there. For example, Labour has the largest number of Twitter-active candidates, whereas ALL the SNP candidates tweet. While LibDem and Green parties have the same number of accounts, normalised by the overall number of constituencies that they are standing in, Green seems to be more Twitter-enthusiastic. UKIP loses the Twitter game both in absolute number and proportion.

Who is on Wikipedia?

Having a Twitter account is something of a personal decision. A candidate decides to have one and it’s totally up to them what to tweet. The difference in the case of Wikipedia is that ideally candidates would not create or edit an article about themselves. Also, the type of information that you can learn about a candidate from their Wikipedia page is very different from what you can gain by reading their tweets.

Geographical distribution of the candidates about whom Wikipedia has an article.

The figure above shows the constituencies in which the standing candidates are featured in the largest online encyclopaedia, Wikipedia. Here, Tories are the absolute winners in terms of the number of articles. Green candidates are the least “famous”, and the LibDems are well behind the big two. In the next post we will explore how often voters turn to Wikipedia to learn about the parties and candidates, and I’m sure that by reading it you’ll be convinced that being featured on Wikipedia is important!


All right, so far, Labour won Twitter presence and Tories took Wikipedia (remember that all the SNP candidates also have a Twitter account). But how about the gender of the candidates? Is there any gender-related feature in the social presence pattern of the candidates?

First let’s have a look at the gender distribution of the candidates.

Geographical distribution of the candidates colour-coded by gender.

As you see in the figure above, there are fewer female candidates than male ones across all the parties. Only 12% of the UKIP candidates are female while the Greens have the highest proportion at 38%. Tories sit right next to UKIP on the list of the most male oriented parties. There is also a clear pattern that most of the constituencies in the centre have male candidates.

How about social media?

Among all the candidates, 20% of male candidates are featured in Wikipedia, whereas this is about 17% for female candidates. Almost half of the Tory male candidates are in Wikipedia, whereas this goes down to 28% for their female counterparts. Only Labour female candidates have more coverage in Wikipedia compared to the males of the party, but the difference is marginal. In all the other parties, males have a higher coverage rate. The tendency of Wikipedia to pay more attention to male figures is a very well known fact.

Twitter is different. Slightly more female candidates (76%) have a Twitter account than male candidates (69%). Almost all (96%) of Labour females tweet, and Tory female candidates are more active than their male candidates. This pattern however is lost for the UKIP candidates, as 52% of their males are on Twitter compared to only 44% of their female candidates (who have the lowest rate among all the party-gender groups).
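For readers who want to reproduce this kind of breakdown from their own candidate list, here is a minimal sketch. The record layout (the `party`, `gender`, and `twitter` fields) is a hypothetical one chosen for illustration, not the actual yournextmp format, and the sample records are made up.

```python
from collections import defaultdict


def twitter_share(candidates):
    """Fraction of candidates with a Twitter account per (party, gender)."""
    totals = defaultdict(int)
    with_account = defaultdict(int)
    for c in candidates:
        key = (c["party"], c["gender"])
        totals[key] += 1
        if c["twitter"]:  # a non-empty handle counts as having an account
            with_account[key] += 1
    return {key: with_account[key] / totals[key] for key in totals}


# Made-up sample records, only to show the expected input shape.
sample = [
    {"party": "P1", "gender": "female", "twitter": "@a"},
    {"party": "P1", "gender": "female", "twitter": ""},
    {"party": "P1", "gender": "male", "twitter": "@b"},
    {"party": "P2", "gender": "male", "twitter": ""},
]
shares = twitter_share(sample)
```

The same function works for Wikipedia coverage if the `twitter` field is swapped for a flag saying whether an article exists.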


The data that we used to produce the maps and figures come mainly from a very interesting crowd-sourced project called yournextmp. However, we further validated the data using the Wikipedia and Twitter APIs. If you want to have a copy, just get in touch!

How much Wikipedia could tell us about elections

IMPORTANT NOTE: this post does not aim at predicting the results of any election. This is just a report on some publicly available data and does not draw any conclusion on it. 

In a few hours, vote casting for the Iranian presidential election, 2013 starts. And within a few days (maybe one or two) the next president of Iran for the forthcoming four years will be officially announced. This is not only an important event for all Iranians, but it could also significantly impact the short or even long term history of the region and even the world, given the complicated internal and international political situation of Iran. Clearly this discussion is outside my expertise and interests and is not the goal of this post.


One of the main differences between Iranian elections and many other countries’ is that most of the time, the candidates are not known until very close to the election date. The process of self-nomination (registration), and then approval and pre-selection of candidates by the Guardian Council, and official announcement of campaigning candidates is rather complicated and unpredictable. In short, almost no one knows the candidates until about a month before election dates.

The rather short period of election campaigns makes it very important to inform voters about the programmes and plans of the candidates as well as their previous political biography. Of course online material and social networking could play an important role in bridging between candidates and voters. Among others, Wikipedia is one of the sources that citizens refer to in order to gather at least some basic information about the candidates.

This time, there have been 8 candidates officially announced by the Ministry of Interior, of whom 2 later withdrew. I did a simple count of the number of edits, number of unique editors, and number of page views of the Persian Wikipedia pages of those 8 candidates from May 7th (start of registration) up to now. The results are presented in the following chart. To my surprise, there hasn’t been massive editorial work on the pages within this period (180 edits at most). However, page view numbers are relatively large, with a maximum of 180,000 hits during the same period, for the same candidate who has the maximum number of edits by the maximum number of unique editors. If I were a candidate, I’d have put more effort into completing and grooming my Wikipedia page, as it’s quite visible!
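The counting itself is simple. Below is a minimal sketch, assuming the revision history of a page has already been downloaded into a list of records with a timestamp and a user name (for instance via the MediaWiki API's revisions query); the field names, the dates, and the sample records are made up for illustration.

```python
from datetime import datetime


def edit_stats(revisions, start, end):
    """Count edits and unique editors with start <= timestamp <= end."""
    in_window = [r for r in revisions if start <= r["timestamp"] <= end]
    return {
        "edits": len(in_window),
        "unique_editors": len({r["user"] for r in in_window}),
    }


# Made-up revision records for one candidate's page.
sample_revisions = [
    {"timestamp": datetime(2013, 5, 1), "user": "u1"},   # before the window
    {"timestamp": datetime(2013, 5, 8), "user": "u1"},
    {"timestamp": datetime(2013, 5, 22), "user": "u2"},
    {"timestamp": datetime(2013, 6, 10), "user": "u1"},
]
stats = edit_stats(
    sample_revisions,
    start=datetime(2013, 5, 7),
    end=datetime(2013, 6, 13),
)
```

Page view counts come from a separate source (traffic statistics rather than the revision history), so they are not covered by this function.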

More interestingly, those candidates with higher page view statistics are commonly known to have higher chances of success according to official and unofficial polls during the last few weeks (I don’t believe in any kind of survey-based opinion mining, by the way!).

Another interesting aspect of page view statistics is of course their temporal evolution. In the next diagram I show the number of daily views for the top-4 candidates (according to the total number of page views and excluding Aref, who has withdrawn).


Among those 4 candidates, Jalili was the least expected and least known; he registered on the last day of registration, which produced the first peak in his page views. On May 21st, the final list of 8 candidates was announced, which explains the second peak in all 4 lines. It is even higher for Jalili because his acceptance as a candidate was something of a surprise and people apparently started to look him up. The following bumps in the page view numbers of candidates are mainly due to their presence in either live TV debates or their campaign meetings. Finally, the most interesting and relevant jump is Rouhani’s, just 2-3 days ago.

The only significant event during this period was the withdrawal of Aref, which could be seen as a supportive action for Rouhani (although never mentioned explicitly).

I’d like to emphasise that I’m not trying to do any prediction based on this low-dimensional, sparse data, but if you are interested in predictions, see our soon-to-be-published paper on Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data or read about it in the Guardian.

Wikipedia; modern platform, ancient debates on Land and Gods

What are the most controversial topics in Wikipedia? What articles have been subject to edit wars more than others? We now have a tool to explore what topics are most controversial in different languages and different parts of the world.

Wikipedia is great! There is no doubt about it. You may argue that it’s not reliable, it’s incomplete, it’s biased, etc, and I might agree. However, despite all these issues, Wikipedia IS useful, fast, practical and phenomenal!

Do you have any other example of a mass collaboration at the scale of Wikipedia, with more than 40 million editors having produced more than 37 million articles in more than 280 languages?

Coordinating even a small group of friends becomes a big issue when it comes to collaborating and reaching agreement on some topic. So how is it possible that this huge number of non-professional individuals with different backgrounds, cultures, and opinions come together and produce the largest encyclopaedia of all time?

Well, the answer is: it’s not easy and it’s not always smooth. Many Wikipedia articles are about neutral topics, like watermelon and hamsters. But there are lots of editorial wars and opinion clashes happening behind the scenes of Wikipedia as well. What are the main characteristics of these wars? What are the most disputed articles? Does it give us a window into how humans in different parts of the world think about stuff? It’s not difficult to observe some of the editorial wars in English Wikipedia; for example, see the list of controversial issues in Wikipedia. But first of all, there is no guarantee that these lists are inclusive, and more importantly, such lists are only available for the biggest language editions like English Wikipedia.

There have already been nice studies on Wikipedia conflict, but unfortunately they are limited to English Wikipedia. In a recent multidisciplinary project (see the paper), my colleagues Anselm Spoerri (communication and information scientist), Mark Graham (geographer), János Kertész (senior physicist), and I (physicist in transition to computational social scientist) studied Wikipedia editorial wars in 13 different language editions including: English, German, French, Spanish, Portuguese, … Persian, Arabic, Hebrew, … Czech, Hungarian, Romanian, … Chinese and Japanese.

We have developed our tools to locate, quantify, and rank the most controversial articles in different language editions without being able to read the language! Our method to measure editorial wars has been reported in our previous papers on Dynamics of conflicts in Wikipedia and Edit wars in Wikipedia.

Now that we have measures of controversy for all the articles in the language editions under study, we could have lots of fun!

First take a look at the awesome post by Mark on mapping conflict and geographical locations of the controversial articles, and then I’ll tell you something about most debated topics in different language editions.

Here’s the top-10 list of most controversial articles in different languages:

Rank | English | German | French | Spanish | Portuguese | Czech | Hungarian | Romanian | Arabic | Persian | Hebrew | Japanese | Chinese
1 | George W. Bush | Croatia | Ségolène Royal | Chile | São Paulo | Homosexuality | Gypsy Crime | FC Universitatea Craiova | Ash’ari | Báb | Chabad | Koreans in Japan | Taiwan
2 | Anarchism | Scientology | Unidentified flying object | Club América | Brazil | Psychotronics | Atheism | Mircea Badea | Ali bin Talal al Jahani | Fatimah | Chabad messianism | Korea origin theory | List of upcoming TVB series
3 | Muhammad | 9/11 conspiracy theories | Jehovah’s Witnesses | Opus Dei | Rede Record | Telepathy | Hungarian radical right | Disney Channel (Romania) | Muhammad | Mahmoud Ahmadinejad | 2006 Lebanon War | Men’s rights | TVB
4 | List of WWE personnel | Fraternities | Jesus | Athletic Bilbao | José Serra | Communism | Viktor Orbán | Legionnaires’ rebellion & Bucharest pogrom | Ali | People’s Mujahedin of Iran | B’Tselem | internet right-wing | China
5 | Global warming | Homeopathy | Sigmund Freud | Andrés Manuel López Obrador | Grêmio Foot-Ball Porto Alegrense | Homophobia | Hungarian Guard Movement | Lugoj | Egypt | Criticism of the Quran | Benjamin Netanyahu | AKB48 | Chiang Kai-shek
6 | Circumcision | Adolf Hitler | September 11 attacks | Newell’s Old Boys | Sport Club Corinthians Paulista | Jesus | Ferenc Gyurcsány’s speech in May 2006 | Vladimir Tismăneanu | Syria | Tabriz | Jewish settlement in Hebron | Kamen Rider Series | Ma Ying-jeou
7 | United States | Jesus | Muhammad al-Durrah incident | FC Barcelona | Cyndi Lauper | Moravia | The Mortimer case | Craiova | Sunni Islam | Ali Khamenei | Daphni Leef | One Piece | Chen Shui-bian
8 | Jesus | Hugo Chávez | Islamophobia | Homeopathy | Dilma Rousseff | Sexual orientation change efforts | Hungarian Far-right | Romania | Wahhabi | Ruhollah Khomeini | Gaza War | Kim Yu-Na | Mao Zedong
9 | Race and intelligence | Minimum wage | God in Christianity | Augusto Pinochet | Luiz Inácio Lula da Silva | Ross Hedvíček | Jobbik | Traian Băsescu | Yasser Al-Habib | Massoud Rajavi | Beitar Jerusalem F.C. | Mizuho Fukushima | Second Sino-Japanese War
10 | Christianity | Rudolf Steiner | Nuclear power debate | Alianza Lima | Guns N’ Roses | Israel | Polgár Tamás | Romanian Orthodox Church | Arab people | Muhammad | Ariel Sharon | GoGo Sentai Boukenger | Tiananmen Square protests of 1989

Interesting and familiar titles, right? Did you realise that some titles appear in many different language editions? Many of them are about religion: Jesus; countries: Israel, Brazil; politics: Ségolène Royal, George W. Bush.


If you’d like to take a look at the top-100, or in case you fancy having the complete lists with controversy scores, get them from here.

What you see at the right is a word cloud of all the titles in the top-100 lists.

There are interesting patterns. Similarities and differences. International and global issues and very local items. An interactive visualization of the top-100 lists in different languages, showing overlaps and similarities, is waiting for you here.

To get a more general picture, we have to look further than just “titles”. We need to consider the more general topics and concepts into which the articles can be categorised.

We hand-coded all the articles in top-100 lists with 10 different category tags. See the population of topical categories in each language in the interactive chart below (click on it!).


Some interesting patterns: Religion and Politics are debated in Persian, Arabic, and Hebrew even more than in the others. Spanish and Portuguese Wikipedias are full of wars on football clubs. French and Czech Wikipedias have relatively more disputed articles on science and technology related topics. Chinese and Japanese Wikipedias are battlefields for manga, anime, TV series, and entertainment fans. TVB products appear quite often in the Chinese list, and well, the number 19 most disputed article in Japanese Wikipedia is “Penis”!

“So What?” is probably what you are asking. Generally speaking, the implications of this kind of study are two-fold:

1) These results could help Wikipedia and similar projects (which are already many, and growing) to be better designed, considering these experiences and the observations we made. Local effects shouldn’t be neglected, and especially Wikipedias with smaller communities of editors could be inefficiently focused on local issues.

2) We believe that this kind of case study (Wikipedia being the case) could help us and social scientists to understand more about human societies. Topics like conflict emergence, its dynamics, its universal features, and the resolution mechanisms could be empirically examined for the first time. Most of the theories in social science could never have been tested against real world experiments (in contrast to natural sciences). But now, thanks to our digital life of today, we are able to track and analyse all the actions and interactions of a huge society of individuals (here, Wikipedia editors), so why not test the pre-existing social theories in a large “social experiment” of Wikipedia?

Read more about this project:

Yasseri, Taha, Spoerri, Anselm, Graham, Mark and Kertesz, Janos, The Most Controversial Topics in Wikipedia: A Multilingual and Geographical Analysis (May 23, 2013). Fichman P., Hara N., editors. Global Wikipedia: International and Cross-Cultural Issues in Online Collaboration. Scarecrow Press (2014), Forthcoming. Available at SSRN: http://ssrn.com/abstract=2269392

And more on Wikipedia by our team:

Török, J., Iñiguez, G., Yasseri, T., San Miguel, M., Kaski, K., and Kertész, J. (2013) Opinions, Conflicts and Consensus: Modeling Social Dynamics in a Collaborative Environment. Physical Review Letters 110 (8).

Yasseri, T., Sumi, R., Rung, A., Kornai, A., and Kertész, J. (2012) Dynamics of conflicts in Wikipedia. PLoS ONE 7(6): e38869.

Yasseri, T., Kornai, A., and Kertész, J. (2012) A practical approach to language complexity: a Wikipedia case study. PLoS ONE 7(11): e48386.

Yasseri, T., Sumi, R., and Kertész, J. (2012) Circadian patterns of Wikipedia editorial activity: A demographic analysis. PLoS ONE 7(1): e30091.

Mestyán, M., Yasseri, T., and Kertész, J. (2012) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data.

What can Wikipedia tell us about the Cannes Festival just before the closing

Among all the interesting events taking place today, one is the Closing Ceremony of 2013 Cannes Film Festival.

If you have already seen our recent paper on Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data, you already know that I’m a big fan of movies.

In that paper we investigated the possibility of predicting the future success of movies based on the activity level of Wikipedia editors in combination with page view statistics. We applied a very simple linear model on a very rich set of Wikipedia transactional data and, well, at the end could make rather good “post-dictions” about a sample of USA movies released in 2010.
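To give a flavour of what “a very simple linear model” means here, the sketch below fits a straight line by ordinary least squares. It is a deliberately reduced illustration with a single predictor, whereas the paper combines several activity measures, and all numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical training data: pre-release page views (in thousands) and
# opening revenue (in millions) for four already released movies.
page_views = [10.0, 20.0, 30.0, 40.0]
revenue = [21.0, 41.0, 61.0, 81.0]

slope, intercept = fit_line(page_views, revenue)

# "Post-diction" for a movie whose page was viewed 25 thousand times.
predicted = slope * 25.0 + intercept
```

With real data the points would of course not fall exactly on a line, and the quality of the fit would have to be checked on movies that were not used for fitting.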

We all know that “Prediction is very difficult, especially about the future!”, so the question is whether we could use the method from that paper to predict anything about movie success in the future.

This is not what I want to talk about now! But on an adventurous Saturday evening, I did some data collection to see whether Wikipedia could give me a hint about the award winners at tonight’s Cannes closing ceremony.

There are 20 movies in the Competition section. All of them have an article in English Wikipedia, though some very short. First I collected some of the activity measures: Length of the article for each movie, how many times the page has been edited, and by how many distinct editors, how many times the page has been viewed from the beginning of the Festival (by editors and random readers), and finally how many different Wikipedia language editions have an article about the movie.

An interactive visualisation of the data is here (click on it!).

All pages together have been viewed more than 600,000 times. That’s a big number. However, I was surprised by the small number of edits by an even smaller number of editors: 15 articles have been edited fewer than 50 times, and by only around 5 editors! The average length of the 20 pages is 3700 bytes, just slightly more than a page. Most of the movies have an article in 3 or 4 different languages and no more (including English).

Well, most of the movies have not been released yet, which might explain why they are so under-represented in Wikipedia at the moment. Nevertheless, there are already interesting patterns.

The top-4 movies in terms of page views are also among the top-4 in number of edits, editors, and language versions, and their articles are also relatively longer. There is an exception though: The Past (the new drama by Oscar winner Asghar Farhadi), which is 8th in the page view ranking but has activity parameters comparable to the top-4.

Play around with the visualization, you may see other patterns.

Now let’s focus on the top-3 most viewed articles, which are well separated from the rest of the movies: Only God Forgives, a thriller by Nicolas Winding Refn; Inside Llewyn Davis, the Coen Brothers’ drama; and Behind the Candelabra by Steven Soderbergh.

The first of these 3 was released on 22 May in France, which might explain why it is so popular. See the diagram below, which shows the daily page views from a week before the Festival opening until yesterday (click to enlarge).


The first peak is clearly due to the nomination announcement on 18 April, and the second peak of Only God Forgives is due to its release. So what I’m saying is that maybe the Coens have done a better job and we only need to wait until their film reaches the market. We will see what the jury thinks about it!

Now you may think I’m a Coens fan, but no! My favourite directors among these 20 (actually 21, counting the Coen Brothers as 2!) are Roman Polanski and Asghar Farhadi, with Venus in Fur and The Past this year. Talking about directors, let’s have a look at the Wikipedia page view statistics of directors and compare them to their movies. The following figures show the daily views for those two directors and the movies they brought to Cannes this year. Yellow lines are the movies and red ones the corresponding directors (click to enlarge).


That’s interesting, isn’t it? The Wikipedia articles of Asghar Farhadi and his movie (right panel) are not only at the same level of “popularity”, but their fluctuations are also heavily correlated (the second peak comes from the movie release in France), whereas Roman Polanski (left panel) seems to be much more famous than his movie, with weird ups and downs in his data!

The last piece is on the main Wikipedia article about the event: 2013 Cannes Film Festival, with more than 123,000 visitors within the last 2 months. If someone wants a baseline for a detailed fluctuation analysis of individual movies, I would recommend the following diagram, which clearly shows the main events and the overall public interest in them.


And finally, don’t forget to take a look at our paper:

Mestyán, M., Yasseri, T., and Kertész, J. (2012) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. Forthcoming.