Research Design in Social Data Science; the online course

About a year ago, Sage Campus contacted me with an offer that I could not refuse! An opportunity to work with a professional team of designers and developers to produce an online course on Research Methods in Social Data Science.

I have been teaching different methods courses in the area of social data science over the past few years, and have been doing research myself in the same area for about a decade, but doing something is very different to teaching how to do that thing. I learnt it the hard way!

Obviously, we design and redesign our studies and research projects, and think about how to frame and reframe them, at various stages, from writing the proposal all the way to preparing the final publications. However, teaching the same process in a rather abstract medium is quite challenging, particularly in a field such as Social Data Science, whose identity is still forming.

I am glad that I accepted the challenge. Privileged to have great support from Sage, I finally managed to design, develop, and publish the course earlier this year.

Among many aspects of the interactive environment of the course, I particularly like the animations which give the course takers an overview of each module in a rather engaging and entertaining way.

Here is a sneak-preview!

The first cohort of the course was launched in October, and I must say the feedback I received from the course takers was very flattering and beyond my expectations! The next cohort is scheduled for March and I cannot wait!

Finally, if you promise not to tell anyone: the course is being turned into a book to be published by Sage next year, but more about that later!



The Internet and your inner English Tea Merchant

Earlier this year I had the honour of being invited to give a TEDx talk in Thessaloniki. That was an amazing experience: I had never talked to 800+ people while being filmed by 4 cameras and broadcast live, all at the same time! It was kind of pushing it to the limit for me, but it was really fun! I must say that the TEDx Thessaloniki team were extremely professional and helpful! The talk is now on YouTube, but of course my delivery was slightly different to the script (try to memorize a 15-minute lecture and then deliver it to a huge crowd!). So, I thought I’d post the script here as well as the video!

In November 2015, when the terrorist attacks happened in Paris, the world went into shock. People and nations all around the globe showed their solidarity in different ways: Iconic buildings were lit in the colours of the French flag, candles were lit in the streets, while online, people showed their respect by applying French flag filters to their profile pictures.

Unfortunately, earlier this year, another terrorist attack occurred, but this time in New Zealand. 51 innocent people died as a result. Yet not once did I see someone update their profile picture with the NZ flag, let alone an entire building illuminated in its colours! Well, mostly because we actually don’t know what New Zealand’s flag looks like!

Joking aside, you might say, well, Paris is the capital of France and France is central to Europe, which is central to the world, whereas New Zealand is waaay down there, below Australia, in the corner!

You might say, the attacks in Paris were conducted by fundamentalist Muslims, whereas in New Zealand the victims were Muslims and … you know you don’t want to support Muslims on your Facebook profile, particularly if you want to travel to the US in the near future! You might say, in Paris 130 people got killed whereas in New Zealand the number of victims was only 50, so it’s not really worth the trouble of updating your profile picture!

Then I might say, hey, how about Sri Lanka? Three weeks ago, there was a series of terrorist attacks by ISIS in Sri Lanka, killing more than 250 Christians. Why didn’t we illuminate our buildings then!? You would say, yeah, we just said, Sri Lanka is also down there in the corner! We don’t know what their flag looks like either. We might have guessed New Zealand’s flag must look like Australia’s, but we have no clue about Sri Lanka’s flag!

You might think I’m joking, but actually all these “excuses” that I listed are observed in a large-scale data analysis that we conducted to measure collective attention and collective memory of people, when it comes to bad news and disasters.

But to measure “public attention” and how much people care about a topic, we had to be creative! As no one wants to walk up to people on the street and ask them: “Hi there, on a scale of 1 to 10 how much do you care about this disaster?” So instead, we turned to the internet, where people willingly share, with everyone, exactly how much they care!

In particular, we focused on how people reacted to airplane crashes. To do this, we looked at airplane crash articles on Wikipedia and counted how many times people viewed them within the first week after the crash. Wikipedia has been around since 2001 and since then we have had more than 200 crashes. The first thing we observed was that there is a big difference in the amount of attention that events trigger, depending on whether the number of casualties is smaller or larger than 50 people. Basically, events with more than 50 deaths create much more public attention. That was a finding very well received by terrorist groups all around the world!
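The counting step described above can be sketched roughly as follows. This is only a minimal illustration, assuming the daily page-view counts of an article are already available as a mapping from date to views. The function name `first_week_views`, the made-up numbers, and the choice to treat the crash day as day one of the week are assumptions for the example, not details taken from the study.

```python
from datetime import date, timedelta


def first_week_views(daily_views, crash_day, window_days=7):
    """Sum page views over the first `window_days` days, starting on crash_day.

    daily_views: dict mapping datetime.date -> number of views (hypothetical input).
    Days without a recorded value count as zero views.
    """
    return sum(
        daily_views.get(crash_day + timedelta(days=offset), 0)
        for offset in range(window_days)
    )


# Invented numbers, purely for illustration.
crash_day = date(2020, 1, 1)
counts = [900, 5000, 3000, 1500, 800, 400, 300, 200, 100]
daily_views = {crash_day + timedelta(days=i): v for i, v in enumerate(counts)}

week_total = first_week_views(daily_views, crash_day)
print(week_total)
```

Comparing such first-week totals across articles, grouped for instance by number of deaths or by airline region, is the kind of comparison the talk refers to.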

Then we thought: what about the nationality of passengers? Does, let’s say, an American death receive the same amount of attention as a Greek death? (As a Persian I’m historically very excited to talk about Greek deaths! I mean ancient Greeks; you guys are ok!) It was hard to determine the nationality of all the victims on all these flights, and the exact location of a crash is not always known. But we could easily extract the subcontinent of the operating airline. And guess what: we found that, on average, when an airplane involved in a crash was operated by a North American airline, the attention the crash was likely to receive was 50 times more than if the plane had been operated by an African airline with the same number of deaths!

A European death on average triggers 16 times more attention than an Australian death! So, Australian countries are really in the corner!

These analyses were based on data we collected from English Wikipedia. We repeated the same analysis using Spanish Wikipedia, and the good news was that we found that the readers of Spanish Wikipedia are equally racist! There, the largest attention is given to the Latin American flights of course.

Of course, we are biased when it comes to how much we care about things and places. We care much more about things that are similar to us, closer to us, and … benefit us!

Let’s have a look at this map:


Cornell University – PJ Mode Collection of Persuasive Cartography

This is one of my favourite maps. Here we have the British Isles in the middle, China at the left side, and the rest of Europe, Africa and the rest of Asia here at the right side. Oh, and here are the Americas. Yeah, okay, so the map is a little bit odd. But hey, I didn’t make it up. It was in fact produced by an English Tea Merchant in the 1930s, who titled it “The World”. The UK is big and right in the centre. Of course, from his point of view, “the world” would look like this. And also from the point of view of some 17.5 million people in the UK who voted for Brexit!

But let’s not point fingers. I mean, we all have our own inner English Tea Merchant! Our perception of the world, countries, and most importantly humans is as distorted and biased as this map. And not only our perception, but simply how much we care about the world and humans.

Interestingly, the Internet is a great tool to show us these biases. Because everything that we do on the Internet leaves a digital footprint and by analysing the data generated by our activities on the Internet, we can have a global-scale picture of our behaviour, similarities, differences, biases and subjectivities. Sometimes all we need for change is a mirror that we can see ourselves in.

But the Internet also provides us with one more thing. No, I mean apart from the increasingly degrading pornography! The Internet provides us with the sum of human knowledge! And cat videos. Let’s focus on the first one. We have things like Wikipedia, which I mentioned. Wikipedia is the largest repository of human knowledge online. That is to say, a collection of human knowledge that is mostly collected by young, white men from rich countries, but still!

To be fair, what distinguishes Wikipedia from other things that are produced and written by rich, white men, is that theoretically it’s open to everyone. Any person who can read Wikipedia can also edit it. That of course leads to huge editorial wars that I have spent a large portion of my short career studying. But these edit wars are exactly what makes Wikipedia reliable and great! Articles get edited again and again and after a while they are so well polished that all the editors are happy with them.

Another interesting thing about these huge online repositories is that, unlike paper-based encyclopaedias, people actually use them! Whenever a new crash is reported, people flock to the site to read about it. If we then trace the paths of these readers to see what they read next, we find they continue to engage with Wikipedia to learn about previous airplane disasters. This in turn inflates the attention that these previous crashes receive, to such an extent that the new interest completely overshadows the initial attention that a plane crash page received when it was first reported!

For example, when a Malaysia Airlines passenger flight was shot down by a military missile in 2014, not only did many people read about this event on Wikipedia, but we could also see a significant increase in the readership of the article about a similar event, in which an Iranian commercial flight was shot down by the US Navy in 1988, killing 290 ordinary citizens and flight crew.

It’s not only you who goes to YouTube to quickly watch the highlights of yesterday’s game and ends up watching all the football videos that have ever been uploaded to the Internet! Analysing these traces of Wikipedia users, we also found other interesting patterns. For example, we saw that the flow of attention from a new crash is bigger towards old crashes that are more similar in cause and geography, and closer in time. So, our collective memory is biased just like our collective attention. For instance, the flow from current events to past events starts to vanish if the time separation between them exceeds 40 years.

So although we’re interested in past events, there is a limit to how far back we’re actually willing to go: many people are more interested in recent historical events than in those that occurred over 40 years ago. That may not sound great, but the fact that we are just a few clicks away from the whole history of humankind, and that current events trigger our interest in exploring the past, is an entirely Internet-mediated phenomenon.

The last thing about the Internet is that it also connects us! It theoretically connects us to literally any other human on this planet with access to the Internet. And Justin Bieber! Think about Twitter and the conversations that it hosts around live events, political topics, and cultural phenomena, during which millions of users from all around the world join in and talk to -and troll- each other, all at the same time! Never before in the history of mankind has it been possible to have such a huge “town hall” or, as you say, “agora”. Of course, we criticise social media, and Twitter particularly, for certain things such as fake news, hate speech, limiting our attention to people who are similar to us, and putting us in filter bubbles. But remember: we created the Internet, and it resembles our offline world.

Fake news and hateful speech have existed for a very long time.

Many people argue that these are the main ingredients of popular media. When it comes to filter bubbles, we only need to go back 30 years (because who wants to go back 40 years, am I right?) to find that the majority of people were exposed to only one newspaper and very few TV channels, and could only really interact with very similar people within their vicinity on a daily basis. For instance, people at their local church or their fellows at their regular strip club.

But today, thanks to the Internet, we are only a few clicks away from alternative news outlets and people with opposing ideologies. Unfortunately, we choose not to interact much with people of different opinions, and social media platforms encourage us to trim our social ties, keep the ones we like, and avoid facing people who are different. But if we want, we can easily break the bubbles and expose ourselves to a whole different environment. We can step out of our comfort zone and use this opportunity to get closer to one another.

Remember, the same technology that these days we claim is killing our democracies and wiping out our civilizations, just a bit more than a decade ago, led to the creation of something like Wikipedia. Of course, the difference in Wikipedia is that people of opposing opinions HAVE TO work together and get to a consensus, whereas on social media we are just one click away from removing a connection or ending a friendship.

We have only just arrived on planet Internet. A planet whose geography, size, pace, and physics are all over the place! They are nothing like what we have experienced before in the thousands of years of human social history. We have to start discovering the rules of nature all over again! But as scary as that sounds, this also means we have the chance to grow and learn like never before! We haven’t done great on planet Earth and we have almost destroyed it. But let’s do better with the Internet! We can unite or stay divided. But we can and we should save the Internet. Because the Internet is the new land.

Your attendance at a concert affects the music your friends listen to

In a recent work, we studied music listenership patterns of 1.3 million online users to measure the direct and indirect effects of live concerts on song plays. We observe social contagion for only a certain type of musician and discuss how it can affect the music market.

The Internet has fundamentally reshaped music and other cultural markets. The ubiquity of music in the digital world was presciently predicted by David Bowie, who envisioned the future of music as something akin to running water or electricity. And indeed, we are now one tap-of-the-finger away from almost any music track that we want to listen to, at almost no cost. This is a remarkable departure from a time when most music had a physical embodiment in the form of records, tapes, CDs, or even MP3 players. Today, music is in the air.

An important question now is: how does this revolutionary change affect the music market? More precisely, how are musicians supposed to make money? A rather straightforward answer is through live events and concerts, but is the revenue of a concert or tour limited exclusively to ticket sales and broadcast revenues? It has been argued that live events stimulate secondary and indirect sources of revenue by growing the musician’s fanbase, which itself leads to more sales.

In this work, however, we find evidence that music listenership can be contagious. Namely, a live event can not only increase listenership among the people who attend it; in certain cases, it can also “infect” non-attendee listeners who are in the social proximity of concert attendees. The contagion, however, is complex, meaning that its dynamics are not defined by the structure of the underlying social network. We show that the fame of the musician plays an important role in moderating the size of the contagion.

While the increase in listenership of fans who attended a concert is about the same no matter the type of artist, the secondary effect on non-attendees is much larger for well-known artists as compared to emerging stars (the so-called “hyped” artists).

Putting it simply: if my friends attend a concert featuring a band that I am likely to know—let’s say Metallica—it is likely to increase the number of Metallica songs I listen to just after the event. But if a band is less popular (and perhaps I have not heard of them), there is no such secondary effect.

The additional income that social contagion can bring for a typical concert can, according to our most conservative estimates, be as much as a few thousand dollars per event. But as this additional income can only be recovered by the most established artists, a rich-get-richer mechanism holds, further increasing the existing inequality in the market.

In the era of Myspace, it was widely believed that the Internet is democratizing the music industry. We may need to rethink such a conclusion; considering how social influence can create avalanches of attention and revenues for bigger names at the expense of other, less well-known artists, the Internet might not be so egalitarian after all. While there is certainly an unparalleled opportunity for new artists to make their work available to the world via such platforms as SoundCloud, they may have difficulty being heard over the songs of the well-established, big-name musicians.

The modelling paradigm that we adopted in this work is abstract enough to be applicable to other collective behaviour in online media. Political engagement, participation in public good actions, and the spread of (mis)information are a few examples. How social media can affect these “markets” via social influence and contagion is among the most important questions facing computational social scientists today.

A paradoxical advantage of studying online systems is that the very same technology that is reshaping our personal and social behaviour generates an unprecedented amount of data that can be utilized to study the very same changes. And this may be what makes Social Data Science the fastest growing field of research today.


The Twenty-First Ion Prophecy – The Nature Festivals –

Fluid sexism; it is rarely an isolated experience

Earlier this year, we finally published the results of our project on Everyday Sexism.

We used computational text mining techniques to analyse the content of some 80 thousand stories of everyday instances of sexism posted on the Everyday Sexism website.

Our results suggest that sexism is fluid; it’s not limited to a certain space, class, culture, or time. It takes different forms and shapes, but these are connected. Sexism penetrates all aspects of our lives. It can be subtle and small, and it can be violent and traumatizing, but it is rarely an isolated experience.



Network visualizations of the topics. The weight of the connections between pairs of the topics is based on the similarity of how the words are assigned to them.


The abstract of the paper reads:

The Everyday Sexism Project documents everyday examples of sexism reported by volunteer contributors from all around the world. It collected 100,000 entries in 13+ languages within the first 3 years of its existence. The content of reports in various languages submitted to Everyday Sexism is a valuable source of crowdsourced information with great potential for feminist and gender studies. In this paper, we take a computational approach to analyze the content of reports. We use topic-modeling techniques to extract emerging topics and concepts from the reports, and to map the semantic relations between those topics. The resulting picture closely resembles and adds to that arrived at through qualitative analysis, showing that this form of topic modeling could be useful for sifting through datasets that had not previously been subject to any analysis. More precisely, we come up with a map of topics for two different resolutions of our topic model and discuss the connection between the identified topics. In the low-resolution picture, for instance, we found Public space/Street, Online, Work related/Office, Transport, School, Media harassment, and Domestic abuse. Among these, the strongest connection is between Public space/Street harassment and Domestic abuse and sexism in personal relationships. The strength of the relationships between topics illustrates the fluid and ubiquitous nature of sexism, with no single experience being unrelated to another.
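To make the idea of a weighted topic network a little more concrete, here is a minimal sketch of how connection weights between topics could be derived from the words assigned to them. It assumes each topic is represented as a mapping from words to weights and uses cosine similarity as the measure; both the representation and the choice of cosine similarity are assumptions made for illustration, and the paper may well use a different similarity. The word weights below are invented.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_network(topics):
    """Return {(topic_a, topic_b): weight} for every pair of topics."""
    return {
        (a, b): cosine_similarity(topics[a], topics[b])
        for a, b in combinations(sorted(topics), 2)
    }


# Invented word weights, for illustration only (not taken from the paper).
topics = {
    "street": {"street": 0.5, "shouted": 0.3, "walking": 0.2},
    "transport": {"bus": 0.5, "train": 0.3, "walking": 0.2},
    "work": {"boss": 0.6, "office": 0.4},
}

edges = topic_network(topics)
for pair, weight in edges.items():
    print(pair, round(weight, 3))
```

Topics that share heavily weighted words end up strongly connected, while topics with no words in common get a weight of zero.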

The structure of world-stock-market network resembles geographical ties

In a recent paper, we studied 40 stock markets from top GDP countries to analyse the correlations and connections between them. As expected, we did observe strong correlations between the ups and downs of these markets at the global level. However, when we used Random Matrix Theory to detect the sub-communities of this global network, we realised that geography plays an important role.

In these “Brexity” times, the most notable observation is how deeply the UK market is embedded in the sub-network of European markets. We often hear that the European partners can be replaced with the US and China. The numbers do not support this!


Dendrogram of the forty markets based on their cross-correlations. The colored groups contain members with at least thirty percent of correlation. The top six markets do not belong to any group.

The paper’s abstract reads:

Forty stock market indices of the world with the highest GDP has been studied. We show each market is a part of a global structure, that we call “world-stock-market network”. Where the correlation between two markets is not independent of the correlation between two other markets. Towards this end, we analyze the cross-correlation matrix of the indices of these forty markets using Random Matrix Theory (RMT). We find the degree of collective behavior among the markets and the share of each market in the world global network. This finding together with the results obtained from the same calculation on four stock markets reinforces the idea of a world financial market. Finally, we draw the dendrogram of the cross-correlation matrix to make communities in this abstract global market visible. The results show that the world financial market comprises three communities each of which includes stock markets with geographical proximity.
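For readers unfamiliar with RMT, the core idea can be sketched in a few lines. The eigenvalues of a cross-correlation matrix built from N uncorrelated series of length T are expected to stay below the Marchenko-Pastur upper edge, (1 + sqrt(N/T))^2, so an eigenvalue far above that edge signals collective behaviour. The snippet below is only a rough illustration on simulated data with one shared "market" factor; the function names, the simulated returns, and the use of power iteration are assumptions for the example and not the paper's actual pipeline.

```python
import math
import random


def correlation_matrix(series):
    """Pearson correlation matrix of a list of equal-length return series."""
    def standardise(xs):
        mean = sum(xs) / len(xs)
        std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
        return [(x - mean) / std for x in xs]

    z = [standardise(xs) for xs in series]
    t = len(series[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / t for zj in z] for zi in z]


def largest_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix,
    estimated by power iteration from a uniform start vector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return eigenvalue


def marchenko_pastur_upper(n_series, n_observations):
    """Upper edge of the eigenvalue spectrum expected for purely random correlations."""
    return (1 + math.sqrt(n_series / n_observations)) ** 2


# Simulated returns: every "market" follows one shared factor plus its own noise.
rng = random.Random(42)
n_markets, n_days = 10, 250
market = [rng.gauss(0, 1) for _ in range(n_days)]
returns = [[m + rng.gauss(0, 1) for m in market] for _ in range(n_markets)]

top = largest_eigenvalue(correlation_matrix(returns))
bound = marchenko_pastur_upper(n_markets, n_days)
print(top > bound)
```

In this toy setting the largest eigenvalue sits well above the random-matrix bound, which is the kind of signature that is read as a common, market-wide mode.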

Corruption is in the fabric of societies

Many think that corruption is a result of wealth or the lack of it. Some assume that tighter regulations might stop corruption. Hence, socio-economic metrics have been used to explain the level of corruption in different places with different regulatory regimes.

In our recent work, we show that corruption is in the fabric of societies and that the structure of the social networks in cities is strongly related to the chance of corruption. Certain characteristics of a town’s social ties, such as fragmentation or diversity of residents’ connections, measured via an online social network, predict corruption in local government contracting above and beyond socio-economic variables.

Here is the abstract of the article:

Corruption is a social plague: gains accrue to small groups, while its costs are borne by everyone. Significant variation in its level between and within countries suggests a relationship between social structure and the prevalence of corruption, yet, large-scale empirical studies thereof have been missing due to lack of data. In this paper, we relate the structural characteristics of social capital of settlements with corruption in their local governments. Using datasets from Hungary, we quantify corruption risk by suppressed competition and lack of transparency in the settlement’s awarded public contracts. We characterize social capital using social network data from a popular online platform. Controlling for social, economic and political factors, we find that settlements with fragmented social networks, indicating an excess of bonding social capital has higher corruption risk, and settlements with more diverse external connectivity, suggesting a surplus of bridging social capital is less exposed to corruption. We interpret fragmentation as fostering in-group favouritism and conformity, which increase corruption, while diversity facilitates impartiality in public life and stifles corruption.


Ego networks with low (a) and high (b) diversity. Colours indicate membership in detected communities in the ego network. Circles denote users from the same settlement as the ego, while triangles mark users from elsewhere. The high diversity user’s network has clusters of alters mostly from different settlements.

The curious patterns of Wikipedia growth

Wikipedia is arguably the number one source of information online for the speakers of many languages. But not all the different language editions are developed equally. The English edition is by far the largest and the most complete one, and the other 280 language editions have many fewer articles.

The coverage of different language editions also doesn’t follow a standard template. Some language editions are heavier on politics, for instance, and some have more articles on science-related topics, which even leads to different sets of controversial topics in different languages. Why does the coverage of different editions vary so much?

You might think it’s to do with the emphasis different cultures place on different subjects, or the ease of explaining a topic in a certain language. But new research has found a surprising pattern among the different editions of Wikipedia. It suggests the shape of the site’s growth is much more complex and tied to the different communities of editors who build each edition.


A recent study, published in the journal Royal Society Open Science, analysed the patterns of some 15,000 article topics that have been covered in at least 26 language editions. The researchers looked at the sequence of languages in which each article appeared chronologically and tried to mine patterns in the trajectory that an article follows from one language to another.
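As a rough illustration of what such a trajectory looks like as data, the sketch below orders the language editions of each article by creation date and then computes, for each language, its average position in those sequences. The input format, the function names, and the "average normalised position" summary are all assumptions made for this example; the study itself uses its own, more elaborate techniques. The dates are invented.

```python
from collections import defaultdict
from datetime import date


def language_sequence(creation_dates):
    """Languages ordered by when the article first appeared in each edition."""
    return [lang for lang, _ in sorted(creation_dates.items(), key=lambda item: item[1])]


def mean_positions(articles):
    """Average normalised position (0 = first, 1 = last) of each language."""
    positions = defaultdict(list)
    for creation_dates in articles.values():
        sequence = language_sequence(creation_dates)
        last_index = max(len(sequence) - 1, 1)
        for index, lang in enumerate(sequence):
            positions[lang].append(index / last_index)
    return {lang: sum(p) / len(p) for lang, p in positions.items()}


# Hypothetical creation dates, invented for the example.
articles = {
    "Topic A": {"en": date(2003, 1, 5), "de": date(2004, 3, 1), "fa": date(2007, 6, 9)},
    "Topic B": {"de": date(2002, 11, 2), "en": date(2003, 2, 14), "fa": date(2006, 1, 20)},
}

sequence_a = language_sequence(articles["Topic A"])
averages = mean_positions(articles)
print(sequence_a)
print(averages)
```

Languages with similar position profiles across many articles would be natural candidates for ending up in the same group.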

Using different computational techniques, they managed to cluster languages into groups that mimic similar coverage patterns. Among the 26 languages that the authors analysed, English, German, and Persian stand out and do not mix with any other groups of languages. But there are three more groups that are mostly robust even when the authors change the algorithm they used for clustering.

Italian, Finnish, Portuguese, Russian, Norwegian, Mandarin and Danish stick together. Polish, Dutch, Spanish, Japanese, French, and Swedish cluster together. And finally, Indonesian, Turkish, Hungarian, Korean, Ukrainian, Czech, Arabic, Romanian, Bulgarian and Serbian show similar patterns.

What is surprising is that these groupings can’t simply be explained by language families, geographical closeness, or cultural similarities. The underlying factor seems to be more related to the characteristics of the community of editors of each language edition.

To test this systematically, the authors considered six factors for each language edition. These included the number of pages, the number of edits, the number of administrators and a measure of the content quality. The other two factors were the total number of active speakers of the language and the level of access they had to the Internet using the international Digital Access Index ranking for the country in which the language is primarily spoken.

These six parameters partially explain the differences between different clusters, but the authors suggest that the clustering of the languages is driven by a more complex combination of socio-economic variables that can capture features such as the average Internet literacy in a country or the general attitude towards the importance of knowledge and education.

The results of this paper become more interesting when compared to an earlier work that looked at the time of the day that edits are mostly committed in each language edition. While generally Wikipedia is edited during the afternoon and early evening, some language editions are being edited more in the morning and some later in the evening.

When you look at these groups of languages, there seem to be similar patterns. Unfortunately, the sets of languages studied in the two works are not the same, and so a direct comparison is not possible.

What this research does is remind us how little we know about how information is being spread on the Internet, what the patterns of the online information landscape are and, more importantly, what factors determine these patterns. The role of the Internet and the information resources it provides in the formation of our opinions, and in the decisions that we make at the individual and societal level, is undeniable. Answering these questions might help us to achieve a more democratic and unbiased global information repository.