Introduction
The starting point of this datastory is the Quotebank dataset, "an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020" (Quotebank: A Corpus of Quotations from a Decade of News, 2021).
The key idea of the project is that, among this enormous number of quotations, a good portion are likely to refer to someone else. For example, since many of these quotations come from news articles, many of them are likely to come from politicians stating something about their opponents or about other well-known public figures.
From these considerations, we ask ourselves the following questions:
- Is it possible to create a graph where each node is a person and each edge represents a quotation where "person A stated something about person B"?
- What if these edges have a weight that depends on the sentiment of the statement (i.e., the weight is high when a quotation is positive and small when it is negative)?
- Would this particular choice of weight result in groups that can be identified as communities? We believe that people having good interactions (i.e., making positive statements about each other) would end up close together, whereas those having bad interactions would stay apart.
- What communities can we find in the clustered graph (political parties, men/women groups, etc.)?
Which leads us to our project:
Mapping of social communities with NLP analysis of quotes
~ ~ ~
1. Project pipeline
1. Construct the graph with quotes from Quotebank
We will select only the quotes from Quotebank in which a person is mentioned, and identify both the mentioned person and the speaker in the Wikidata database. We then compute a positivity/negativity sentiment score for each quote and construct the graph based on it.
2. Find communities with the Louvain algorithm
Running the Louvain algorithm will allow finding a set of communities based on the edges of the graph and the weights they have. From all the communities found we will consider only the most populated ones.
The goal is to find known and identifiable communities! Remember that the graph is constructed solely from the positivity/negativity of the quotes!
No prior knowledge of the history, personal relationships or identity of our subjects is used.
3. Identify shared characteristics within the communities
We use Wikidata to add information about each person in the graph (e.g., gender, political party, career, etc.).
This will allow us to find common characteristics within each community, based on how these characteristics are distributed among its members.
2. Graph Construction
The graph construction starts with some necessary cleaning of the data, which we won't bother you with here. After this step, all the data is channelled through our 3 factories!
2.1 Quote selection factory
The quote selection factory is where the hardest job is done: recognizing the quotes that contain a mention of someone else and extracting that person's name from the quote.
This process follows two basic rules shown in the parchment!
We need 3 important things:
- There is a recognizable speaker of the quote
- We can recognize a person mentioned in the quote
- Only one person is mentioned (quotes mentioning more than one person are ignored)
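As a rough illustration, the three conditions above can be written as a small filter. This is only a sketch: the function name `select_mention` and the input format are hypothetical, and we assume that the names mentioned in each quote have already been recognized upstream.

```python
from __future__ import annotations

from typing import Optional


def select_mention(speaker: Optional[str], mentioned: list[str]) -> Optional[tuple[str, str]]:
    """Return (speaker, mentioned person) for an eligible quote, else None."""
    # Condition 1: the quote needs a recognizable speaker.
    if not speaker:
        return None
    # Conditions 2 and 3: exactly one distinct person must be mentioned.
    distinct = set(mentioned)
    if len(distinct) != 1:
        return None
    (person,) = distinct
    return speaker, person


# Toy quotes (made up for the example).
quotes = [
    {"speaker": "Joe Biden", "mentioned": ["Donald Trump"]},
    {"speaker": None, "mentioned": ["Donald Trump"]},          # no speaker
    {"speaker": "Donald Trump", "mentioned": []},              # nobody mentioned
    {"speaker": "Mike Pence", "mentioned": ["Joe Biden", "Barack Obama"]},  # too many
]
pairs = [select_mention(q["speaker"], q["mentioned"]) for q in quotes]
eligible = [p for p in pairs if p is not None]
```

Only the first toy quote survives the filter; the other three each break one of the conditions.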
In the video on the right, we have some examples of good quotes and bad quotes.
A big problem during the graph creation was how to resolve generic names, such as "Joe", that appear in a quote.
Indeed, in this case it is hard to decide who is being referred to, since we only have an alias of the mentioned person. Is "Joe" the person "Joe Biden" or "Joe Fletcher"?
The solution we adopted is the following: whenever an alias is shared by several ids/persons, we keep only the alias that points to the most famous id/person. To decide which id/person is the most famous, we look at the other aliases of the same id/person and count the number of times they appear as a speaker.
Despite its simplicity, this solution seemed to be very effective. Try to guess the names in the video on the left!
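The alias rule can be sketched in a few lines. The ids, aliases and speaker counts below are invented for the example, and `resolve_aliases` is a hypothetical helper rather than the code from our notebook.

```python
from collections import defaultdict


def resolve_aliases(aliases_by_id, speaker_counts):
    """Map every alias to the single id judged most famous.

    aliases_by_id:  dict mapping a person id to its list of aliases
    speaker_counts: dict mapping a name to the number of quotes it is the speaker of
    """
    # Collect, for each alias, all ids that claim it.
    candidates = defaultdict(list)
    for pid, aliases in aliases_by_id.items():
        for alias in aliases:
            candidates[alias].append(pid)

    def fame(pid, alias):
        # Fame = how often the *other* aliases of this id appear as a speaker.
        return sum(speaker_counts.get(a, 0) for a in aliases_by_id[pid] if a != alias)

    # For each alias keep the most famous candidate (ties: first id seen).
    return {
        alias: max(pids, key=lambda pid: fame(pid, alias))
        for alias, pids in candidates.items()
    }


# Placeholder ids and made-up counts.
aliases_by_id = {
    "id_biden": ["Joe Biden", "Joe"],
    "id_fletcher": ["Joe Fletcher", "Joe"],
}
speaker_counts = {"Joe Biden": 1200, "Joe Fletcher": 3}
resolved = resolve_aliases(aliases_by_id, speaker_counts)
```

With these toy counts, "Joe" resolves to the Biden placeholder id, while the unambiguous aliases keep their own id.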
2.2 Sentiment analysis factory
This step leverages the NLTK library, which allows us to compute a sentiment score for each quote.
This score is a value between -1 and 1 where positive values represent positive sentiment whereas negative values represent negative sentiment in the quote.
On the right, we can see the distribution of the sentiment score. The mean is 0.25! People tend to be positive when mentioning someone: 60% of our quotes have a sentiment score greater than 0, 17% are equal to 0 and 23% are lower than 0.
In a second step, the sentiment score will be updated considering the number of edges between two nodes.
Indeed, it often happens that two people mention each other multiple times. In that case the degree of these two nodes/persons is higher than 1. On the left, we can see that many of our nodes have high degrees.
The degree gives us information on the importance of the relationships. We will take this information into account when we compute our weights in the graph.
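How exactly repeated mentions enter the weight is not spelled out here, so the sketch below only shows the bookkeeping: grouping quotes by (speaker, mentioned person) and recording how many there are. Summarizing each pair by its mean score is our assumption for the example, and `aggregate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def aggregate_pairs(quotes):
    """Group quotes by directed pair and summarize them.

    quotes: iterable of (speaker, mentioned, sentiment_score) tuples
    """
    scores = defaultdict(list)
    for speaker, mentioned, score in quotes:
        scores[(speaker, mentioned)].append(score)
    # Assumption: a pair is summarized by its quote count and mean score.
    return {
        pair: {"n_quotes": len(values), "mean_score": mean(values)}
        for pair, values in scores.items()
    }


# Toy data (made up): A mentions B twice, B mentions A once.
toy_quotes = [("A", "B", 0.5), ("A", "B", 0.25), ("B", "A", -0.5)]
summary = aggregate_pairs(toy_quotes)
```

Note that ("A", "B") and ("B", "A") are kept apart, since the edges are directed.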
2.3 Computational Factory
In this last step, we define a function that takes the sentiment score and returns the weight of the corresponding directed edge.
The function we decided to use can be seen on the right. α is a constant that we tuned by hand; in the end, we found that α = 5 gave good results.
In conclusion, for every eligible quote we have the speaker, the mentioned person, the sentiment score of the quote, and the computed distance!
For the year 2020, we have 55 306 eligible quotes out of 1 150 000 quotes, involving 22 958 unique persons!
Oh, that's a lot! Let's keep only the 500 most popular persons. We still retain 16% of our quotes.
Let's add some features to our persons to make them more interesting! We will use Wikidata to add gender, political party, academic degree, religion, nationality, ethnic group, career and age. We filled in the missing values by hand, using information found online.
500 persons... that was a lot of work!
3. Statistics on the graph
3.1 Popularity
The first thing we want to check is the distribution of the popularity of people in the graph. We will consider someone popular based on the number of times she/he gets mentioned and on the number of times she/he mentions someone.
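This notion of popularity is easy to sketch: count every appearance of a person, either as speaker or as mentioned person, and keep the top k. The helper names are hypothetical, and keeping only the quotes whose two endpoints are both in the top k is our assumption about how the graph is restricted.

```python
from collections import Counter


def top_popular(edges, k):
    """Return the k most popular persons.

    edges: iterable of (speaker, mentioned) pairs, one per quote
    """
    popularity = Counter()
    for speaker, mentioned in edges:
        popularity[speaker] += 1    # times this person mentions someone
        popularity[mentioned] += 1  # times this person gets mentioned
    return [person for person, _ in popularity.most_common(k)]


def restrict_edges(edges, persons):
    """Keep only the quotes whose speaker and mentioned person are both kept."""
    kept = set(persons)
    return [(s, m) for s, m in edges if s in kept and m in kept]


# Toy quotes (made up).
toy_edges = [("A", "B"), ("C", "A"), ("A", "D"), ("B", "C"), ("B", "A")]
top2 = top_popular(toy_edges, k=2)
kept_edges = restrict_edges(toy_edges, top2)
```

In the datastory itself, k would be 500.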
We observe that the most popular people are Donald Trump, Joe Biden, Hillary Clinton and Barack Obama, as we would expect.
3.2 Characteristics distribution for the top 500
Let's now observe how the different characteristics (e.g., gender, career, ethnicity, etc.) are distributed over the 500 most popular persons! The unknown category of each characteristic is not shown.
Our population is not evenly distributed. Our stereotypical person is a white Christian American man who works as a Republican politician.
3.3 Fun facts
We are curious now to see how different groups of people interact! Let's see what general sentiment score they have when speaking about each other!
3.3.1 Women and men
We start by considering the groups defined by gender: women and men.
The sentiment score values in the graph show that, in general, men and women tend to speak better about their own gender.*
Since the difference might look small, we decided to run a t-test. The difference was statistically significant at a significance level of 0.01.
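For readers who want to see what such a test computes, here is a minimal sketch of the t statistic for two groups of sentiment scores. We use Welch's variant (unequal variances) as an assumption, since the text above does not say which t-test was run, and the numbers are toy values rather than our data. Turning the statistic into a p-value additionally requires the Student t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, dof


# Toy samples (made up), standing in for two groups of sentiment scores.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```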
3.3.2 Trump and Biden
By comparing the sentiment scores of the quotes about Joe Biden and Donald Trump, we can see some differences:
- Republicans tend to speak more positively about Donald Trump than about Joe Biden (and vice versa for Democrats)
- Biden and Trump each tend to speak better about their own political party than about the opposing one
- In general, people speak more positively about Biden than about Trump
Other interesting values can be found in the figure to the right!
*It is important to remember that women and men in Quotebank are not representative of the real population.
4. Community detection
Finally, we run the Louvain algorithm for detecting the communities in our graph.
Remember that no prior knowledge of the history, personal relationships or identity of our subjects is used in this community detection: only the sentiment score is.
Below we can observe the fascinating result:
We can observe four main communities:
- Blue community (number 1 in our notebook)
- Red community (number 0 in our notebook)
- Orange community (number 2 in our notebook)
- Green community (number 7 in our notebook)
Feel free to zoom in on the graph if you are interested in someone in particular!
The size of each node and of its label is proportional to the popularity of the respective person: the more popular he/she is, the bigger the name on the graph!
We can immediately see how the names of Joe Biden and Donald Trump pop up in the graph. Since this graph was obtained using the quotes from 2020, the year of the presidential election, this doesn't surprise us.
However, it is interesting to notice already that they seem to be the main nodes of two communities: the blue one and the red one.
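To give an idea of what this step looks like in code, here is a minimal sketch using the Louvain implementation shipped with the networkx library (version 2.8 or later). It is not the code from our notebook: the edge list is a toy example, and using an undirected graph is a simplifying assumption.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy edge list: (person A, person B, weight). All values are made up.
toy_edges = [
    ("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),  # first friendly group
    ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),  # second friendly group
    ("c", "d", 0.1),                                     # weak link between the groups
]

# Assumption: an undirected graph stands in for the directed one.
graph = nx.Graph()
graph.add_weighted_edges_from(toy_edges)

# Louvain greedily maximizes modularity; the seed makes the run reproducible.
communities = louvain_communities(graph, weight="weight", seed=0)

# Keep only the most populated communities, as described in the pipeline.
largest = sorted(communities, key=len, reverse=True)[:4]
```

On this toy graph the two tightly connected triangles end up in separate communities, because the single weak edge between them contributes very little to the modularity.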
5. Community Analysis
The last step is to analyse the communities we obtained and to check whether we can recognize some real-world political, cultural or sports groups in them!
Before doing this, we want to understand which characteristics best describe the communities. Specifically, we want to understand whether some characteristics (e.g., gender, party...) can define a community. By running a chi-squared analysis, we could quantify the importance of each characteristic:
We can say that nationality, party and career are the characteristics that are the most important to explain our communities.
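As a reminder of what the chi-squared statistic measures, the sketch below computes it for a contingency table of communities (rows) against the categories of one characteristic (columns). The counts are invented, and `chi_squared` is a hypothetical helper: the larger the statistic, the more the characteristic's distribution differs between communities.

```python
def chi_squared(table):
    """Return Pearson's chi-squared statistic and its degrees of freedom.

    table: list of rows, each a list of observed counts
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if community and characteristic were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up counts: 2 communities (rows) x 2 parties (columns).
toy_table = [[10, 20], [20, 10]]
statistic, dof = chi_squared(toy_table)
```

Here every expected count is 15, so the statistic is 4 × 25/15 ≈ 6.67 with 1 degree of freedom.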
By comparing the percentage of each feature in the entire graph with its percentage in each community, we can characterize the group that each community represents. Our results are promising!
- The red community is mainly populated by political figures from the Republican Party
- The blue community is mainly populated by political figures from the Democratic Party
- The orange community is mainly populated by people with a cultural career (i.e., in the world of music, cinema and literature)
- The green community is mainly populated by sportsmen
We are now curious to see which famous people belong to which communities.
Swipe to see!
6. Conclusion
The 2020 quotes that we used allowed us to identify 4 realistic social communities: the Republican, the Democratic, the Sports and the Cultural communities. For the Republicans, we saw Donald Trump as the leader, together with Mike Pence. Likewise, for the Democrats we saw Joe Biden next to other well-known Democratic figures. The king of sports is, unsurprisingly, LeBron James, aka "the King". We could almost argue that this alone validates our method. We were not surprised to see Kim Kardashian hanging out with Harry Potter in the cultural community. We also had surprising results: for example, Roger Federer appears to be a fellow supporter of Trump. Maybe in some cases we are extrapolating a bit too much.
To go further, we tested our method on the Quotebank data for the years 2015 to 2020, which can be seen in our notebook. Again our results are convincing: we were able to identify political and other realistic social communities.
This leads us to say that, from how we speak and whom we speak about, it is possible to understand which community we belong to! To conclude: watch out, we can find out who you are aligned with, without you saying so explicitly, using only simple open-source tools for sentiment analysis, entity recognition and community detection with the Louvain method. Who knows what would be possible with more precise NLP tools and with more accessible, real-time data such as that found on social media platforms, where the speaker and the mentioned person are explicitly known?