NOTE: You can find the RPubs post (which includes the code used to clean data and create plots) here and our RMarkdown code here.
The goal of this project is to explore the Twitter accounts of 2020 presidential candidates by using sentiment analysis. Specifically, we ask: how do presidential candidates differ in their use of Twitter to publicize policy positions, garner support, and participate in the political arena? Are certain types of Tweets more effective at attracting support, as evidenced by a greater number of Tweet interactions?
One way to explore these questions is by using sentiment analysis, or the process of computationally categorizing opinions and emotionality in a piece of text. Some data scientists, for example, have used this technique to argue that the Tweets on Donald Trump’s Twitter timeline are written by more than one person.
Sentiment analysis has yet to be utilized to explore how different candidates use Twitter in different ways. By utilizing sentence-level sentiment analysis on individual Tweets as well as word-by-word analysis using the NRC Word-Emotion Association Lexicon, we’re able to explore these questions. The analysis leverages the power of the sentimentr package to better understand what types of emotions “sell” on Twitter and how candidates use the platform as a mechanism to attract political support.
We begin our analysis with the data collection process. We used Python to gather a large number of Tweets (37,218!) from 12 presidential candidates’ Twitter feeds, using two different methods. The main method relied on Tweepy, a Python library for the Twitter API. We set up API access that allowed us to pull various information about the most recent 3,200 Tweets by a user. The API link can be found here.
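As a rough illustration of the Tweepy step, the sketch below pulls a user's timeline and flattens each Tweet into a plain record. It is not the project's actual script: the function names `to_record` and `fetch_timeline`, the `keys` dictionary, and the record fields are assumptions made for this example, and the live call requires valid Twitter API credentials. The last lines exercise the record format offline with a stand-in object.

```python
from types import SimpleNamespace

def to_record(status):
    """Flatten a Tweepy Status-like object into a plain dict."""
    return {
        "created_at": status.created_at,
        "text": status.full_text,
        "retweets": status.retweet_count,
        "favorites": status.favorite_count,
        # Retweets carry a retweeted_status attribute; original Tweets do not.
        "is_retweet": hasattr(status, "retweeted_status"),
    }

def fetch_timeline(screen_name, keys, limit=3200):
    """Pull up to `limit` recent Tweets for one account (needs credentials)."""
    import tweepy  # imported lazily so to_record works without the library

    auth = tweepy.OAuth1UserHandler(
        keys["consumer_key"], keys["consumer_secret"],
        keys["access_token"], keys["access_token_secret"],
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,              # page size requested per call
        tweet_mode="extended",  # return full_text instead of truncated text
    )
    return [to_record(status) for status in cursor.items(limit)]

# Offline check with a stand-in object instead of a live API call.
fake = SimpleNamespace(created_at="2019-04-01", full_text="Hello",
                       retweet_count=3, favorite_count=7)
record = to_record(fake)
```

Keeping the flattening step separate from the network call makes the record format easy to check without credentials.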
In a later step, we will go further back in time to pull Tweets from 2016 presidential candidates for a comparative analysis. To do this, we use a public GitHub repository that provides a scraper for candidate Tweets. The scraper automatically scrolls through the specified date range on a given candidate’s website and mines data while scrolling.
You can find further documentation of our data pulling process on our GitHub.
After doing some basic data cleaning, we can begin with a preliminary look at the structure of our data. First, how equal is the distribution of data among candidates?
Our dataset is lacking in Tweets from Julian Castro and Joe Biden, and to some extent Cory Booker. This is mostly because those candidates don’t Tweet as much; Joe Biden’s Twitter account only has 1,700 Tweets. But it’s also because some candidates Retweet more than others, and we filter those out.
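The Retweet filter and the per-candidate count can be sketched in a few lines. This is only an illustration with made-up records; the field names (`candidate`, `is_retweet`) and the placeholder candidate labels are assumptions, not the project's schema.

```python
from collections import Counter

# Hypothetical records; field names and values are assumptions for illustration.
tweets = [
    {"candidate": "Candidate A", "text": "Thank you, Iowa!", "is_retweet": False},
    {"candidate": "Candidate A", "text": "RT @someone: Great rally", "is_retweet": True},
    {"candidate": "Candidate B", "text": "See you tomorrow.", "is_retweet": False},
    {"candidate": "Candidate A", "text": "New plan out today.", "is_retweet": False},
]

def count_original_tweets(records):
    """Drop Retweets, then count the remaining Tweets per candidate."""
    originals = [r for r in records if not r["is_retweet"]]
    return Counter(r["candidate"] for r in originals)

counts = count_original_tweets(tweets)
```

A candidate who Retweets heavily ends up with fewer rows after this step, even if their timeline is as long as anyone else's.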
Next, how does the distribution of Tweets over time look?
From this, it's clear that Andrew Yang Tweets far more than everyone else: most of his Tweets in our dataset (>2,000) are from the last three months (February, March, and April).
Sentiment Analysis Using sentimentr
Following this preliminary look at our data, we can begin with what we’re really interested in: sentiment analysis of candidate Tweets.
We initially performed our analysis using the syuzhet package in R, but found its output to be imprecise and misleading. This was because the syuzhet library was specifically made for sentiment analysis of fictional text (and the domain specificity of sentiment analyses is very important). We also ran into problems because the syuzhet library failed to account for valence shifters (such as “not”, “very”, or “doesn’t”).
The sentimentr package addresses both of these concerns; it is fine-tuned for sentence-level analysis, has been used on Tweets in prior analyses, and it accounts for contextual valence shifters.
As an example, the syuzhet package would code the sentence “I don’t hate ice cream” as relatively negative, because the keyword hate determines the overall polarity of the sentence. The sentimentr package accounts for the negator don’t and reverses the polarity score to be slightly positive. For a more thorough review of sentimentr, see this post from its creator.
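To make the difference concrete, here is a toy comparison written in Python rather than R. It is not how syuzhet or sentimentr are implemented: the four-word lexicon, the negator list, and the rule "flip the sign when the previous word is a negator" are simplifying assumptions, and sentimentr's real weighting is more elaborate.

```python
# Toy lexicon and negator list; both are illustrative assumptions,
# far smaller than what either R package actually uses.
POLARITY = {"hate": -1.0, "love": 1.0, "great": 1.0, "terrible": -1.0}
NEGATORS = {"not", "don't", "doesn't", "never"}

def keyword_score(sentence):
    """Sum word polarities, ignoring context (the naive approach)."""
    words = sentence.lower().replace(".", "").split()
    return sum(POLARITY.get(w, 0.0) for w in words)

def negation_aware_score(sentence):
    """Flip a polarized word's sign when the previous word is a negator."""
    words = sentence.lower().replace(".", "").split()
    total = 0.0
    for i, word in enumerate(words):
        value = POLARITY.get(word, 0.0)
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value
        total += value
    return total

naive = keyword_score("I don't hate ice cream")
aware = negation_aware_score("I don't hate ice cream")
```

In this sketch the keyword-only score comes out negative and the negation-aware score comes out positive, which is the qualitative behavior described above.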
Sentence- and Tweet-Level Sentiment Analysis
First, we can explore overall sentiment by depicting what proportion of sentences fall into the categories of negative (sentiment < 0), positive (sentiment > 0), or neutral (sentiment = 0).
Individual sentences tend to be overwhelmingly positive. Does a different trend arise when we look at entire Tweets (by averaging the scores of each individual sentence)?
This reveals that there is less neutrality in Tweets than there is in sentences. This makes sense when we consider that the likelihood of a Tweet being composed entirely of neutral sentences (sentiment = 0) is relatively low.
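The two levels of aggregation can be sketched as follows. The sentence scores are made up for illustration, and the helper names `label` and `tweet_score` are assumptions; the thresholds and the averaging rule follow the description above.

```python
from statistics import mean

def label(score):
    """Map a sentiment score to negative, neutral, or positive."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"

def tweet_score(sentence_scores):
    """Tweet-level sentiment as the mean of its sentence scores."""
    return mean(sentence_scores)

# Hypothetical sentence scores for three Tweets.
tweets = [[0.0, 0.4], [0.0], [-0.3, 0.0, 0.0]]

sentence_labels = [label(s) for scores in tweets for s in scores]
tweet_labels = [label(tweet_score(scores)) for scores in tweets]

neutral_sentence_share = sentence_labels.count("neutral") / len(sentence_labels)
neutral_tweet_share = tweet_labels.count("neutral") / len(tweet_labels)
```

A single non-neutral sentence is enough to pull a Tweet's average away from zero, so the neutral share drops when moving from sentences to Tweets.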
An additional question is whether sentiment differs by candidate.
Evidently, Kamala Harris, Kirsten Gillibrand, and Bernie Sanders are almost never neutral. By contrast, Joe Biden, Andrew Yang, and Beto O’Rourke tend to be neutral more often than most candidates.
Aside from the classification into positive, negative, and neutral, we may also be curious which candidates tend to be the most positive and the most negative:
There is considerable variance in positivity across candidates. Pete Buttigieg’s Tweets tend to be, on average, 3.74x more positive than those from Kamala Harris.
Incorporating the NRC Emotion Lexicon
After exploring how sentiment differs by candidate, we can perform additional analyses by looking at the individual words a candidate uses. The inspiration for this decision was the book Text Mining with R, written by Julia Silge and David Robinson.
We use the NRC Word-Emotion Association Lexicon, which is a crowdsourced list of 14,182 words and their most frequent association with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). This allows us to probe further into candidate Twitter feeds to see how candidates differ in their use of specific emotions.
One interesting question is which words garner the greatest Tweet attention, as evidenced by Favorites and Retweets. We answer this question by finding the median number of interactions (Retweets + Favorites) for each word that is not a stop word (stop words are very common words such as “the”, “of”, and “are”) and plotting the most popular.
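A minimal sketch of this computation, assuming each Tweet is a record with its text and interaction counts. The stop-word list, field names, and numbers are all made up for illustration, and the choice to count a word once per Tweet is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import median

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "are", "a", "for", "is", "to", "we"}

# Hypothetical Tweets with made-up interaction counts.
tweets = [
    {"text": "the hearings are a disgrace", "retweets": 900, "favorites": 2100},
    {"text": "we fight for the people", "retweets": 100, "favorites": 400},
    {"text": "the hearings continue", "retweets": 300, "favorites": 700},
]

def median_interactions_by_word(records):
    """Median of (retweets + favorites) over the Tweets containing each word."""
    by_word = defaultdict(list)
    for record in records:
        interactions = record["retweets"] + record["favorites"]
        # set() so a word repeated within one Tweet is counted once.
        for word in set(record["text"].lower().split()):
            if word not in STOP_WORDS:
                by_word[word].append(interactions)
    return {word: median(values) for word, values in by_word.items()}

medians = median_interactions_by_word(tweets)
top_words = sorted(medians, key=medians.get, reverse=True)
```

The median is less sensitive than the mean to a single viral Tweet, which matters when a word appears in only a handful of Tweets.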
The plot reveals some interesting insights about what topics are especially popular for certain candidates. It seems as if Tulsi Gabbard performs best when she mentions issues related to international security (terrorism, waging, jihadists, genocidal). Cory Booker received the most Retweets and Favorites during the Kavanaugh hearings in September 2018. Joe Biden is an especially interesting case; nearly all of his most popular words come from one phrase: “a battle for the soul of this nation.” He uses that phrase—or some variation of it—a lot.
More generally, words tangentially related to some legal issue in the Trump administration tend to perform the best on Twitter:
By incorporating the NRC lexicon, we can see if any specific emotions “sell.” In other words, we can see if some emotions outperform others when it comes to getting Tweet attention.
The results from this are really consistent: negative emotions—like disgust, anger, sadness, and fear—sell.
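The emotion-level comparison can be sketched the same way: tag each Tweet with the emotions of the words it contains, then take the median interaction count per emotion. The word-to-emotion entries below are illustrative assumptions rather than entries copied from the NRC lexicon, and the interaction counts are invented, so the output says nothing about the real data.

```python
from collections import defaultdict
from statistics import median

# Toy word-to-emotion mapping in the spirit of the NRC lexicon.
# These entries are illustrative assumptions, not copied from the lexicon.
EMOTIONS = {
    "disgrace": {"anger", "disgust"},
    "celebrate": {"joy", "anticipation"},
    "threat": {"fear", "anger"},
}

# Hypothetical Tweets: (text, retweets + favorites).
tweets = [
    ("this is a disgrace", 5000),
    ("today we celebrate", 1000),
    ("a threat to democracy", 4000),
    ("let us celebrate together", 2000),
]

def median_interactions_by_emotion(records):
    """Tag each Tweet with the emotions of its words, then take medians."""
    by_emotion = defaultdict(list)
    for text, interactions in records:
        found = set()
        for word in text.lower().split():
            found |= EMOTIONS.get(word, set())
        for emotion in found:
            by_emotion[emotion].append(interactions)
    return {emotion: median(values) for emotion, values in by_emotion.items()}

by_emotion = median_interactions_by_emotion(tweets)
```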
Even when we stop looking at specific emotions and turn back to our initial sentiment scale of -2 to 2, we find the trend is consistent and statistically significant.
A linear regression of Tweet interactions on sentiment reveals that a one-unit increase in sentiment is associated with 4,492 fewer Tweet interactions (p < 0.0001).
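For readers unfamiliar with how such a coefficient is obtained, the sketch below fits a simple least-squares line by hand. The data points are invented and lie exactly on a line, so the fit recovers a slope of -4000; this is not the project's data or its regression code, and it does not compute a p-value.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) for simple least-squares regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on interactions = 10000 - 4000 * sentiment.
sentiment = [-1.0, -0.5, 0.0, 0.5, 1.0]
interactions = [14000.0, 12000.0, 10000.0, 8000.0, 6000.0]

intercept, slope = ols_fit(sentiment, interactions)
```

The slope is read as the expected change in interactions per one-unit change in sentiment, which is how the coefficient above should be interpreted.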
The Trump Analysis
A final topic of interest is how Donald Trump affects the sentiment of candidate Tweets. Knowing that Donald Trump is a unique president with uniquely low levels of popularity, one may hypothesize that Tweets focusing on or mentioning the president may be different from most others.
By analyzing differences between Tweets which mention Trump and those that don’t, we can understand how the “Trump variable” impacts Tweet behavior and popularity.
First, who talks about Trump? Are some candidates (e.g. those who need to access Trump’s base in order to secure the Democratic nomination) less likely to attack the president than others?
Joe Biden, who pitches himself as a moderate and will likely see electoral success by engaging “Never Trump” voters and moderate conservatives, mentions Trump much less frequently than more “radical” candidates like Bernie Sanders.
Next, how does sentiment change in Tweets that mention Trump compared to those that don’t?
Unsurprisingly, Tweets about Donald Trump are significantly more negative than those about other topics.
Again, our word-by-word analysis allows us to get more fine-tuned than this simple binary classification:
The trend holds true: positive emotions like joy, trust, and anticipation are less common in Tweets that mention Donald Trump. The converse is also true: negative emotions such as disgust, sadness, anger, and fear are more prevalent in Tweets that mention the president.
Aside from emotion, does the “Trump variable” affect Tweet popularity? One may hypothesize that Trump’s historically low popularity may lend itself to higher popularity in Tweets which disparage him.
Tweets which mention the president tend to get about 2.2x as many interactions (13,087) as those which don’t (5,912).
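The comparison behind this figure can be sketched as a split-and-average. The keyword rule in `mentions_trump` is an assumed matching rule, and the Tweets and counts are invented; the final line simply checks the arithmetic on the two averages reported above.

```python
from statistics import mean

def mentions_trump(text):
    """Simple case-insensitive keyword match; an assumed matching rule."""
    return "trump" in text.lower()

# Hypothetical Tweets: (text, retweets + favorites).
tweets = [
    ("Trump's budget cuts hurt families", 12000),
    ("Join us at the rally tonight", 5000),
    ("We will defeat Donald Trump", 14000),
    ("Our plan invests in schools", 7000),
]

with_trump = [n for text, n in tweets if mentions_trump(text)]
without_trump = [n for text, n in tweets if not mentions_trump(text)]

# Ratio of average interactions on the toy data.
ratio = mean(with_trump) / mean(without_trump)

# Ratio implied by the two averages reported in the post.
reported_ratio = 13087 / 5912
```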
We want to continue this work by expanding upon current analyses of 2020 presidential candidates. We are curious about how other variables (such as position in race) affect the use of certain emotions and sentiment, or what kind of differences arise when comparing candidates of different demographic backgrounds.
Finally, we would like to see how Twitter has evolved as a platform by using 2016 presidential candidates’ Twitter feeds as a comparison. We are curious whether Tweet content has changed in a way that reflects general trends in politics, such as growing partisanship, heightened polarization, and more identity-centric notions of politics.