Image credit: shotstash.com

Exploring UK MPs' Twitter Use Before and After the Introduction of the new 280 Character Limit on 7 November 2017

Introduction

On 7 November 2017 Twitter increased the character limit for tweets from 140 to 280 characters. This project explores how this change affected the Twitter use of the United Kingdom’s Members of Parliament (MPs) in the House of Commons.

This analysis is interesting for two reasons. First, Twitter owes much of its popularity to the forced conciseness of its messages, and for many years it refused to allow users more space. Second, Twitter has become an increasingly important tool for political communication, as politicians use it to communicate with their constituents and the media (President Trump even uses it to communicate with other world leaders).

The project’s primary aim is to conduct an Exploratory Data Analysis (EDA) of the change in tweeting behaviour following the limit increase. It aims to uncover interesting patterns in the data by investigating the differences between individual MPs and between the parties.

Objectives

Technical

  1. Demonstrate the management of data in a real-world data analysis project.

  2. Conduct an Exploratory Data Analysis aimed at getting an idea of what the data looks like and whether interesting patterns for future analyses emerge.

Substantive

  1. Analyse how the change to 280 characters affected how many characters UK MPs use in their tweets.

  2. To put the use of characters in tweets in perspective, explore how the MPs relate to each other on Twitter by performing a social network analysis.

  3. Conduct a preliminary text analysis, since the increase in characters could be due to different topics dominating Twitter before and after 7 November 2017.

Structure

To achieve these objectives, the project is structured in three parts. The first part outlines how the data is obtained, how it is cleaned in preparation for the analysis and how it is stored for subsequent use. The data gathering and cleaning part will be conducted in R.

In the second part an exploratory data analysis is conducted using Python.

Finally, in the third part a preliminary text analysis is conducted using the quanteda package for R.

1. Download Data

# Import libraries
# We will use the rtweet package to access the Twitter API.
library("httpuv")
library("rtweet")
suppressPackageStartupMessages(library("tidyverse"))
library("feather")
library("readxl")
suppressPackageStartupMessages(library("lubridate"))

1.1 Obtaining the data from the Twitter API

On Sunday 14 January 2018, up to 3,000 of the most recent tweets were downloaded for each UK MP whose Twitter handle was available. The Twitter handles were obtained from http://www.mpsontwitter.co.uk/list.

# read in list of UK MPs twitter handles
uk_mps_twitter_list <- read_excel("uk_mps_twitter_list.xlsx")

# create twitter authentication
twitter_token <- create_token(
  app = "rtweet_token",
  consumer_key = "", # consumer_key omitted
  consumer_secret = "") # consumer_secret_omitted

# download the data in 10 chunks, with added system sleep to not hit the download rate limit
df1 <- tibble()
for (i in uk_mps_twitter_list[1:60, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df1 <- bind_rows(df1, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df2 <- tibble()
for (i in uk_mps_twitter_list[61:120, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df2 <- bind_rows(df2, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df3 <- tibble()
for (i in uk_mps_twitter_list[121:180, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df3 <- bind_rows(df3, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df4 <- tibble()
for (i in uk_mps_twitter_list[181:240, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df4 <- bind_rows(df4, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df5 <- tibble()
for (i in uk_mps_twitter_list[241:300, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df5 <- bind_rows(df5, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df6 <- tibble()
for (i in uk_mps_twitter_list[301:360, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df6 <- bind_rows(df6, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df7 <- tibble()
for (i in uk_mps_twitter_list[361:420, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df7 <- bind_rows(df7, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df8 <- tibble()
for (i in uk_mps_twitter_list[421:480, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df8 <- bind_rows(df8, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df9 <- tibble()
for (i in uk_mps_twitter_list[481:540, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000)
  df9 <- bind_rows(df9, x)
}

Sys.sleep(60 * 16) # system sleep for 16 minutes to avoid getting rate limited

df10 <- tibble()
for (i in uk_mps_twitter_list[541:557, ]$twitter_handle) {
  print(i)
  x <- get_timelines(c(i), n = 3000) # needs at least 2000 tweets per MP, better more (3200 max)
  df10 <- bind_rows(df10, x)
}

timelines <- bind_rows(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10)
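
The ten near-identical loops above all follow one pattern: split the list of handles into batches of 60 and pause for 16 minutes between batches. As an illustration only, here is a minimal Python sketch of that pattern. The `fetch` argument is a hypothetical stand-in for the API call (get_timelines in the R code) and the handles are placeholders, so nothing is actually downloaded.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def download_in_batches(handles: Sequence[str],
                        fetch: Callable[[str], list],
                        batch_size: int = 60,
                        pause_seconds: int = 16 * 60,
                        sleep: Callable[[float], None] = time.sleep) -> list:
    """Call `fetch(handle)` for every handle, pausing between batches.

    `fetch` is a hypothetical placeholder for the real API call.
    """
    results = []
    batches = list(chunked(handles, batch_size))
    for index, batch in enumerate(batches):
        for handle in batch:
            results.extend(fetch(handle))
        if index < len(batches) - 1:  # no pause needed after the final batch
            sleep(pause_seconds)
    return results


# Example with placeholder handles and a fake fetch function (no network access).
handles = [f"mp_{i}" for i in range(557)]
batch_sizes = [len(batch) for batch in chunked(handles, 60)]
pauses = []  # records the requested pauses instead of actually sleeping
tweets = download_in_batches(handles, fetch=lambda h: [h], sleep=pauses.append)
print(batch_sizes)  # nine batches of 60 handles and a final batch of 17
print(len(pauses))  # one pause between each pair of consecutive batches
```

Passing the sleep function as a parameter keeps the sketch testable without waiting; in a real run the default `time.sleep` would be used.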


Raw Data Overview

After downloading, the raw data contains 1,441,438 tweets from 533 MPs. It consists of a total of 41 metadata variables besides the tweet texts. Roughly half of the tweets are retweets, while 751,012 tweets were actually sent by the MPs themselves.

1.2 Cleaning the data

Only tweets posted from one year before 7 November 2017 onwards are selected. Retweets are removed from the data, as we are interested in how many characters tweets contain when MPs write them themselves. Columns are added for the number of characters, the number of hashtags, and the number of mentions in each tweet. The MPs’ parties are added as well. For the subsequent analysis the following columns are selected:

  • screen_name: The twitter handle used by the MPs.

  • created_at: The date of the tweet.

  • text: The tweet’s text.

  • nchar: The number of characters used in each tweet.

  • retweet_count: The number of times each tweet was retweeted.

  • number_of_hashtags: The number of hashtags that a tweet contains.

  • number_of_mentions: The number of mentions that a tweet contains.

  • party: The party the MP belongs to.

# remove retweets from the data
timelines <- filter(timelines, is_retweet != TRUE)

# add date
timelines <- mutate(timelines,
                    year = year(timelines$created_at),
                    month = month(timelines$created_at),
                    day = day(timelines$created_at),
                    date = make_date(year, month, day))

# choose only tweets starting one year before 7 November 2017
timelines <- subset(timelines, date > as.Date("2016-11-07"))

# replace &amp; characters with &
timelines$text <- gsub("&amp;", "&", timelines$text)

# add number of characters per tweet
timelines <- timelines %>%
  mutate(nchar = nchar(text)) # count the characters in each tweet (TODO: ignore emojis and links)

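# add number of hashtags
# (if a tweet has none, the list entry may be a single NA, which length() would count as 1,
#  so count the non-missing entries instead)
timelines$number_of_hashtags <- sapply(timelines$hashtags, function(x) sum(!is.na(x)))

# add number of mentions of other users (e.g. @theresa_may @BorisJohnson)
timelines$number_of_mentions <- sapply(timelines$mentions_screen_name, function(x) sum(!is.na(x)))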

# add party of MP from uk_mps_twitter_list
party_names <- uk_mps_twitter_list %>% select(twitter_handle, party)
party_names$twitter_handle <- gsub("@", "", party_names$twitter_handle) # remove @ from twitter_handle
names(party_names) <- c("screen_name", "party")
print(party_names)
timelines <- merge(timelines, party_names, by = "screen_name")
print(timelines)

# select variables we need
timelines_selected_variables <- timelines  %>%
  select(screen_name, created_at, text, nchar, retweet_count, number_of_hashtags, number_of_mentions, party)
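
For readers who follow the Python part more easily than the R part, the cleaning steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used for the analysis: tweets are represented as dictionaries, the field names mirror the columns above, tweets without hashtags or mentions carry an empty list, and the example tweets are invented.

```python
from datetime import date, datetime

CUTOFF = date(2016, 11, 7)  # keep tweets posted after this date, as in the R code


def clean_tweets(raw_tweets):
    """Drop retweets and old tweets, then add the derived count columns."""
    cleaned = []
    for tweet in raw_tweets:
        if tweet["is_retweet"]:
            continue  # only tweets written by the MP are of interest
        if tweet["created_at"].date() <= CUTOFF:
            continue  # older than one year before the change
        text = tweet["text"].replace("&amp;", "&")  # same replacement as gsub() above
        cleaned.append({
            "screen_name": tweet["screen_name"],
            "created_at": tweet["created_at"],
            "text": text,
            "nchar": len(text),
            "number_of_hashtags": len(tweet["hashtags"]),
            "number_of_mentions": len(tweet["mentions_screen_name"]),
        })
    return cleaned


# Invented example tweets for illustration.
raw = [
    {"screen_name": "example_mp", "created_at": datetime(2017, 11, 8, 9, 30),
     "text": "Health &amp; social care debate today #NHS", "is_retweet": False,
     "hashtags": ["NHS"], "mentions_screen_name": []},
    {"screen_name": "example_mp", "created_at": datetime(2017, 11, 8, 10, 0),
     "text": "RT @someone: hello", "is_retweet": True,
     "hashtags": [], "mentions_screen_name": ["someone"]},
    {"screen_name": "example_mp", "created_at": datetime(2016, 1, 1, 12, 0),
     "text": "An old tweet", "is_retweet": False,
     "hashtags": [], "mentions_screen_name": []},
]
cleaned = clean_tweets(raw)
print(len(cleaned), cleaned[0]["nchar"])
```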

1.3 Storing the data

Now we face the question of how to transfer the data to Python. Since Python cannot read R data frames directly, we need to save the data in R and then open it in Python. One way to do this would be to save the objects as .csv files. However, .csv files are rather large, and column types can be lost or misread depending on the options used when writing and reading the data. Hence, we will use the feather package, which allows for faster storage and avoids these data conversion problems (for a comparison of feather vs readr::write_csv() see this blog post). The resulting file ‘timelines_final.feather’ can be found in the same repository as this Jupyter Notebook.

# save the full downloaded data as feather object
write_feather(timelines_selected_variables, "timelines_final.feather")
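
To make the data conversion concern concrete, the following small Python sketch writes one invented row to CSV in memory and reads it back. It uses only the standard library and is meant as an illustration of the general problem, not of the project's actual files: after the round trip, the timestamp and the character count are both plain strings and would have to be parsed again.

```python
import csv
import io
from datetime import datetime

# One invented row with a timestamp and an integer column.
rows = [{"screen_name": "example_mp",
         "created_at": datetime(2017, 11, 8, 9, 30),
         "nchar": 151}]

# Write the row to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["screen_name", "created_at", "nchar"])
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV stores text only, so the original types are gone.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(type(rows[0]["nchar"]).__name__, "->", type(restored[0]["nchar"]).__name__)
print(type(rows[0]["created_at"]).__name__, "->", type(restored[0]["created_at"]).__name__)
```

A typed binary format such as feather stores the column types alongside the values, which is why no re-parsing step is needed on the Python side.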

2. Data Analysis in Python

# import modules
import datetime
import feather
import matplotlib.pyplot as plt
import matplotlib.style as ms
ms.use('seaborn') # use seaborn style for all plots
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns

# set max row display in pandas
pd.options.display.max_rows = 100

Read in the data

The data is imported from R using feather. To get an idea of what the data looks like, the first five rows in the dataset can be displayed with .head().

# import timelines data
timelines = feather.read_dataframe("timelines_final.feather")

# first five rows
timelines.head()

Relationships between variables

The relationships between variables are shown below. Scatterplots are drawn for joint relationships and histograms for univariate distributions. The data points are colored by party.

sns.pairplot(timelines, hue='party')

2.1 Analyse change in character use before and after 7 November 2017

2.1.1 First look: MPs of special interest

First, let us get a sense of how many characters individual MPs used in their tweets starting from one year prior to the change to 280 characters. To this end, eight MPs of special interest are plotted, purely to get an idea of how much their Twitter behaviour changed once they were allowed to use 280 characters. These MPs are the Prime Minister, the Foreign Minister and six party leaders.

  • Theresa May (PM) - @theresa_may
  • Boris Johnson (Foreign Minister) - @BorisJohnson
  • Jeremy Corbyn (Labour leader) - @jeremycorbyn
  • Vince Cable (Lib Dem leader) - @vincecable
  • Ian Blackford (SNP leader) - @IanBlackfordMP
  • Nigel Dodds (DUP leader) - @NigelDoddsDUP
  • Liz Saville Roberts (Plaid Cymru leader) - @LSRPlaid
  • Caroline Lucas (Green Party leader) - @CarolineLucas

# sort time variable
timelines = timelines.sort_values(by='created_at', ascending=True)

# set plot parameters
fig, axes = plt.subplots(ncols=2, nrows=4, figsize=(14, 16))

# plot MPs
axes[(0,0)].plot(timelines[timelines.screen_name == 'theresa_may']['created_at'],
         timelines[timelines.screen_name == 'theresa_may']['nchar'],
         color='b')
axes[(0,0)].set_title('theresa_may')
axes[(0,0)].set_ylabel('Number of Characters in Tweets')
axes[(0,0)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(0,0)].set_ylim([0,320])
axes[(0,0)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')


axes[(0,1)].plot(timelines[timelines.screen_name == 'BorisJohnson']['created_at'],
         timelines[timelines.screen_name == 'BorisJohnson']['nchar'],
         color='b')
axes[(0,1)].set_title('BorisJohnson')
axes[(0,1)].set_ylabel('Number of Characters in Tweets')
axes[(0,1)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(0,1)].set_ylim([0,320])
axes[(0,1)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(1,0)].plot(timelines[timelines.screen_name == 'jeremycorbyn']['created_at'],
         timelines[timelines.screen_name == 'jeremycorbyn']['nchar'],
         color='r')
axes[(1,0)].set_title('jeremycorbyn')
axes[(1,0)].set_ylabel('Number of Characters in Tweets')
axes[(1,0)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(1,0)].set_ylim([0,320])
axes[(1,0)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(1,1)].plot(timelines[timelines.screen_name == 'vincecable']['created_at'],
         timelines[timelines.screen_name == 'vincecable']['nchar'],
         color='orange')
axes[(1,1)].set_title('vincecable')
axes[(1,1)].set_ylabel('Number of Characters in Tweets')
axes[(1,1)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(1,1)].set_ylim([0,320])
axes[(1,1)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(2,0)].plot(timelines[timelines.screen_name == 'IanBlackfordMP']['created_at'],
         timelines[timelines.screen_name == 'IanBlackfordMP']['nchar'],
         color='y')
axes[(2,0)].set_title('IanBlackfordMP')
axes[(2,0)].set_ylabel('Number of Characters in Tweets')
axes[(2,0)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(2,0)].set_ylim([0,320])
axes[(2,0)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(2,1)].plot(timelines[timelines.screen_name == 'NigelDoddsDUP']['created_at'],
         timelines[timelines.screen_name == 'NigelDoddsDUP']['nchar'],
         color='m')
axes[(2,1)].set_title('NigelDoddsDUP')
axes[(2,1)].set_ylabel('Number of Characters in Tweets')
axes[(2,1)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(2,1)].set_ylim([0,320])
axes[(2,1)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(3,0)].plot(timelines[timelines.screen_name == 'LSRPlaid']['created_at'],
         timelines[timelines.screen_name == 'LSRPlaid']['nchar'],
         color='lightgreen')
axes[(3,0)].set_title('LSRPlaid')
axes[(3,0)].set_ylabel('Number of Characters in Tweets')
axes[(3,0)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(3,0)].set_ylim([0,320])
axes[(3,0)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

axes[(3,1)].plot(timelines[timelines.screen_name == 'CarolineLucas']['created_at'],
         timelines[timelines.screen_name == 'CarolineLucas']['nchar'],
         color='g')
axes[(3,1)].set_title('CarolineLucas')
axes[(3,1)].set_ylabel('Number of Characters in Tweets')
axes[(3,1)].set_xlim(['2016-11-07 00:00:00+00:00','2017-12-21 00:00:00+00:00'])
axes[(3,1)].set_ylim([0,320])
axes[(3,1)].axvline('2017-11-07 00:00:00+00:00', color = 'indianred', linestyle='dashed')

As we can see in the plot above, the number of characters used in a tweet varies a lot.

Some tweets have more than 140 characters before 7 November and more than 280 characters afterwards. The reason is that Twitter does not count some links towards the character limit, for example links to images, GIFs or videos. Links to websites do count towards the limit, however. For the purpose of our exploratory data analysis it is sufficient to work with the tweet texts as they are, as this should even out across MPs.

From this graph alone, clear differences between MPs’ Twitter behaviour are already visible. For example, Jeremy Corbyn made use of the 280 characters as soon as the feature was available, whereas other MPs only increased their number of characters gradually over the following days. Many MPs still send tweets that are well under the character limit, even though they now have more space available. The politician who now uses more than 140 characters most consistently is Foreign Minister Boris Johnson.
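
The point above about links that do not count towards the limit suggests one possible refinement of the character count, which this analysis deliberately does not apply. The following Python sketch drops a single shortened link at the end of a tweet before counting. It rests on an assumption that has not been checked against the data here, namely that attached media show up as one trailing `https://t.co/...` link; the function name and the example tweet are hypothetical.

```python
import re

# Assumption: an attached image, GIF or video appears as a single shortened
# link at the very end of the tweet text.
TRAILING_LINK = re.compile(r"\s*https?://t\.co/\w+\s*$")


def nchar_without_trailing_link(text: str) -> int:
    """Count characters after dropping one trailing t.co link, if present."""
    return len(TRAILING_LINK.sub("", text))


# Invented example tweet.
tweet = "Great visit to the local school today https://t.co/abc123XYZ"
print(len(tweet), nchar_without_trailing_link(tweet))
```

Links in the middle of a tweet are left untouched, in line with the observation that ordinary website links do count towards the limit.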

2.1.2 All MPs combined

The next step of the exploratory data analysis is to plot the data for all MPs combined, to get a sense of the spread. The two histograms below confirm that the pattern observed for the eight MPs above holds for the entire data set: some MPs use a low number of characters even once they have 280 characters available, while others make extensive use of the new space.

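# calculate the means before and after
# (numeric_only=True restricts the mean to numeric columns; recent pandas versions
#  raise an error on the text columns otherwise)
before = timelines[timelines.created_at < '2017-11-07 00:00:00+00:00'].groupby(['screen_name']).mean(numeric_only=True)

after = timelines[timelines.created_at >= '2017-11-07 00:00:00+00:00'].groupby(['screen_name']).mean(numeric_only=True)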
after = after.rename(columns={"nchar": "nchar_after", 'retweet_count': 'retweet_count_after',
                              'number_of_hashtags': 'number_of_hashtags_after', 'number_of_mentions': 'number_of_mentions_after'})

# set up the matplotlib figure
fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(15, 5))

sns.distplot(before['nchar'], rug=True, color="b", ax = axes[0]) # histogram, kernel density estimate and rug plot
axes[0].set_title('Before 280 characters were introduced')
axes[0].set_xlabel('Mean number of characters used in tweets by MP')
axes[0].set_xlim([0,280])
axes[0].set_ylim([0,0.03])
axes[0].axvline(140, color = 'indianred', linestyle='dashed')

sns.distplot(after['nchar_after'], rug=True, color="g", ax = axes[1]) # histogram, kernel density estimate and rug plot
axes[1].set_title('After 280 characters were introduced')
axes[1].set_xlabel('Mean number of characters used in tweets by MP')
axes[1].set_xlim([0,280])
axes[1].set_ylim([0,0.03])
axes[1].axvline(140, color = 'indianred', linestyle='dashed')
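
The `before` and `after` tables used for the histograms hold one mean per MP for each period. To spell out what the groupby and mean steps compute, here is a minimal sketch in plain Python without pandas. The tweets are invented and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

CHANGE = datetime(2017, 11, 7)  # start of the 280 character limit


def mean_nchar_by_mp(tweets):
    """Return two dicts: mean tweet length per MP before and after the change."""
    before, after = defaultdict(list), defaultdict(list)
    for tweet in tweets:
        bucket = before if tweet["created_at"] < CHANGE else after
        bucket[tweet["screen_name"]].append(tweet["nchar"])
    return ({mp: mean(values) for mp, values in before.items()},
            {mp: mean(values) for mp, values in after.items()})


# Invented example data.
example_tweets = [
    {"screen_name": "mp_a", "created_at": datetime(2017, 6, 1), "nchar": 100},
    {"screen_name": "mp_a", "created_at": datetime(2017, 9, 1), "nchar": 120},
    {"screen_name": "mp_a", "created_at": datetime(2017, 12, 1), "nchar": 200},
    {"screen_name": "mp_b", "created_at": datetime(2017, 3, 1), "nchar": 130},
    {"screen_name": "mp_b", "created_at": datetime(2017, 11, 20), "nchar": 90},
    {"screen_name": "mp_b", "created_at": datetime(2017, 12, 5), "nchar": 110},
]
mean_before, mean_after = mean_nchar_by_mp(example_tweets)
print(mean_before)
print(mean_after)
```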

2.1.3 Change in number of characters for MPs

The next question is whether the MPs who used a large number of characters before 7 November 2017 are also the ones using the most characters afterwards. In other words, we want to know whether the change actually inspired MPs who used very few characters to write more elaborate tweets, or whether those who already sent long tweets are simply the ones making the most use of the new space. To get an idea, the scatterplot below shows all MPs, with a different color for each party.

# concatenate before and after
concatenated = pd.concat([before, after], axis=1)

# need to convert index to column screen_name for merge
concatenated['screen_name'] = concatenated.index

# get parties of MPs
party = timelines[['screen_name', 'party']]
party = party.drop_duplicates()

# merge
concatenated_with_party = pd.merge(concatenated, party, on='screen_name')
concatenated_with_party.head()

# add party colors
concatenated_with_party['color'] = 'gray' # create color line
concatenated_with_party.loc[concatenated_with_party['party'] == 'Conservative', 'color'] = 'b'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Labour', 'color'] = 'red'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Democratic Unionist Party', 'color'] = 'magenta'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Green Party', 'color'] = 'green'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Liberal Democrat', 'color'] = 'orange'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Plaid Cymru', 'color'] = 'lightgreen'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Sinn Fein', 'color'] = 'black'
concatenated_with_party.loc[concatenated_with_party['party'] == 'Scottish National Party', 'color'] = 'yellow'

# scatterplot
fig, axes = plt.subplots(figsize=(12, 10))
c  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Conservative'], c='color')
l  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Labour'], c='color')
d  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Democratic Unionist Party'], c='color')
g  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Green Party'], c='color')
ld = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Liberal Democrat'], c='color')
p  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Plaid Cymru'], c='color')
s  = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Sinn Fein'], c='color')
sn = plt.scatter(x='nchar', y='nchar_after', data=concatenated_with_party[concatenated_with_party['party'] == 'Scottish National Party'], c='color')

plt.title('Mean number of characters used in tweets by MP', fontsize = 20)
plt.xlabel('Before 7 November 2017 (140 character limit)', fontsize = 14)
plt.ylabel('After 7 November 2017 (280 character limit)', fontsize = 14)

plt.legend(('Conservative', 'Labour', 'Democratic Unionist Party', 'Green Party', 'Liberal Democrat', 'Plaid Cymru', 'Sinn Fein', 'Scottish National Party'),
            scatterpoints=1,
            loc='lower right',
            ncol=2,
            fontsize=8)
plt.show()

Is there a correlation?

Looking at the plot above it seems like there could be a high correlation between the number of characters used before the change and the number of characters after the change. The plot below checks for that.

sns.jointplot(x=concatenated_with_party['nchar'], y=concatenated_with_party['nchar_after'], kind='reg', size = 8)

As expected, Pearson’s r is 0.52, which indicates a substantial positive correlation in this case. The same plot is shown below for each party separately, to see whether this correlation is specific to one party or equally high for all parties.

Conservative Party

cons = concatenated_with_party[concatenated_with_party['party'] == 'Conservative']
sns.jointplot(x=cons['nchar'], y=cons['nchar_after'], kind='reg', color='blue', size=8)

Labour Party

labour = concatenated_with_party[concatenated_with_party['party'] == 'Labour']
sns.jointplot(x=labour['nchar'], y=labour['nchar_after'], kind='reg', color='red', size=8)

Democratic Unionist Party

DUP = concatenated_with_party[concatenated_with_party['party'] == 'Democratic Unionist Party']
sns.jointplot(x=DUP['nchar'], y=DUP['nchar_after'], kind='reg', color='magenta', size=8)

Green Party

The Green Party only has 1 Member of Parliament, hence there is no jointplot for it.

Liberal Democrats

lib_dem = concatenated_with_party[concatenated_with_party['party'] == 'Liberal Democrat']
sns.jointplot(x=lib_dem['nchar'], y=lib_dem['nchar_after'], kind='reg', color='orange', size=8)

Plaid Cymru

plaid_cymru = concatenated_with_party[concatenated_with_party['party'] == 'Plaid Cymru']
sns.jointplot(x=plaid_cymru['nchar'], y=plaid_cymru['nchar_after'], kind='reg', color='lightgreen', size=8)

Sinn Fein

sinn_fein = concatenated_with_party[concatenated_with_party['party'] == 'Sinn Fein']
sns.jointplot(x=sinn_fein['nchar'], y=sinn_fein['nchar_after'], kind='reg', color='black', size=8)

Scottish National Party

SNP = concatenated_with_party[concatenated_with_party['party'] == 'Scottish National Party']
sns.jointplot(x=SNP['nchar'], y=SNP['nchar_after'], kind='reg', color='yellow', size=8)

After plotting all the parties individually we see that indeed the high correlation holds true for all of them.
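
For reference, the correlation coefficient discussed in this subsection is the covariance of the two per-MP means divided by the product of their standard deviations. A minimal sketch of the calculation in plain Python follows; the numbers are invented toy values, not the MPs’ data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Toy values standing in for mean characters before (x) and after (y).
r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_partial = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
print(r_perfect, r_partial)
```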

2.1.4 Mean number of characters by party

Now that we have seen that there is a high correlation within the parties, we still do not know the exact means from before and after. The boxplots below show that Plaid Cymru uses the highest number of characters before 7 November, whereas the Democratic Unionist Party uses the highest number of characters after 7 November. (Not counting the Green Party, which only has 1 MP.)

# convert to long format
long = pd.melt(concatenated_with_party, id_vars='screen_name', value_vars=['nchar', 'nchar_after'])

# merge party name
long_merged = pd.merge(long, party, on='screen_name')

# boxplots
fig, axes = plt.subplots(figsize=(15, 10))
sns.boxplot(x='value', y='party', hue='variable', data=long_merged)
plt.title('Mean number of characters used in tweets by MP', fontsize = 20)
plt.xlabel('Number of characters', fontsize = 14)
plt.ylabel('Party', fontsize = 12)
plt.legend(loc='lower right')
plt.show()

To see the distribution in addition to the boxplots, a split violin plot is drawn. As can be seen below, Plaid Cymru has the highest mean for the entire period under investigation.

# split violin plot
fig, axes = plt.subplots(figsize=(15, 10))
sns.violinplot(x='value', y='party', hue='variable', data=long_merged, split=True)
plt.title('Mean number of characters used in tweets by party', fontsize = 20)
plt.xlabel('Number of characters', fontsize = 14)
plt.ylabel('Party', fontsize = 12)
plt.show()

2.2 Ranking of individual MPs

Further, we would like to know which individual MPs made the most use of the 280 characters and which are actually using fewer characters now. To that end, the difference between the mean after and the mean before the change is calculated for each MP. The MP with the highest increase is Ivan Lewis, who on average used more than 140 additional characters (!) after 7 November 2017. The MP with the largest decrease is Andrew Bridgen, who on average used more than 35 fewer characters.

# concatenate (done above already)
concatenated = pd.concat([before, after], axis=1)
#print(concatenated.head())

# add difference after - before
concatenated['nchar_difference'] = concatenated['nchar_after'] - concatenated['nchar']
#print(concatenated.head())

# sort values
ascending = concatenated.sort_values(by='nchar_difference', ascending=True)
descending = concatenated.sort_values(by='nchar_difference', ascending=False)

# plot increase
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(15, 15))
d = descending['nchar_difference'][:20].plot(kind='barh', color='b', alpha=0.4, ax=axes[0])
d.set_title('The 20 MPs who had the largest increase in characters \nfrom before to after the change to 280', fontsize = 20)
d.set_xlabel('Number of characters', fontsize = 10)

# plot decrease
a = ascending['nchar_difference'][:20].plot(kind='barh', color='r', alpha=0.4, ax=axes[1])
a.set_title('The 20 MPs who had the largest decrease in characters \nfrom before to after the change to 280', fontsize = 20)
a.set_xlabel('Number of characters', fontsize = 10)
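
The ranking logic itself is small enough to sketch without pandas. The version below is an illustration with invented values and a hypothetical function name. It differs from the pandas code in one respect that is worth stating: MPs who tweeted in only one of the two periods are dropped here, whereas in the concatenated table their difference would be missing.

```python
def rank_by_change(before, after):
    """Return (mp, after - before) pairs, largest increase first.

    Only MPs present in both periods are ranked.
    """
    common = before.keys() & after.keys()
    differences = {mp: after[mp] - before[mp] for mp in common}
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


# Invented per-MP means for illustration.
means_before = {"mp_a": 100, "mp_b": 120, "mp_c": 90}
means_after = {"mp_a": 180, "mp_b": 95, "mp_d": 200}
ranking = rank_by_change(means_before, means_after)
print(ranking)
```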

2.3 Going more into detail: What do they use the new space for?

As a next step, the analysis goes into more detail to see whether other aspects of MPs’ Twitter behaviour changed when they were allowed more characters per tweet. It could be expected that, given more characters, MPs would use more hashtags to gain visibility on Twitter, that they would mention other users more often in their tweets, and that tweets with more characters would get more engagement from other users.

2.3.1 Did they use more hashtags?

The barplot below demonstrates that MPs do not seem to use more hashtags. Some parties do show a small increase, while others show a small decrease. Hence, having more space available does not seem to inspire politicians to write tweets filled with lots of hashtags in order to show up on more Twitter feeds.

# get party means of all variables
party_means = concatenated_with_party.groupby('party').mean(numeric_only=True) # skip text columns such as screen_name and color

# get number of hashtags before and after 7 November 2017
hashtags = party_means[['number_of_hashtags', 'number_of_hashtags_after']]

# plot
h = hashtags.plot.bar(edgecolor='none', alpha=0.5, figsize=(12, 8), rot=45, fontsize = 12)
h.set_title('Number of Hashtags Before and After 7 November 2017, by Party', fontsize = 20)
h.set_xlabel('Party', fontsize = 14)
h.set_ylabel('Average Number of Hashtags', fontsize = 14)

2.3.2 Did they mention more usernames?

A different picture emerges when looking at the effects of the increased character limit on the number of mentions of other usernames in tweets. All parties show an increase in user mentions, except the DUP and SNP.

# get number of mentions before and after 7 November 2017
mentions = party_means[['number_of_mentions', 'number_of_mentions_after']]

# plot
m = mentions.plot.bar(edgecolor='none', alpha=0.5, figsize=(12, 8), rot=45, fontsize = 12)
m.set_title('Number of Mentions Before and After 7 November 2017, by Party', fontsize = 20)
m.set_xlabel('Party', fontsize = 14)
m.set_ylabel('Average Number of Mentions', fontsize = 14)

2.3.3 Do tweets with more characters get more engagement?

An even clearer picture emerges for the number of retweets. For all parties except Plaid Cymru, tweets were retweeted more often on average after the 280 character limit was introduced on 7 November 2017. From a politician’s perspective, getting retweeted may be the most important aspect of Twitter, so MPs will likely be happy to have more characters available.

# get number of retweets before and after 7 November 2017
retweets = party_means[['retweet_count', 'retweet_count_after']]

# plot
r = retweets.plot.bar(edgecolor='none', alpha=0.5, figsize=(12, 8), rot=45, fontsize = 12)
r.set_title('Number of Retweets Before and After 7 November 2017, by Party', fontsize = 20)
r.set_xlabel('Party', fontsize = 14)
r.set_ylabel('Average Number of Retweets', fontsize = 14)

3. Visualise Data using quanteda

library("feather")
suppressPackageStartupMessages(library("lubridate"))
suppressPackageStartupMessages(library("quanteda"))
suppressPackageStartupMessages(library("tidyverse"))

Read in the data

Data is imported using the feather package.

timelines <- read_feather("timelines_final.feather") # choose correct one

3.1 Use quanteda to display tweet-networks

Finally, we can analyse some tweet networks. To this end, we use the quanteda package (Benoit, 2018). This analysis aims to show which topics dominated the conversation before and after the change to 280 characters. This information is useful because the topics of conversation could affect how many characters are needed to express views in text.

3.1.1 Compare topics of conversation before and after 7 November 2017

Before 7 November 2017 (140 character limit)

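# subset data before 7 November
# (created_at is a date-time, so compare it against a POSIXct timestamp rather than a Date)
before <- filter(timelines, created_at < as.POSIXct("2017-11-07 00:00:00", tz = "UTC"))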

# plot
tweet_dfm <- dfm(before$text, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
tag_fcm <- fcm(tag_dfm)
topgat_fcm <- fcm_select(tag_fcm, toptag)
textplot_network(topgat_fcm, min_freq = 0.5, edge_alpha = 0.8, edge_size = 5)

After 7 November 2017 (280 character limit)

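# subset data after 7 November
# (created_at is a date-time, so compare it against a POSIXct timestamp rather than a Date)
after <- filter(timelines, created_at >= as.POSIXct("2017-11-07 00:00:00", tz = "UTC"))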

# plot
tweet_dfm <- dfm(after$text, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
tag_fcm <- fcm(tag_dfm)
topgat_fcm <- fcm_select(tag_fcm, toptag)
textplot_network(topgat_fcm, min_freq = 0.5, edge_alpha = 0.8, edge_size = 5)
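
The hashtag networks in this subsection rest on a simple idea: find the most frequent hashtags, then count how often each pair of them appears in the same tweet. The following Python sketch shows that idea with the standard library only. It is a simplified stand-in, not a reimplementation of quanteda: hashtags are matched with a basic regular expression, lowercased, and counted at most once per tweet, and the example tweets are invented.

```python
from collections import Counter
from itertools import combinations
import re

HASHTAG = re.compile(r"#\w+")


def hashtag_cooccurrence(texts, top_n=50):
    """Count how often pairs of the `top_n` most frequent hashtags share a tweet."""
    tag_counts = Counter()
    tags_per_tweet = []
    for text in texts:
        # Lowercase and de-duplicate so each hashtag counts once per tweet.
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        tags_per_tweet.append(tags)
        tag_counts.update(tags)
    top_tags = {tag for tag, _ in tag_counts.most_common(top_n)}
    pair_counts = Counter()
    for tags in tags_per_tweet:
        kept = [tag for tag in tags if tag in top_tags]
        pair_counts.update(combinations(kept, 2))  # pairs are alphabetically ordered
    return pair_counts


# Invented example tweets.
example_texts = [
    "Budget day #budget2017 #nhs",
    "More funding needed #NHS #Budget2017",
    "Talks continue #brexit #nhs",
    "No tags here",
]
pairs = hashtag_cooccurrence(example_texts)
print(pairs.most_common())
```

Each entry of the resulting counter corresponds to one edge of the network, with the count serving as the edge weight.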

3.1.2 Compare how the parties differ in their topics of conversation

Conservative Party

# create plotting function
plot_network <- function(party_name, color) {
    subset <- subset(timelines, party == party_name)
    tweet_dfm <- dfm(subset$text, remove_punct = TRUE)
    tag_dfm <- dfm_select(tweet_dfm, ('#*'))
    toptag <- names(topfeatures(tag_dfm, 50))
    tag_fcm <- fcm(tag_dfm)
    topgat_fcm <- fcm_select(tag_fcm, toptag)
    textplot_network(topgat_fcm, min_freq = 0.5, edge_alpha = 0.8, edge_size = 5, edge_color = color)
}

plot_network("Conservative", color = "blue")

Labour Party

plot_network("Labour", color = "red")

Democratic Unionist Party

plot_network("Democratic Unionist Party", color = "magenta")

Green Party

The Green Party only has 1 Member of Parliament, hence the network analysis is omitted.

Liberal Democrats

plot_network("Liberal Democrat", color = "orange")


Plaid Cymru

plot_network("Plaid Cymru", color = "lightgreen")


Sinn Fein

plot_network("Sinn Fein", color = "black")


Scottish National Party

plot_network("Scottish National Party", color = "yellow")


3.1.3 Network of usernames

The following plot shows the network of the 50 most frequently mentioned usernames, in order to get a sense of how the MPs relate to each other.

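# Rebuild the document-feature matrix from all tweets
# (tweet_dfm above only holds the tweets after 7 November)
tweet_dfm <- dfm(timelines$text, remove_punct = TRUE)
# Extract most frequently mentioned usernames
user_dfm <- dfm_select(tweet_dfm, ('@*'))
topuser <- names(topfeatures(user_dfm, 50))
# Construct feature co-occurrence matrix of usernames
user_fcm <- fcm(user_dfm)
user_fcm <- fcm_select(user_fcm, topuser)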
textplot_network(user_fcm,
                 min_freq = 0.5,
                 edge_color = 'orange',
                 edge_alpha = 0.8,
                 edge_size = 5,
                 omit_isolated = TRUE)


3.2 Analyse whether having more space improves the writing style of the tweets

3.2.1 Readability

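Here we use the quanteda package to conduct a preliminary analysis of whether having 280 characters available changed the readability of the tweets. As can be seen below, the mean Flesch-Kincaid score increased slightly. Note that Flesch-Kincaid is a grade-level measure, so a higher score indicates text that is somewhat harder, not easier, to read.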
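For reference, the Flesch-Kincaid grade level is conventionally defined as shown below. This is the textbook formula, not a statement of quanteda's exact implementation, which may count words, sentences and syllables in tweets (with their hashtags, mentions and URLs) in its own way.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
              + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}}
              - 15.59
```

Both longer sentences and longer words push the score up, so tweets that use the extra space for longer sentences would be expected to score higher.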

# readability
library(lubridate)
# derive a calendar date for each tweet
timelines <- mutate(timelines,
                    year = year(created_at),
                    month = month(created_at),
                    day = day(created_at),
                    date = make_date(year, month, day))

# split the tweets at the date of the limit change
before <- subset(timelines, date < as.Date("2017-11-07"))
after <- subset(timelines, date >= as.Date("2017-11-07"))

# mean Flesch-Kincaid score per period
# (recent quanteda releases return a data frame here; in that case
#  average its Flesch.Kincaid column instead)
readability_before <- textstat_readability(before$text, measure = "Flesch.Kincaid")
mean_before <- mean(readability_before, na.rm = TRUE)

readability_after <- textstat_readability(after$text, measure = "Flesch.Kincaid")
mean_after <- mean(readability_after, na.rm = TRUE)

# create data frame for plotting
x <- data.frame(time = c('before \n(140 characters)', 'after \n(280 characters)'), readability = c(mean_before, mean_after))
x$time <- factor(x$time, levels = c('before \n(140 characters)', 'after \n(280 characters)'))

# plot
ggplot(x, aes(time, readability)) +
    geom_col() +
    labs(title='Readability of tweets before and after 7 November 2017',
         x='', y='Readability')


3.2.2 Lexical diversity

Next, the lexical diversity of the tweets is analysed using the type-token ratio (TTR). The higher character limit did not lead to more lexical diversity.
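As an illustration of the measure, the following base-R sketch computes a type-token ratio by hand. The helper ttr() and the two example sentences are hypothetical and only meant to show the arithmetic; the tokenisation is a simple split on non-word characters, which is an assumption and will not match quanteda's tokeniser exactly.

```r
# Type-token ratio of a single text: distinct tokens / total tokens
# (hypothetical helper for illustration, not part of quanteda)
ttr <- function(text) {
  # split on anything that is not a letter, digit, #, @ or apostrophe
  tokens <- tolower(unlist(strsplit(text, "[^[:alnum:]#@']+")))
  tokens <- tokens[nzchar(tokens)]
  length(unique(tokens)) / length(tokens)
}

# Invented example texts
short_tweet <- "Great day in the House today"
long_tweet  <- "Great day in the House today and a great debate in the House on the budget"

ttr(short_tweet)  # 6 distinct tokens out of 6
ttr(long_tweet)   # 11 distinct tokens out of 16
```

The example also hints at a caveat: longer texts tend to repeat common words, so TTR generally falls as text length grows. A lower or unchanged mean TTR after the limit increase may therefore partly reflect longer tweets and not a real change in vocabulary.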

# lexical diversity
# The ordinary Type-Token Ratio (TTR = V / N)
lexical_diversity_before <- textstat_lexdiv(dfm(before$text), measure = "TTR")
div_before <- mean(lexical_diversity_before)

lexical_diversity_after <- textstat_lexdiv(dfm(after$text),  measure = "TTR")
div_after <- mean(lexical_diversity_after)

# create data frame for plotting
x <- data.frame(time = c('before \n(140 characters)', 'after \n(280 characters)'), lexical_diversity = c(div_before, div_after))
x$time <- factor(x$time, levels = c('before \n(140 characters)', 'after \n(280 characters)'))

# plot
ggplot(x, aes(time, lexical_diversity)) +
    geom_col() +
    labs(title='Lexical diversity of tweets before and after 7 November 2017',
         x = '', y = 'Lexical diversity')


Summary

This notebook provided an Exploratory Data Analysis (EDA) of Twitter data on UK MPs. It showed that the number of characters MPs used under the 140-character limit is highly correlated with the number they used after the limit increased to 280 characters. It explored how the parties differ in their change from before to after 7 November 2017. It ranked the individual MPs and identified those with the largest increase and the largest decrease in the number of characters used. This finding is particularly interesting because it was not expected that some MPs would actually not make use of the additional space, and it provides a puzzle for future analyses. It was shown that some parties do not use more hashtags, but they do mention more usernames. It also found that, across the board, tweets with a higher number of characters received more engagement.

To provide the background information necessary to judge the change in characters, networks of topics of conversation and of user mentions were displayed, and a preliminary analysis of the effects on readability and lexical diversity was conducted.

References

Benoit, K. (2018). quanteda: Quantitative Analysis of Textual Data. R package version 0.99.22. doi: 10.5281/zenodo.1004683. http://quanteda.io
