# Introduction

Ever since COVID-19 reached the shores of Ghana, the president of Ghana (Nana Addo Dankwa Akufo-Addo) has been giving speeches to the nation on the COVID-19 situation in Ghana the Government’s response to this. Most, but not all, of these speeches are available transcribed on the official website of the presidency (http://www.presidency.gov.gh). I thought this would be a good moment to analyse these speeches. In this walkthrough, I will show how to use rvest to scrape the speeches from the web and how to use tidytext and SpacyR in R to do sentiment analysis and natural language processing on these speeches to analyse them. Instead of using ggplot2 for the visualizations, I used plotly to make some interactive visualizations for a change.

# Packages

These are the packages I use for this blog.

library(rvest)
library(tidyverse)
library(lubridate)
library(stringr)
library(spacyr)
library(tidytext)
library(plotly)
library(widyr)
library(igraph)
library(ggraph)
library(showtext)
library(ggtext)
library(kableExtra)
library(data.table)

# I need this font for some of the plots

# Webscrape the Speeches

website_scraper_f <- function(url){

# get the page links to different pages
# this is needed because only 5 blogs (speeches) are shown per page
html_nodes('.pagenav') %>%
html_attr("href") %>%
paste0("http://www.presidency.gov.gh", .) %>%
unique()

# For some reason the attributes are not saved the xml attributes in the same way,
# for example, wikipedia and other other websites do.
# so the links cannot be directly taken from the xml attributes.
# That is no problem though, as I can save the data as text and then use
# regular expressions to take out the relevant parts.
# I just manually inspected the first page and defined some cuts in the
# text that are needed to get the different attributes

# Because of this I just load the whole element as a character string
# and extract the relevant parts. To do this I need to see which characters
# 'surround' a certain element (for example URL) and the cut the string there.

title_cut_a <-  'itemprop=\"url\">\n\t\t\t\t\t'
title_cut_b <- '</a>\n\t\t\t\t\t\t\t'

date_cut_a <-  '<time datetime=\"'
date_cut_b <- '\" itemprop=\"dateCreated\"'

# create an empty list to store results
list_pages <- list()

# this is the HTML element that links to a speech

# transform that to a character string
as.character() %>%

# make it a tibble format
enframe() %>%

# using the different string I defined above I can no lift the title, link to and date of
# the different speeches. This method will only work on this specific website
mutate(title = str_match(value, paste0(title_cut_a,"\\s*(.*?)\\s*", title_cut_b))[,2],
date = str_match(value, paste0(date_cut_a,"\\s*(.*?)\\s*", date_cut_b))[,2] %>%
ymd_hms() %>%
as_date())
}

# combine the list of pages into one dataframe
list_pages_df <- do.call( "rbind", list_pages)

# create a helper function that opens a link and saves only the raw text on the page
html_nodes(xpath = '//*[@itemprop="articleBody"]') %>%
html_text()
}

# Vectorize this function to speed it up and let it easily work in a mutate command

# apply this function to all the links
list_pages_df <- list_pages_df %>%

# return the output
list_pages_df
}

Now I have this function and I can just use it on the website of the presidency to crawl through it and return all the speeches given by the president to me in one neat dataset.

speeches_Ghana_President <- website_scraper_f("http://www.presidency.gov.gh/index.php/briefing-room/speeches")

For this blog, I am only interested in the speeches that are related to COVID-19, so I filter for those. Address To The Nation By President Akufo-Addo On Updates To Ghana’s Enhanced Response To The Coronavirus Pandemic’ and President Akufo-Addo Addresses Nation On Updates To Ghana’s Enhanced Response To The Coronavirus Pandemic are actually the same speech. So I will keep only one.

covid_speeches <- speeches_Ghana_President %>%
select(-value, -name) %>%
# use regular expressions to filter for COVID-19 related speeches
filter(grepl("Corona", title) |
grepl("COVID", title)) %>%

So far 11 speeches are available on COVID-19. I think the president gave more speeches, but those only exist in a video format and were not (yet) transcribed on the website of the presidency.

# Prepare Data

Before I start with the actual sentiment analysis. I am going to inspect the data a little further. But even before that, the data needs to be cleaned and reformatted. There are different packages available in R to work with text data, but I am going to use SpacyR and tidytext. Some standard steps in text analyses include sentence segmentation (Sentenization), tokenization and lemmatization. Sentenization is the splitting of text into different sentences, tokenization refers to the process of splitting text into tokens, usually words, and lemmatization means text normalization (while taken into consideration the Morphology of the word). In simpler terms, lemmatization sets a word to its root. So, for example the lemma of “cats” is*” cat”* and the lemma of *“played”* is *“play”*. Thus, if I lemmanize this sentence *“I am wrote three articles on using OSM maps of Ghana”* I would get *“I be write three article on use OSM map of Ghana."* For more information on these processes please refer to this or this blog. For now, it is just important to know that lemmatization is needed because sentiments are also expressed in lemmas.

To do the actual lemmatization I use the SpacyR package. This package is “an R wrapper to the spaCy “industrial strength natural language processing”” Python library from https://spacy.io." However, to split the speeches into individual sentences, I use the tidytext package. I have found this package to be better as splitting sentences than the SpacyRsentencizer. This use of combination of packages means that I need to use a for loop.

## Start SpacyR

SpacyR is merely a wrapper around the Python library Spacy. So to start it, I need to initialize a Python session. Thankfully, SpacyR does that for me smoothly. I also need to set the language of this instance. You can imagine that to do the tokenization and lemmanization of English text you need an extensive English language dictionary. SpacyR supports a whole set of languages.

# This is only needed the first time you use spacy
# spacy_install()

# start the English language dictionary of spacy
spacy_initialize(model = "en_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")

## Lemmanize speeches

I have found SpacyR’s lemmatization is a bit smarter than some other solutions. Instead of just looking at the words, Spacy inspects the ‘entity’ of the word prior to lemmatization This means words are first put in context, before they are lemmatized. This way, for example, the ‘Greater’ in ‘Greater Accra Region’ does not get lemmatized to ‘great’, but remains ‘Greater’ because it is part of the entity Greater Accra Region, but a greater cost is lemmatized to ‘a’, ‘great’, and ‘cost.’

# get the title of the different speeches
speech_titles <- covid_speeches$title # create an empty list list_speeches <- list() # loop over the different speeches for(i in 1:length(speech_titles)){ # use tidytext sentencizer to separate text into sentences. sentences <- covid_speeches %>% filter(title == speech_titles[i]) %>% unnest_tokens(sentence, # this is the name of the output text, # this is the name of the input token = "sentences", # this indicates that I want to have sentences as my output drop = T, # I want to drop the original text to_lower = F) # I do not, yet, want to transform to lowercase letters # Now save the different sentences in one vector and numerically name it speeches <- sentences$sentence
names(speeches) <- 1:length(speeches)

# Now I use SpacyR to parse the sentences
list_speeches[[i]] <- spacy_parse(speeches,          # input file
lemma = TRUE,      # yes, I want the lemmas
entity = TRUE,
nounphrase = TRUE, #Yes, I want to separate the text into nounphrases

# some additional attributes that spacyr can return
# include whether a token is punctuation
# or a stop word
"is_stop")) %>%
# Now I can just save the output as a tibble
as_tibble() %>%
# Delete the sentence ID
select(-sentence_id) %>%
# spacyR has named the input as doc_id, but in this case the output ids are sentences
rename(sentence_nr = doc_id) %>%
# I want to add the name of the speech as a column
mutate(speech =  speech_titles[i])
}

# I now have a list with the data for 11 speeches, so I can bind those together in 1 dataframe
parsedtxt <- do.call("rbind", list_speeches) %>%
as_tibble() %>%
# make sure the sentence number is numeric
mutate(sentence_nr = as.numeric(sentence_nr))

This is how the output now looks like:

kable() %>%
kable_styling(c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "300px")
sentence_nrtoken_idtokenlemmaposentitynounphrasewhitespaceis_punctis_stopspeech
13ThetheDETbegTRUEFALSETRUEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
14NationnationNOUNend_rootTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
16ThetheDETbegTRUEFALSETRUEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
17PresidentPresidentPROPNend_rootTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
19ThetheDETGPE_BbegTRUEFALSETRUEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
110RepublicRepublicPROPNGPE_Iend_rootFALSEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
111,,PUNCTTRUETRUEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
112NanaNanaPROPNTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
115AkufoAkufoPROPNFALSEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
116--PUNCTFALSETRUEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
118,,PUNCTTRUETRUEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
14GhanaGhanaPROPNbeg_rootFALSEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
15’s’sPARTTRUEFALSETRUEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
11EnhancedEnhancedPROPNbegTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
12ResponseResponsePROPNend_rootTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
14ThetheDETbegTRUEFALSETRUEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
15CoronavirusCoronavirusPROPNmidTRUEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
16PandemicPandemicPROPNend_rootFALSEFALSEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus
17,,PUNCTTRUETRUEFALSEUpdate No 14: Measures Taken To Combat Spread Of Coronavirus

# Visualize Raw Data

Now we have a dataset with 29,925 rows, in which every row represents a word. Before I start with the actual sentiment analysis, I would like to visualize the data a bit.

## Zipf’s Law

One common way of visualizing text data is by counting how often certain words are used in the text. One can then also compare that to how often you would expect words to be used, using Zipf’s law. To quote Wikipedia: “Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc." This is quite an interesting finding and apparently holds true in many many many situations. This is an interesting paper on it. In formula form, Zipf’s law for the frequency n of the r$^{th}$ most frequent word of a text follows can be expressed as:

$$n(r) \propto \frac{1}{r^{\alpha}}$$ where $\alpha$ is a constant close to 1.

Now, let’s see if the speeches of the president follow this law, assuming that $\alpha = 1$. First I clean the text a bit, by removing punctuation, spaces, and numbers from the text. Cleaning the output of spacy_parse() is quite easy. Every token gets a POS (part-of-speech) label, which indicates whether a word is an adjective, pronoun, etc. But it also separates things like spaces that are created by the HTML website input, such as ("\r\n”), punctuation such as commas, points and parentheses. For a full list of different POS labels, check the Spacy documentation. I also, with the exception of 4 words transform all tokens to lowercase, this way words that are capitalized at the beginning of a sentence are not double counted as two different words.

parsedtxt_clean <- parsedtxt %>%
filter(!c(pos %in% c("PUNCT", "SPACE", "NUM"))) %>%
filter(!token %in% c("%", "&", "-")) %>%
# normalize most words to be lower case  with 4 exceptions.
mutate(token = ifelse((grepl("^Ghana", token) |
grepl("^Accra", token)|
grepl("COVID-19", token)|
grepl("Coronavirus", token)),  token, tolower(token))) %>%
count(token ) %>%
arrange(desc(n)) %>%
# I need this data.table frank() function to do a proper ranking
mutate(rank = frank(desc(n),  ties.method = "dense")) %>%
mutate(expected = n[1] / rank) %>%
mutate(prop = n / sum(n),
prop_cum = cumsum(prop),
expected_prop = expected / sum(expected)) %>%
# I only want to keep the words that are used more than 3 times
filter(n > 3) %>%
group_by(rank) %>%
# This part is only added for the vizualization
# This way words with the same frequency do not overlap
mutate(n_words = n()) %>%
mutate(group_id = row_number()-1) %>%
mutate(n_stacked = n + group_id*2.5 ) %>%
ungroup()

Now I will use plotly to visualize the word counts. plotly uses HTML tags instead of regular expressions (like ggplot2 would). So to add a line break, I have to use <br> instead of \n and to boldface text I use <b> and </b>.

# I want to annotate some words in the plot.
annotationed_words <- parsedtxt_clean %>%
filter(token %in% c("COVID-19", "Coronavirus", "Ghana", "virus"))

fig1 <- plot_ly(data = parsedtxt_clean,  # data
type = 'scatter',        # type of plot
mode = 'markers',
width = 1200, height = 500) %>%
x = ~rank,             # x-axis
y = ~n_stacked,        # y-axis
text = ~paste("<b>",   # text that is show when hovering over point
token,
"</b>",
'<br>n times mentioned:',
"<b>",
n ,
"</b>"),
hoverinfo = 'text', # make sure only the text is shown an nothing else
marker = list(color='#006B3F',  # format look of points
opacity = 0.5,
size = 7,
line = list(
color = '#006B3F',
opacity = 0.9,
width = 3)),
name = 'observed', # give it a legend title
showlegend = T
) %>%
# add a line for the observed values
x = ~rank,
line = list(shape = "spline",
color='#006B3F',
size = 5),
name = 'observed curve',
mode = 'lines',
type = 'scatter',
hoverinfo = "none") %>%
# add a line for the expected values in red
x = ~rank,
line = list(shape = "spline",
color = '#CE1126',
size = 5),
name = "expected using Zipf's law",
mode = 'lines',
type = 'scatter',
hoverinfo = "none")  %>%
# format layout
layout(title = list(text = "Word Count COVID-19 Speeches by President of Ghana", y = .99),
annotations = list(
x = annotationed_words$rank, y = annotationed_words$n_stacked,
text = annotationed_words$token, xref = "x", yref = "y", # Manually make sure labels do not overlap # for Coronavirus and COVID-19 ax = c(-20, -20, -20, 20), ay = c(-40, -40, -30, -40), font = list(color = '#706f6f', family = 'Raleway', size = 14), showarrow = TRUE, arrowhead = 1), # add green rectangle shapes = list(type = 'rect', xref = 'x', x0 = 80, x1 = as.numeric(max(parsedtxt_clean$rank) + 1),
yref = 'y',
y0 = 0,
y1 = as.numeric(parsedtxt_clean[which.max(parsedtxt_clean$group_id), "n_stacked"] + 20), fillcolor = '#006B3F', line = list(color = '#006B3F'), opacity = 0.2), # change font font = list(color = '#706f6f', family = 'Raleway', size = 14), # format axis xaxis = list(title = "rank of word use", showline = TRUE, showgrid = FALSE, showticklabels = TRUE, linecolor = '#706f6f', linewidth = 2, autotick = TRUE, ticks = 'outside', tickcolor = '#706f6f', tickwidth = 2, ticklen = 5, tickfont = list(family = 'Raleway', size = 12, color = '#706f6f')), yaxis = list(title = "word count", showline = TRUE, showgrid = FALSE, showticklabels = TRUE, linecolor = '#706f6f', linewidth = 2, autotick = TRUE, ticks = 'outside', tickcolor = '#706f6f', tickwidth = 2, ticklen = 5, tickfont = list(family = 'Raleway', size = 12, color = '#706f6f')), showlegend = TRUE, legend = list(x = 0.65, y = .95)) %>% # add extra annotation add_annotations( x = 80, y = as.numeric(parsedtxt_clean[which.max(parsedtxt_clean$group_id), "n_stacked"] + 20),
text= "Words with the same frequency<br>are stacked alphabetically",
arrow = list(color="#006B3F")) %>%
# omit control panel
config(displayModeBar = FALSE)

fig1

Indeed the speeches by the president seem to follow the Zipf’s law function quite closely. He stays a bit above the curve, indicating a slightly larger than expected vocabulary. In total there are 3,358 different words in the speeches, but you only need 71 different words to get a majority (+50%) of the words used.

parsedtxt %>%
filter(!c(pos %in% c("PUNCT", "SPACE", "NUM"))) %>%
filter(!token %in% c("%", "&", "-")) %>%
# normalize most words to be lower case  with 4 exceptions.
mutate(token = ifelse((grepl("^Ghana", token) |
grepl("^Accra", token)|
grepl("COVID-19", token)|
grepl("Coronavirus", token)),  token, tolower(token))) %>%
count(token ) %>%
arrange(desc(n)) %>%
mutate(prop = n / sum(n),
prop_cum = cumsum(prop)) %>%
summarise(number_of_words = n(), # counts total number of unique words
# This check which is the first word (ordered by frequency)
# that has a cumulative proportion higher than 0.5 (50%)
nearest_to_50pc = which.max(prop_cum -0.5 > 0 & !duplicated(prop_cum -0.5 > 0)))
## # A tibble: 1 x 2
##   number_of_words nearest_to_50pc
##             <int>           <int>
## 1            3358              71

## Most Common Bi-grams

Instead of looking at simple word counts, I can extend this to bi-grams. Bi-grams measure how often two words are mentioned together. The tidytext package offers an easy way to calculate bi-grams. I can simply add token = "ngrams", n = 2 to the unnest_tokens() function. Of course to get meaningful bi-grams, I had to remove stops words, such as “the” and “a.”

covid_speeches %>%
# get bi-grams
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
# separate bi0grams into different columns
separate(bigram, c("word1", "word2"), sep = " ") %>%
# use tidytext stop word lexicon to filter stop words (the, a, etc.)
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>%
# count the stop words
count(word1, word2, sort = TRUE) %>%
# only show top-15
slice(1:15) %>%
kable() %>%
kable_styling(c("striped", "hover", "condensed"))
word1word2n
socialdistancing47
fellowghanaians42
covid1930
healthworkers25
coronaviruspandemic17
enhancedhygiene16
hygieneprotocols15
distancingprotocols14
cedisgh12
teachingstaff12
enhancedresponse11
fiftythousand11
godbless11
homelandghana11

Clearly, COVID-19 related bi-grams (“social distancing”, “health workers”, “Coronavirus pandemic” ) are relatively common (as would be expected on speeches on COVD-19).

Using the methods described in this chapter, We can also calculate the correlation between different words. Using the phi ($\phi$) coefficient, we can calculate how often words appear together relative to how often they appear separately. To define what ‘together’ means I need to subset the text in different sections. I have decided to do that by dividing the text in sections of 3 sentences each. This is a bit arbitrary, but considering the length of the speeches I do think this works well.

The Tidytext book does a good job explaining this phi ($\phi$) coefficient correlation:

“The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other. Consider the following table:"

Has word YNo word YTotal
Has word X$n_{11}$$n_{00}$$n_{1\cdot}$
No word X$n_{01}$$n_{00}$$n_{0\cdot}$
Total$n_{\cdot1}$$n_{\cdot0}$$n$

*“For example, that $n_{11}$ represents the number of [sections] where both word X and word Y appear, $n_{00}$ the number where neither appears, and $n_{10}$ and $n_{01}$ the cases where one appears without the other. In terms of this table, the phi coefficient is:"*

$$\phi = \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1\cdot} n_{0\cdot} n_{\cdot0} n_{\cdot1}}}$$

To actually calculate this correlation coefficient, I make use of the pairwise_cor() function of the widyr package. I then use ggplot2 with the ggraph extension to create a graph visualization of these correlations. To get the actual input of the plotting function I use the graph_from_data_frame() function from the igraph package.

covid_speeches %>%
# split text into sentences
unnest_tokens(sentence, text,
token = "sentences",
drop  = T,
to_lower = F) %>%
# create different sections (per speech) of 3 sentences
group_by(title) %>%
mutate(section = (row_number() %/% 3) + 1) %>%
ungroup() %>%
# split text into actual tokens
unnest_tokens(word, sentence) %>%
# filter out stop words
filter(!word %in% stop_words\$word) %>%
group_by(word) %>%
# omit words that are uncommon
filter(n() >= 22) %>%
# calculate correlations
pairwise_cor(word, section, sort = TRUE) %>%
# omit combinations with a low correlation
# otherwise the plot clutters
filter(correlation > .5) %>%
# create a graph data frame  using igraph
graph_from_data_frame() %>%
# create graph visualization
ggraph(layout = "fr") + #, algorithm = 'kk') +
# beautify plot a bit
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "grey", show.legend = FALSE) +
# green colour of the Ghanaian flag
geom_node_point(color = "#006B3F", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 10, family = "Raleway") +
theme_void() +
labs(title = "Correlations Between Most Common Words in Speeches") +
theme(plot.title  = element_text(family = "Raleway", size =30))

I think this is quite an insightful visualization, that shows which words are commonly used together.

# Sentiment analysis.

Now I have inspected the data and determined that it looks good to me, it is time to do the actual sentiment analysis. The most important part of doing a sentiment analysis is getting a lexicon (dictionary) with words with corresponding sentiments. There are many of such dictionaries available, some of which are free to use and others that require a license. I recommend this, this, or this page if you are interested in reading a bit more on available lexicons. I am using the Afinn lexicon. This lexicon scores words on a scale from -5 to -1 for words with a negative sentiment and from 1 to 5 for words with a positive sentiment.

as_tibble() %>%
rename(word = 1,
sentiment = 2)

This is how this dataset looks:

Afinn %>%
tail() %>%
kable() %>%
kable_styling(c("striped", "hover", "condensed"))
wordsentiment
youthful2
yucky-2
yummy3
zealot-2
zealots-2
zealous2

Now I can merge these sentiments to the text. After that I can compare the average sentiments of different quintiles of the speeches. Furthermore, I will create some extra statistics, such as the average length (number of sentences) of the speeches

sentence_length <- parsedtxt %>%
filter(!c(pos %in% c("PUNCT", "SPACE", "NUM"))) %>%
filter(!token %in% c("%", "&", "-")) %>%
group_by(speech, sentence_nr ) %>%
summarise(n_words = n(), .groups = "drop") %>%
group_by(speech) %>%
summarise(mean_sentence_length = mean(n_words),
n_sentences = n()) %>%
left_join(covid_speeches %>% select(title, date),
by = c("speech" = "title"))
parsedtxt %>%
# join sentiments
left_join(Afinn, by = c("lemma"= "word")) %>%
group_by(speech,  sentence_nr) %>%
# if there is no sentiment in the sentence, I make the sentiment 0.
mutate(sentiment = ifelse(is.na(sentiment), 0, sentiment)) %>%
# calculate the mean sentiment per sentence
summarise(sentiment_mean = mean(sentiment, na.rm =T), .groups = "drop") %>%
group_by(speech) %>%
# split text into quintiles
arrange(speech,  sentence_nr) %>%
mutate(section = ntile( sentence_nr , 5)) %>%
# calculate the mean sentiment per quintiles
group_by(speech, section) %>%
summarise(mean_Sep = mean(sentiment_mean)) %>%
# spread the results from a long to a wide format
pivot_wider(values_from  = mean_Sep,
names_from = section,
names_prefix = "section_") %>%
# compare the mean of the first 4 section to the last section
mutate(diff_sec_sec1_4 = section_5 - mean(c(section_1,section_2, section_3, section_4) )) %>%
ungroup() %>%
right_join(sentence_length, by = "speech") %>%
arrange(date) %>%
# select columns
select(speech, date, n_sentences, mean_sentence_length, section_1:diff_sec_sec1_4) %>%
# round values
mutate(across(where(is.numeric),  ~round(., 4))) %>%
kable() %>%
kable_styling(c("striped", "hover", "condensed"))
speechdaten_sentencesmean_sentence_lengthsection_1section_2section_3section_4section_5diff_sec_sec1_4
President Akufo-Addo Addresses Nation On Measures Taken By Gov't To Combat The Coronavirus Pandemic2020-03-152723.22220.0055-0.00510.01120.07130.15880.1381
Address To The Nation By President Akufo-Addo On Updates To Ghana’s Enhanced Response To The Coronavirus Pandemic2020-03-2712817.12500.05670.00000.01350.01480.11070.0894
Address To The Nation By President Akufo-Addo On Updates To Ghana’s Enhanced Response To The Coronavirus Pandemic2020-04-0510825.89810.06810.01110.03230.00370.04970.0208
President Akufo-Addo Provides Update On Ghana’s Enhanced Response To COVID-192020-05-108028.40000.0305-0.0105-0.01440.05410.09360.0787
Update No.10: Measures Taken To Combat Spread Of Coronavirus2020-05-319427.19150.02650.01130.00150.03510.09220.0736
Update No.11: Measures Taken To Combat Spread Of Coronavirus2020-06-148626.12790.04600.02840.0471-0.05020.06150.0437
Update No 12: Measures Taken To Combat Spread Of Coronavirus2020-06-219124.49450.0592-0.0010-0.0122-0.01860.04710.0403
Update No 13: Measures Taken To Combat Spread Of Coronavirus2020-06-286625.10610.04260.0130-0.05100.02190.05770.0510
Update No 14: Measures Taken To Combat Spread Of Coronavirus2020-07-2611224.80360.06760.04220.00350.02180.0253-0.0085

With only one exception, the latest speech by the president, all speeches had a final quintile that is more positive than the first 80% of the speech.

The same pattern can be observed if I plot the average sentiment per sentence compared to the sentence number. In this plotting I made the decision to only plot the a smoothed line through the different points, instead of the points themselves (that would be too cluttered). I use a Loess function (a generalized Savitzky–Golay filter with variable window size) to do the smoothing. My main difficulty with the plotting was that I wanted to create a function that shows the actual text of a sentence with positive words highlighted green and negative words highlighted in different shades of red.

colfunc_red <- colorRampPalette(c("#ff5252","#a70000"))
colfunc_green <- colorRampPalette(c("#A3D12D","#086D44"))

# get text data and use html tags to colour words green or red
text <- parsedtxt %>%
left_join(Afinn, by = c("lemma"= "word")) %>%
mutate(sentiment = ifelse(is.na(sentiment), 0, sentiment)) %>%
mutate(token_colour = ifelse(sentiment < 0,
paste0('<span style="color:',
colfunc_red(abs(min(sentiment)))[abs(sentiment)],
';"><b>',  token,'</b></span>'),
ifelse(sentiment > 0,
paste0('<span style="color:',
colfunc_green(abs(max(sentiment)))[abs(sentiment)],
';"><b>', token,'</b></span>'),
token))) %>%
mutate(nchar_token = nchar(token)) %>%
group_by(speech, sentence_nr) %>%
arrange(sentence_nr) %>%
# I also want to break up long sentences
mutate(nchar_token_cum = cumsum(nchar_token)) %>%
mutate(split = which.min(abs(nchar_token_cum - 60))) %>%
mutate(split2 = which.min(abs(nchar_token_cum - 120))) %>%
mutate(split3 = which.min(abs(nchar_token_cum - 180))) %>%
mutate(split4 = which.min(abs(nchar_token_cum - 240))) %>%
mutate(token_colour = ifelse(token_id == split & max(nchar_token_cum) >=120 ,
paste0(token_colour, "<br>"),
token_colour)) %>%
mutate(token_colour = ifelse(token_id == split2 & max(nchar_token_cum) >=120 ,
paste0(token_colour, "<br>"),
token_colour)) %>%
mutate(token_colour = ifelse(token_id == split3 & max(nchar_token_cum) >=180 ,
paste0(token_colour, "<br>"),
token_colour)) %>%
mutate(token_colour = ifelse(token_id == split4 & max(nchar_token_cum) >=240 ,
paste0(token_colour, "<br>"),
token_colour)) %>%
# Now I have added the colours to the words I can collapse them back into sentences
summarise(collapsed_sentence = paste0(token_colour, collapse=" ")) %>%
mutate(# delete space before comma
collapsed_sentence = gsub(" ,", ",", collapsed_sentence),
# delete space before full stop
collapsed_sentence = gsub(" \\.", ".", collapsed_sentence),
# delete space before exclamation point
collapsed_sentence = gsub(" !", "!", collapsed_sentence),
# delete space around hyphens
collapsed_sentence = gsub(" - ", "-", collapsed_sentence),
# delete space around brackets
collapsed_sentence = gsub("\$$", "(", collapsed_sentence), # delete space around brackets collapsed_sentence = gsub(" \$$", ")", collapsed_sentence)) %>%
ungroup() %>%
# Now I just add the dates to the speeches
left_join(sentence_length %>% select(speech, date), by = "speech")

After formatting the words into sentences, I also need to calculate the average sentiment per sentence and then smooth the data grouped by speech using the broom package. After that I add the dates to the speech labels and order the speeches in chronological order.

dat <- parsedtxt %>%
filter(!c(pos %in% c("PUNCT", "SPACE", "NUM"))) %>%
filter(!token %in% c("%", "&", "-"))  %>%
left_join(Afinn, by = c("lemma"= "word")) %>%
group_by(speech,  sentence_nr) %>%
mutate(sentiment = ifelse(is.na(sentiment), 0, sentiment)) %>%
summarise(sentiment_mean = mean(sentiment, na.rm = T), .groups = "drop") %>%
group_by(speech) %>%
group_by(speech,  sentence_nr) %>%
nest(data = -speech) %>%
# use combination of map and augment to get the fitted values of the Loess regression
mutate(
fit = map(data, ~ loess(sentiment_mean ~ sentence_nr,
data = .x,
span = 0.5)),
augmented = map(fit, augment)) %>%
# unnest the datasets per speech into one big dataset
unnest(augmented)  %>%
select(-data, - fit, - .resid) %>%
ungroup()

# combine the text and speech data
dat2 <- left_join(dat, text, by = c("sentence_nr", "speech" )) %>%
mutate(speech = paste0(speech, " (", date, ")")) %>%
# order speeches
mutate(speech = fct_reorder(speech, date))

I want to colour the lines that represent the different speeches in different in different colours. So I create a colour palette that is based on the colours of the Ghanaian coat of arms (https://www.schemecolor.com/ghana-coat-of-arms-colors.php).

# create a color palette from the colours of the Ghanaian coat of arms
colfunc <- colorRampPalette(c("#CE1126","#FCD116","#006B3F", "#0193DD", "#000000"))
# now all of this can be combined into a plotly plot
fig3 <- plot_ly(data = dat2,
type = 'scatter',
mode = "lines",
color = ~speech,
sort = FALSE,
colors = colfunc(11)) %>%
y = ~.fitted,
x = ~sentence_nr,
text = ~collapsed_sentence,
hoverinfo = 'text',
hoverlabel = list(  bgcolor = "white",
opacity = 0.5,
size = 7,
font= list(family = 'Raleway'),
align = "left",
line = list(
opacity = 0.9,
width = 3)),
showlegend = T)  %>%
layout(title = list(text = "<b>Sentiment of COVID-19 Speeches by the President of Ghana</b><br><sup>Speeches vary in length, but all end on a positive note.</sup>",
x =0.05,
xanchor = "left",
automargin = TRUE),
font = list(color = '#706f6f',
family = 'Raleway',
size = 14),
xaxis = list(title = "Sentence number",
showspikes = TRUE,
spikemode  = 'toaxis',
spikesnap = 'data',
showline = TRUE,
showgrid = FALSE,
showticklabels = TRUE,
linecolor = '#706f6f',
linewidth = 2,
autotick = TRUE,
ticks = 'outside',
tickcolor = '#706f6f',
tickwidth = 2,
ticklen = 5,
tickfont = list(family = 'Raleway',
size = 12,
color = '#706f6f')),
yaxis = list(title = "Average sentence sentiment (smoothed)",
showline = TRUE,
showgrid = FALSE,
showticklabels = TRUE,
linecolor = '#706f6f',
linewidth = 2,
autotick = TRUE,
ticks = 'outside',
tickcolor = '#706f6f',
tickwidth = 2,
ticklen = 5,
tickfont = list(family = 'Raleway',
size = 12,
color = '#706f6f')),
legend = list(title = list(text = "<b>Speeches</b><br><sup>(double click on a speech to isolate it and hover over lines to see text)</sup>"),
x = 0, y = -.8,
xanchor = "left"),
autosize = T,
margin = list(t = 100),
showlegend = TRUE) %>%
y = -0.05,
xanchor = "left",
showarrow = F,
align = "left",
text=  '<span style="color:#a70000;"><b>Negative values signify<br>negative sentiment</b></span>') %>%
y = 0.3,
xanchor = "left",
showarrow = F,
align = "left",
text=  '<span style="color:#A3D12D;"><b>Positive values signify<br>positive sentiment</b></span>') %>%
config(displayModeBar = FALSE)

fig3

# Limitations

Sentiment analysis is a bit controversial, so I think it is important to note some limitations of the analyses presented in this blog:

• Most importantly this analysis did only take into consideration the English language parts of the speeches. In some of the speeches the president did use some Twi. As of now I do not think any Twi languages dictionaries exist.
• I have not make use of any negators in my analysis. So the sentence “I am not happy," might have a positive sentiment, because I did not control for the “not” prior to “happy."
• You can argue about the validity of these sentiment lexicons. For example the word “fight” has a negative sentiment in this lexicon, but maybe you consider a fight (against COVID-19) a positive thing. This is especially true for the word “positive” in the context of COVID-19. The term “positive tests” is used frequently, but is not necessarily a positive thing.

# Valuable Information and Tutorials

These are some further valuable sources on these types of analysis:

All of these sources helped me write this post.

##### Ghana Data Stuff

This website is my hobby project to showcase some of the project I am working one, when I am not working on the official statistics at the Ghana Statistical Service.