Text mining
Data retrieval
Now, let’s move on to simple text analysis. First, as usual, we need to
prepare the data.
tokens <- toots %>%
select(id, content) %>% # Keeps only id and text/content of the tweet
unnest_tokens(word, content) # Creates tokens!
tokens
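To make the tokenization step concrete, here is a minimal sketch on a toy dataset. The tibble toy_toots and its two texts are hypothetical stand-ins for toots; only unnest_tokens() is taken from the code above.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row sample standing in for `toots`
toy_toots <- tibble(
  id = c("1", "2"),
  content = c("Hello Mastodon world", "Text mining is fun")
)

toy_tokens <- toy_toots %>%
  unnest_tokens(word, content)   # One row per word; words are lowercased by default

toy_tokens
```

Each text is split into one row per word, and the id column is repeated so that every token can still be traced back to its source text.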
Let’s have a look at word frequencies.
tokens %>%
count(word, sort = TRUE)
The top of this list is dominated by short words. Let’s filter them out (FIRST
METHOD).
tokens %>% mutate(length = nchar(word))
Data frequencies
Now let’s omit the small words (smaller than 5 characters).
NOTE: all the thresholds below depend on the
sample!
tokens %>%
mutate(length = nchar(word)) %>%
filter(length > 4) %>% # Keep words with length larger than 4
count(word, sort = TRUE) %>% # Count words
head(21) %>% # Keep only top 21 words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
A better way to proceed is to remove “stop words” like “a”, “I”,
“of”, “the”, etc. (SECOND METHOD). It also makes sense to
remove the search term and “https”.
data("stop_words")
tidy_tokens <- tokens %>%
anti_join(stop_words) # Remove irrelevant terms (stop words)
tidy_tokens %>%
count(word, sort = TRUE) %>% # Count words
head(20) %>% # Keep only top 20 words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
Problem: strange characters remain. We remove them by converting
the text to ASCII (words that cannot be converted become NA) and then
omitting the NA rows.
new_stop_words <- c("https", "span", "class", "href", "target", "_blank", "rel", "tag",
"mastodon.social", "ellipsis", "mastodon.online", "mstdn.social", "amp",
"http", "invisible", "03", search_term, tolower(search_term), "d0", "src",
"tags", "mention", "noreferrer", "noopener", "nofollow", "hashtag", "translate",
"www", "url", "die", "der", "und", "a", "p", "br", "1", "2", "01", "02")
tidy_tokens <- tokens %>%
anti_join(stop_words) %>% # Remove irrelevant terms (stop words)
mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII (non-convertible words become NA)
na.omit() %>% # Remove missing
filter(nchar(word) > 2, # Remove small words
!(word %in% new_stop_words) # search_term defined above
)
tidy_tokens %>%
count(word, sort = TRUE) %>% # Count words
head(30) %>% # Keep only top words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
Perfect!
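The ASCII conversion works as a filter because iconv() returns NA for elements it cannot convert. The sketch below illustrates this on three hypothetical words; the exact behaviour of iconv() can depend on the platform's conversion library, so treat it as an illustration rather than a guarantee.

```r
# Hypothetical sample: plain ASCII, accented Latin, Cyrillic
words <- c("climate", "caf\u00e9", "\u043c\u0438\u0440")

# With the default sub = NA, elements that cannot be converted come back as NA
ascii_words <- iconv(words, from = "UTF-8", to = "ASCII")
ascii_words

# Dropping the NA entries plays the role of na.omit() in the pipeline
kept <- ascii_words[!is.na(ascii_words)]
kept
```

One side effect is worth keeping in mind: legitimate words with accents or non-Latin scripts are discarded too, which is acceptable for an English-only analysis but not for a multilingual one.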
n-grams
See https://www.tidytextmining.com/ngrams.html
toots %>%
mutate(id = 1:nrow(toots)) %>% # This creates a tweet id
select(id, content, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
filter(!(bigram %in% c("span a", "a href", "a a", "_blank span", "href https", "span class",
"a p", "p p", "br a", "target _blank", "noreferrer target", "span span",
"class mention", "rel nofollow", "mention hashtag",
"class invisible", "invisible https", "p a",
"nofollow noopener", "noopener noreferrer", "hashtag rel"))) %>%
group_by(bigram) %>%
count(sort = T) %>%
head(20) %>%
ggplot(aes(y = reorder(bigram, n), x = n)) +
geom_col() + theme_bw() + ylab("bigrams")
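Again, stop words dominate the ranking, so we must remove them. This is
more complicated with bigrams because each one contains two words. The separate() function
helps: it splits each bigram into its two words, so that each can be checked against the stop-word list.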
toots %>%
mutate(id = row_number()) %>% # This creates a tweet id
select(id, content, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
na.omit() %>%
separate(bigram, c("word1", "word2"), sep = " ", remove = F) %>%
filter(!(word1 %in% c(new_stop_words, stop_words$word)),
!(word2 %in% c(new_stop_words, stop_words$word))) %>%
group_by(bigram) %>%
count(sort = T) %>%
head(24) %>%
ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram") +
theme_bw()
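The separate-then-filter logic used above can be checked on a toy example. The bigrams and the two-word stop list below are hypothetical and only serve the illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical bigrams and a tiny stop-word list, for illustration only
toy_bigrams <- tibble(bigram = c("of the", "climate change", "the climate", "open source"))
toy_stop <- c("of", "the")

clean_bigrams <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Split, keep original
  filter(!(word1 %in% toy_stop),   # Drop the bigram if its first word is a stop word
         !(word2 %in% toy_stop))   # ... or if its second word is one

clean_bigrams
```

A bigram survives only if neither of its words is a stop word, which is why "the climate" is dropped even though "climate" is informative.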
cloud_data <- toots %>%
mutate(id = row_number()) %>% # This creates a tweet id
select(id, content, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
na.omit() %>%
separate(bigram, c("word1", "word2"), sep = " ", remove = F) %>%
filter(!(word1 %in% c(new_stop_words, stop_words$word)),
!(word2 %in% c(new_stop_words, stop_words$word))) %>%
group_by(bigram) %>%
count(sort = T) |>
mutate(length = nchar(bigram)) |>
filter(length < 15)
cloud_data
wordcloud(words = cloud_data$bigram,
freq = cloud_data$n, min.freq = 10,
max.words = 35, random.order = FALSE, rot.per = 0.10,
colors = brewer.pal(8, "Dark2"))
Sentiment
This section is inspired by: https://www.tidytextmining.com/sentiment.html
During the process, you may be asked to confirm that you really want
to download data (lexicons).
Just say yes in the console (type the expected answer:
otherwise, the session will stay stuck waiting for input).
First, we need to load a sentiment lexicon. AFINN is one such
sentiment database: it assigns an integer score to each word.
if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
Loading required package: textdata
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
afinn |> filter(value > 3)
To create a nice visualization, we need to extract the
time of the tweets.
tokens_time <- toots %>%
mutate(id = row_number()) %>% # This creates a tweet id
select(id, content, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(word, content) # Creates tokens!
tokens_time
We then use inner_join() to merge the two sets. This
function removes the cases when a match does not occur.
library(lubridate)
sentiment <- tokens_time %>%
inner_join(afinn) %>%
mutate(day = day(created_at),
hour = hour(created_at) / 24,
minute = minute(created_at) / 60 / 24,
time = day + hour + minute)
Joining with `by = join_by(word)`
sentiment
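To see what inner_join() keeps and drops, here is a small sketch with a hypothetical lexicon. The words and scores in toy_lexicon are made up for the example and are not read from AFINN.

```r
library(dplyr)

# Hypothetical tokens from two texts
toy_tokens <- tibble(id = c(1, 1, 1, 2, 2),
                     word = c("not", "happy", "today", "terrible", "weather"))

# Hypothetical lexicon with illustrative scores
toy_lexicon <- tibble(word = c("happy", "terrible"), value = c(3, -3))

toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word")   # Only words present in both tables survive

toy_sentiment
```

Only the two scored words remain. Words without a score, including "not", disappear from the data, so any averaging done afterwards is based on the scored words alone.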
We then compute the average sentiment, minute-by-minute, or
day-by-day, depending on frequency.
Of course, average sentiment can be misleading. Indeed, if a text
contains the terms “I’m not happy”, then only “happy”
will be tagged, which is the opposite of the intended meaning.
sentiment %>%
mutate(date = as.Date(created_at)) |>
group_by(date) %>%
#filter(year(date)==2024) |>
summarise(avg_sentiment = mean(value)) %>%
ggplot(aes(x = date, y = avg_sentiment)) + geom_col() + theme_bw()
What about emotions? The NRC lexicon assigns words to
emotion categories. Below, we order these categories from the most positive to the most negative. The most important distinction
is the dichotomy between positive & negative emotions.
nrc <- get_sentiments("nrc")
nrc <- nrc %>%
mutate(sentiment = as.factor(sentiment),
sentiment = recode_factor(sentiment,
joy = "joy",
trust = "trust",
surprise = "surprise",
anticipation = "anticipation",
positive = "positive",
negative = "negative",
sadness = "sadness",
anger = "anger",
fear = "fear",
disgust = "disgust",
.ordered = T))
nrc
We then create the merged dataset.
emotions <- tokens_time %>%
inner_join(nrc) %>% # Merge data with sentiment
mutate(date = as.Date(created_at)) # Create day column
Joining with `by = join_by(word)`
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.
emotions # Show the result
Merging has reduced the size of the dataset, but enough data
remains to pursue the study.
Finally, we move to the pivot-table that counts emotions for each
day.
g <- emotions %>%
group_by(date, sentiment) %>%
summarise(intensity = n()) %>%
filter(year(date) == 2024) |>
ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col() +
theme_bw() + # Complete theme first, otherwise it overwrites the tweaks below
theme(axis.text.x = element_text(angle = 80,
size = 10,
hjust = 1)) + xlab("Time") +
scale_fill_viridis_d(option = "magma", direction = -1)
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
ggplotly(g)
This can also be shown in percentage format.
g <- emotions %>%
group_by(date, sentiment) %>%
filter(year(date) == 2024) |>
summarise(intensity = n()) %>%
ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
theme_bw() + # Complete theme first, otherwise it overwrites the tweaks below
theme(axis.text.x = element_text(angle = 80,
size = 10,
hjust = 1)) + xlab("Time") +
scale_fill_viridis_d(option = "magma", direction = -1) +
geom_hline(yintercept = 0.5, linetype = 2)
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
ggplotly(g)
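With position = "fill", each bar is rescaled so that the emotions of a given day sum to one. If the shares themselves are needed as numbers, they can be computed directly. This is a minimal sketch on hypothetical data: toy_emotions is made up, and share is a column name introduced for the example.

```r
library(dplyr)

# Hypothetical emotion tags over two days
toy_emotions <- tibble(
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02")),
  sentiment = c("joy", "joy", "fear", "fear")
)

shares <- toy_emotions %>%
  count(date, sentiment, name = "intensity") %>%   # Same counts as summarise(intensity = n())
  group_by(date) %>%
  mutate(share = intensity / sum(intensity)) %>%   # Proportion within each day
  ungroup()

shares
```

Days with few observations produce shares that look just as decisive as busy days, so it can help to inspect intensity alongside share before reading too much into a single bar.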
emotions %>%
mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>%
group_by(date, sentiment) %>%
summarise(intensity = n()) %>%
ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
theme_bw() +
theme(axis.text.x = element_text(angle = 80,
size = 10,
hjust = 1)) + xlab("Time") +
geom_hline(yintercept = 0.5) +
scale_fill_manual(values = c("#223333", "#FFBB99"))
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
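The split into two groups relies on the fact that an ordered factor can be compared with one of its levels. The sketch below reproduces the ordering defined for the NRC emotions on a hypothetical four-element vector.

```r
# Same ordering as the one used for the NRC lexicon
emotion_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                    "negative", "sadness", "anger", "fear", "disgust")

# Hypothetical sample of emotion tags
x <- factor(c("joy", "fear", "positive", "negative"),
            levels = emotion_levels, ordered = TRUE)

is_positive <- x < "negative"   # TRUE for every level placed before "negative"
group <- ifelse(is_positive, "positive", "negative")
group
```

The comparison only makes sense because the levels were deliberately arranged with all positive emotions first; with an unordered factor, the same expression would return NA with a warning.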
Advanced sentiment
The problem with the preceding methods is that they don’t take into
account valence shifters (i.e., negators, amplifiers
(intensifiers), de-amplifiers (downtoners), and adversative
conjunctions). If a tweet says not happy, counting the word
happy is not a good idea! The package sentimentr is
built to circumvent these issues: have a look at https://github.com/trinker/sentimentr
(see also: https://www.sentometrics.org and the book
Supervised Machine Learning for Text Analysis in R
hosted at https://smltar.com)
I haven’t tested aws.comprehend, but it seems
promising: https://github.com/cloudyr/aws.comprehend
if(!require(sentimentr)){install.packages(c("sentimentr", "textcat"))}
library(sentimentr)
library(textcat)
First, let’s keep only the tweets written in English!
# toots_en <- toots %>%
# mutate(language = textcat(content)) %>%
# filter(language == "english") %>%
# dplyr::select(created_at, content)
toots_en <- toots |> filter(language == "en")
NOTE: the commented-out code above only illustrates the function
textcat: the language is already coded in the tweets via the
language column/variable. (it suffices to keep the
instances for which language == “en”)
Next, we compute advanced sentiment.
tweet_sent <- toots_en$content %>%
get_sentences() %>% # Intermediate function
sentiment() # Sentiment!
tweet_sent
NOTE: the right time scale depends on how often the search term appears.
Daily or hourly aggregation is usually a good choice; if a term is very popular,
higher frequencies (e.g., hourly) become relevant.
toots_en %>%
mutate(element_id = row_number(), # Assumption: element_id in tweet_sent indexes the rows of toots_en
date = as.Date(created_at)) %>% # Create day column
inner_join(as.data.frame(tweet_sent), by = "element_id") %>% # Attach sentence-level scores
group_by(date) %>%
summarise(avg_sent = mean(sentiment)) %>% # Average sentiment per day
ggplot(aes(x = date, y = avg_sent)) + geom_col()
