---
title: "Third party data and basic text mining"
output:
  html_document:
    toc: yes
    df_print: paged
  html_notebook:
    toc: yes
    toc_float: yes
---

# The general idea

Data transfer is highly controlled. The key notions are **authentication** and **protocol**.

# Downloading toots with *rtoot*

Several packages provide an interface to Twitter: *rtweet*, *RTwitterAPI*, *streamR* and *twitteR*.  
But since Auth V2, we need **RTwitterV2**, and it only runs on R v4.2!  
Documentation: https://github.com/MaelKubli/RTwitterV2    
Recent packages are better because firms update their API policies (and access), so old protocols sometimes stop working!   
Unfortunately, the Twitter API is no longer free!   
Hence, in this notebook, we will test a competitor: [**Mastodon**](https://joinmastodon.org/)!  
The package for this is [**rtoot**](https://schochastics.github.io/rtoot/).

## First things first

**First**, the packages. Download...

```{r, warning = FALSE, message = FALSE}
if(!require(rtoot)){install.packages("rtoot")}
```

... and activate.

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(plotly)
library(rtoot)
```

## Authentication

**Second**: authentication.  
You have to choose a particular instance of the network.  
Personally, I am registered on "sciences.social"; the largest one is "mastodon.social" (see https://mastodonservers.net/servers/top).  
=> If you are prompted, write the answer without the quotation marks and choose a public token.

```{r}
rtoot::auth_setup(
  instance = "mastodon.social",
  type = "public"
)
```

```{r}
# get_timeline_hashtag(hashtag = "rstats", 
#                      instance = "mastodon.social",
#                      limit = 200)
```


Authentication can be an important part of the process. For more info on that:  

- https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html   
- https://httr.r-lib.org/reference/index.html (section Authentication)   
- https://blog.r-hub.io/2021/01/25/oauth-2.0/

## Extraction

If no error appears, we are ready to query. Depending on the number of requested toots, this can take some time.  

The package allows different types of queries.   
For instance, below we use the **get_timeline_hashtag** function to access toots that include one particular term, the "hashtag".  



```{r}
search_term <- "election"
toots <- get_timeline_hashtag(hashtag = search_term, 
                              instance = "mastodon.social",
                              limit = 2000)
```



# Text mining

## References
The reference book is: https://www.tidytextmining.com      
A great interactive tutorial: https://juliasilge.shinyapps.io/learntidytext/    
And the package is:  

```{r, message = FALSE, warning = FALSE}
if(!require(tidytext)){install.packages("tidytext", repos = "https://cloud.r-project.org/")}
library(tidytext)
```

(see also: https://quanteda.io/index.html)

## Data retrieval

Now, let's move forward to simple text analysis. First, we need to prepare the data! (as usual)

```{r, warning = FALSE, message = FALSE}
tokens <- toots %>% 
    select(id, content) %>%             # Keeps only id and text/content of the toot
    unnest_tokens(word, content)        # Creates tokens!
tokens
```
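
Because Mastodon delivers the `content` field as HTML, the tokens above also contain markup such as `span`, `href` or `class`. One optional way to reduce this noise, sketched below, is to strip the tags before tokenizing. This is not what the rest of the notebook does (it relies on stop-word lists instead): the helper `strip_html()` and the toy toot are hypothetical, and the regular expression is a rough approximation, not a full HTML parser.

```{r}
# Hypothetical helper (not part of rtoot or tidytext): removes anything that looks like an HTML tag
strip_html <- function(x) {
    x <- gsub("<[^>]+>", " ", x)        # Replace each tag by a space
    x <- gsub("\\s+", " ", x)           # Collapse repeated whitespace
    trimws(x)                           # Remove leading/trailing spaces
}

# Made-up toot content, for illustration only
toy_toot <- '<p>Vote in the <a href="https://example.org/tags/election" class="mention hashtag">#election</a> today</p>'
strip_html(toy_toot)
```

In the pipeline, this could be applied with `mutate(content = strip_html(content))` before `unnest_tokens()`. HTML entities such as `&amp;` are left untouched by this sketch.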

Let's have a look at word frequencies.

```{r, warning = FALSE, message = FALSE}
tokens %>%
    count(word, sort = TRUE)
```

This list is polluted by small words. Let's filter them out by word length (*FIRST METHOD*). We start by computing the length of each word.

```{r, warning = FALSE, message = FALSE}
tokens %>% mutate(length = nchar(word))
```


## Data frequencies
Now let's omit the small words (smaller than 5 characters).   
**NOTE**: all the thresholds below depend on the sample! 

```{r, warning = FALSE, message = FALSE}
tokens %>%
    mutate(length = nchar(word)) %>%
    filter(length > 4) %>%             # Keep words with length larger than 4
    count(word, sort = TRUE) %>%       # Count words
    head(21) %>%                       # Keep only top 21 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
```

A better way to proceed is to remove "stop words" like "a", "I", "of", "the", etc. (*SECOND METHOD*).
It would also make sense to remove the search term and "https".

```{r, warning = FALSE, message = FALSE}
data("stop_words")
tidy_tokens <- tokens %>% 
    anti_join(stop_words, by = "word")       # Remove irrelevant terms
tidy_tokens %>%
    count(word, sort = TRUE) %>%             # Count words
    head(20) %>%                             # Keep only top 20 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
```
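
To see what `anti_join()` does here, the sketch below applies it to a handful of made-up tokens (the tibble `toy_tokens` is an assumption for illustration): every row whose `word` appears in `stop_words` is dropped, the others are kept.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidytext)
data("stop_words")

toy_tokens <- tibble(word = c("the", "election", "of", "a", "ballot"))   # Made-up tokens
toy_clean <- toy_tokens %>%
    anti_join(stop_words, by = "word")     # Keeps only the rows with no match in stop_words
toy_clean
```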

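**Problem**: strange characters remain. We remove them by converting the text to ASCII format (words that cannot be converted become *NA*) and then omitting the *NA* data. We also drop the HTML residue, the search term and a few short words with a custom list of stop words.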

```{r, warning = FALSE, message = FALSE}
new_stop_words <- c("https", "span", "class", "href", "target", "_blank", "rel", "tag",
                    "mastodon.social", "ellipsis", "mastodon.online", "mstdn.social", "amp",
                    "http", "invisible", "03", search_term, tolower(search_term), "d0", "src",
                    "tags", "mention", "noreferrer", "noopener", "nofollow", "hashtag", "translate",
                    "www", "url", "die", "der", "und", "a", "p", "br", "1", "2", "01", "02")
tidy_tokens <- tokens %>% 
    anti_join(stop_words, by = "word") %>%               # Remove irrelevant terms
    mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII (NA if impossible)
    na.omit() %>%                                        # Remove missing
    filter(nchar(word) > 2,                              # Remove small words
           !(word %in% new_stop_words)  # search_term defined above
    )
tidy_tokens %>%
    count(word, sort = TRUE) %>%         # Count words
    head(30) %>%                         # Keep only top words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()
```

Perfect!
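
One caveat about the conversion step, illustrated below on two made-up words: `iconv()` returns `NA` for any string that cannot be represented in ASCII, so `na.omit()` then drops the whole token. Accented words (frequent in non-English toots) therefore disappear entirely instead of being cleaned.

```{r}
toy_words <- c("vote", "\u00e9lection")                        # Second word starts with an accented e
toy_ascii <- iconv(toy_words, from = "UTF-8", to = "ASCII")    # Same call as in the pipeline above
toy_ascii                                                      # The accented word becomes NA
toy_ascii[!is.na(toy_ascii)]                                   # What survives the NA removal
```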

## Word cloud

This data can also be shown with a word cloud. We simply use the *wordcloud* package: https://cran.r-project.org/web/packages/wordcloud/index.html 

The package *wordcloud2* adds a few features: https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

```{r, warning = FALSE, message = FALSE, fig.width=8}
if(!require(wordcloud)){install.packages("wordcloud")}
library(wordcloud)
cloud_data <- tidy_tokens %>% count(word)
wordcloud(words = cloud_data$word, 
          freq = cloud_data$n, min.freq = 10,
          max.words = 82, random.order = FALSE, rot.per = 0.15, 
          colors = brewer.pal(8, "Dark2")) 
```

## n-grams

See https://www.tidytextmining.com/ngrams.html

```{r bigrams, message = F, warning = F}
toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    filter(!(bigram %in% c("span a", "a href", "a a", "_blank span", "href https", "span class",
                           "a p", "p p", "br a", "target _blank", "noreferrer target", "span span",
                           "class mention", "rel nofollow", "mention hashtag",
                           "class invisible", "invisible https", "p a",
                           "nofollow noopener", "noopener noreferrer", "hashtag rel"))) %>%
    group_by(bigram) %>%
    count(sort = TRUE) %>%
    head(20) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + 
    geom_col() + theme_bw() + ylab("bigrams")
```

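Again, the same issue with stop words arises, so we must remove them once more. It is more complicated now because each bigram contains two words.
The *separate*() function helps: it splits each bigram into its two words, so that each of them can be checked against the stop-word lists.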

```{r}
toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%     # Non-ASCII bigrams become NA
    na.omit() %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Split each bigram into its two words
    filter(!(word1 %in% c(new_stop_words, stop_words$word)),             # Drop bigrams with a stop word
           !(word2 %in% c(new_stop_words, stop_words$word))) %>%
    group_by(bigram) %>%
    count(sort = TRUE) %>%
    head(24) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram") +
    theme_bw()
```
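
The logic of this filter is easier to follow on a tiny example. The sketch below uses two made-up sentences and a made-up stop-word list (`toy_text` and `toy_stop` are assumptions for illustration): a bigram is kept only if neither of its two words is a stop word.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidyr)
library(tidytext)

toy_text <- tibble(id = 1:2,
                   content = c("the election results are in",
                               "election results shock the nation"))
toy_stop <- c("the", "are", "in")                      # Made-up stop-word list

toy_bigrams <- toy_text %>%
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
    filter(!(word1 %in% toy_stop),                     # Drop if the first word is a stop word
           !(word2 %in% toy_stop)) %>%                 # Drop if the second word is a stop word
    count(bigram, sort = TRUE)
toy_bigrams
```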

```{r, fig.width=8}
cloud_data <- toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
    na.omit() %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
    filter(!(word1 %in% c(new_stop_words, stop_words$word)),
           !(word2 %in% c(new_stop_words, stop_words$word))) %>%
    group_by(bigram) %>%
    count(sort = TRUE) %>%
    mutate(length = nchar(bigram)) %>%
    filter(length < 15)                   # Keep short bigrams so that they fit in the cloud
  
cloud_data
wordcloud(words = cloud_data$bigram, 
          freq = cloud_data$n, min.freq = 10,
          max.words = 35, random.order = FALSE, rot.per = 0.10, 
          colors = brewer.pal(8, "Dark2")) 
```


## Sentiment

This section is inspired by: https://www.tidytextmining.com/sentiment.html    
Sometimes, you may be asked in the process if you *really* want to download data (lexicons).  
Just say yes in the **console** (type the correct answer: if not, you will be blocked/stuck).

First, we need to load a sentiment lexicon. AFINN is one such sentiment database: it assigns each word an integer score between -5 (very negative) and 5 (very positive). 

```{r}
if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
afinn |> filter(value > 3)
```

To create a nice visualization, we need to keep the **time** at which each toot was posted.

```{r}
tokens_time <- toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(word, content)          # Creates tokens!
tokens_time
```

We then use **inner_join**() to merge the two sets. This function keeps only the words that appear in both sets: tokens without a match in the lexicon are dropped.

```{r}
library(lubridate)
sentiment <- tokens_time %>% 
    inner_join(afinn, by = "word") %>%    # Keep only the words scored by AFINN
    mutate(day = day(created_at),
           hour = hour(created_at) / 24,
           minute = minute(created_at) / 60 / 24,
           time = day + hour + minute)
sentiment
```
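
The effect of the join can be checked on a small example. In the sketch below, both the tokens and the lexicon are made up (`toy_lexicon` only mimics the two-column layout of AFINN): out of five tokens, only the two that appear in the lexicon survive, each with its score.

```{r, warning = FALSE, message = FALSE}
library(dplyr)

toy_tokens_sent <- tibble(id = c(1, 1, 1, 2, 2),
                          word = c("happy", "election", "day", "bad", "result"))
toy_lexicon <- tibble(word = c("happy", "bad"),      # Made-up lexicon in the style of AFINN
                      value = c(3, -3))

toy_scored <- toy_tokens_sent %>%
    inner_join(toy_lexicon, by = "word")             # Words absent from the lexicon disappear
toy_scored
```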

We then compute the average sentiment, minute-by-minute or day-by-day, depending on how frequently toots are posted.   
Of course, average sentiment can be misleading. Indeed, if a text contains the terms "*I'm not happy*", then only "*happy*" will be tagged, and its positive score is the opposite of the intended meaning.

```{r}
sentiment %>%
    mutate(date = as.Date(created_at)) %>%
    group_by(date) %>%
    #filter(year(date) == 2024) %>%
    summarise(avg_sentiment = mean(value)) %>%
    ggplot(aes(x = date, y = avg_sentiment)) + geom_col() + theme_bw()
```
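
The negation caveat can be made concrete. The sketch below follows the spirit of the n-gram chapter of the reference book, under stated assumptions: the lexicon `toy_lexicon_neg` and the list `toy_negators` are made up for illustration. Sentences are split into bigrams, the second word is scored, and the score is flipped when the first word is a negator.

```{r, warning = FALSE, message = FALSE}
library(dplyr)
library(tidyr)
library(tidytext)

toy_sentences <- tibble(id = 1:2, content = c("I am happy", "I am not happy"))
toy_lexicon_neg <- tibble(word = "happy", value = 3)   # Made-up one-word lexicon
toy_negators <- c("not", "no", "never")                # Hypothetical list of negators

toy_negation <- toy_sentences %>%
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    inner_join(toy_lexicon_neg, by = c("word2" = "word")) %>%            # Score the second word
    mutate(value = if_else(word1 %in% toy_negators, -value, value))      # Flip the sign after a negator
toy_negation
```

This simple rule only catches a negator placed immediately before the scored word.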


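What about emotions? The **NRC** lexicon categorizes *emotions*. Below, we order the emotions, with the positive ones first and the negative ones last. The main purpose of this ordering is to make the dichotomy between positive & negative emotions easy to exploit later on. 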

```{r, message = FALSE, warning = FALSE}
nrc <- get_sentiments("nrc")
nrc <- nrc %>%
    mutate(sentiment = as.factor(sentiment),
           sentiment = recode_factor(sentiment,
                                     joy = "joy",
                                     trust = "trust",
                                     surprise = "surprise",
                                     anticipation = "anticipation",
                                     positive = "positive",
                                     negative = "negative",
                                     sadness = "sadness",
                                     anger = "anger",
                                     fear = "fear",
                                     disgust = "disgust",
                                     .ordered = TRUE))
nrc
```

We then create the merged dataset.

```{r}
emotions <- tokens_time %>% 
    inner_join(nrc, by = "word",
               relationship = "many-to-many") %>%   # Merge data with sentiment (a word can carry several emotions)
    mutate(date = as.Date(created_at))      # Create day column
emotions                                    # Show the result
```
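
Unlike AFINN, the NRC lexicon can attach several emotions to the same word, so the join may return more rows than there are matched tokens. The sketch below uses a made-up lexicon (`toy_nrc`, whose tags are invented and not taken from NRC) to show this. The argument `relationship = "many-to-many"` requires dplyr 1.1.0 or later and only tells `inner_join()` that the duplication is expected.

```{r, warning = FALSE, message = FALSE}
library(dplyr)

toy_tok <- tibble(id = c(1, 2, 2), word = c("vote", "win", "vote"))
toy_nrc <- tibble(word = c("vote", "vote", "win"),                  # "vote" has two made-up tags
                  sentiment = c("positive", "trust", "joy"))

toy_emotions <- toy_tok %>%
    inner_join(toy_nrc, by = "word", relationship = "many-to-many") # Each "vote" token yields two rows
toy_emotions
```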

The merging has reduced the size of the dataset (only the words found in the lexicon remain), but enough data remains to pursue the study.   
Finally, we build the pivot table that counts emotions for each day.

```{r}
g <- emotions %>% 
    filter(year(date) == 2024) %>%                     # Adjust the year to your sample
    group_by(date, sentiment) %>%
    summarise(intensity = n(), .groups = "drop") %>%
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col() + 
    theme_bw() +                                       # Complete theme first: applied last, it would reset the axis tweaks
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    scale_fill_viridis_d(option = "magma", direction = -1)
ggplotly(g)
```

This can also be shown in percentage format. 

```{r}
g <- emotions %>% 
    filter(year(date) == 2024) %>%                     # Adjust the year to your sample
    group_by(date, sentiment) %>%
    summarise(intensity = n(), .groups = "drop") %>%
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
    theme_bw() +                                       # Complete theme first: applied last, it would reset the axis tweaks
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    scale_fill_viridis_d(option = "magma", direction = -1) + 
    geom_hline(yintercept = 0.5, linetype = 2)
ggplotly(g)
```
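
With `position = "fill"`, each bar is rescaled so that the emotions of a given day sum to one. The same shares can be computed explicitly, as in this sketch on made-up counts (the tibble `toy_counts` is hypothetical):

```{r, warning = FALSE, message = FALSE}
library(dplyr)

toy_counts <- tibble(date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02")),
                     sentiment = c("joy", "fear", "joy", "fear"),
                     intensity = c(3, 1, 2, 2))

toy_shares <- toy_counts %>%
    group_by(date) %>%
    mutate(share = intensity / sum(intensity)) %>%   # Share of each emotion within its day
    ungroup()
toy_shares
```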

```{r}
emotions %>% 
    mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>% 
    group_by(date, sentiment) %>%
    summarise(intensity = n()) %>%
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    geom_hline(yintercept = 0.5) + 
    scale_fill_manual(values = c("#223333", "#FFBB99")) 
```
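
The comparison `sentiment < "negative"` in the chunk above works because the emotions were stored as an *ordered* factor: every level listed before "negative" counts as smaller. A minimal base R sketch on a few hand-picked values (the vector `toy_emotion` is made up, the level order is the one defined earlier):

```{r}
toy_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                "negative", "sadness", "anger", "fear", "disgust")
toy_emotion <- factor(c("joy", "fear", "positive", "negative"),
                      levels = toy_levels, ordered = TRUE)
toy_is_positive <- toy_emotion < "negative"     # TRUE for the levels listed before "negative"
toy_is_positive
```

With an unordered factor, the same comparison would return `NA` with a warning.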




## Advanced sentiment 

The problem with the preceding methods is that they don't take into account **valence shifters** (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). If a toot says *not happy*, counting the word *happy* is not a good idea! The package *sentimentr* is built to circumvent these issues: have a look at https://github.com/trinker/sentimentr  
(see also: https://www.sentometrics.org and the book **Supervised Machine Learning for Text Analysis in R** hosted at https://smltar.com)

I haven't tested **aws.comprehend**, but it seems promising: https://github.com/cloudyr/aws.comprehend

```{r, warning = FALSE, message = FALSE}
if(!require(sentimentr)){install.packages("sentimentr")}
if(!require(textcat)){install.packages("textcat")}
library(sentimentr)
library(textcat)
```

First, let's keep only the toots written in English!

```{r}
# toots_en <- toots %>%
#     mutate(language = textcat(content)) %>%
#     filter(language == "english") %>%
#     dplyr::select(created_at, content)

toots_en <- toots |> filter(language == "en")
```

**NOTE**: the commented code above shows how the function *textcat* can detect the language. It is not needed here: the language is already coded in the toots via the **language** column/variable (it suffices to keep the instances for which language == "en").

Next, we compute advanced sentiment. 

```{r}
tweet_sent <- toots_en$content %>%
    get_sentences() %>%  # Intermediate function
    sentiment()          # Sentiment!
tweet_sent
```
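
To check that valence shifters are indeed taken into account, the sketch below scores two made-up sentences with the same two functions. The negated sentence should receive a negative score, whereas a plain word count would tag "happy" as positive in both cases.

```{r, warning = FALSE, message = FALSE}
library(sentimentr)

toy_phrases <- c("I am happy.", "I am not happy.")     # Made-up sentences
toy_sent <- sentiment(get_sentences(toy_phrases))      # One row per sentence
toy_sent
```

The column `element_id` gives the position of each text in the input vector, and `sentence_id` numbers the sentences within that text.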

**NOTE**: the right time scale (daily or hourly) depends on how many toots are available. If the search term is very popular, higher frequencies (e.g., hourly) become relevant. 

```{r}
toots_en %>%
    rowid_to_column("element_id") # This creates a new column with row number

toots_en %>%
    rowid_to_column("element_id") %>%
    left_join(tweet_sent, by = "element_id")

toots_en %>%
    rowid_to_column("element_id") %>%
    left_join(tweet_sent, by = "element_id") %>%
    group_by(date = make_date(day = day(created_at), 
                              month = month(created_at),
                              year = year(created_at))) %>%
    summarise(avg_sent = mean(sentiment)) %>%
    ggplot(aes(x = date, y = avg_sent)) + geom_col() 

toots_en %>%
    rowid_to_column("element_id") %>%
    left_join(tweet_sent, by = "element_id") %>%
    filter(sentiment != 0) %>%
    ggplot(aes(x = as.factor(hour(created_at)), y = sentiment)) + 
    geom_hline(yintercept = 0) +
    geom_jitter(size = 0.2) +
    geom_boxplot(aes(color = as.factor(hour(created_at))), alpha = 0.5) +
    theme_bw() + 
    theme(legend.position = "none") + xlab("hour") 
```



# Resources

Below, a short list of resources (to access third-party data):   

- **text mining with R** (online book): https://www.tidytextmining.com      
- **Bloomberg**: https://cran.r-project.org/web/packages/Rblpapi/index.html   
- **gmail**: https://cran.r-project.org/web/packages/gmailr/vignettes/gmailr.html   
- **Google Maps**: https://cran.rstudio.com/web/packages/mapsapi/vignettes/intro.html  
- **Google trends**: https://github.com/PMassicotte/gtrendsR
- **Google APIs** (more generally): https://cran.r-project.org/web/packages/gargle/vignettes/auth-from-web.html
- **Facebook API**: developers.facebook.com/ads/blog/post/v2/2018/05/15/facebook-reach-frequency-api/  

Possibly deprecated:  

- **Facebook**: https://cran.r-project.org/web/packages/Rfacebook/index.html    
- **Instagram**: https://cran.r-project.org/web/packages/instaR/index.html

