15  Text Analysis 1

Published

November 6, 2024

Keywords

text analysis, tokens, bag of words, tidytext, frequency analysis, sentiment analysis

15.1 Introduction

15.1.1 Learning Outcomes

  • Create strategies for analyzing text.
  • Manipulate and analyze text data from a variety of sources using the {tidytext} package for …
    • Frequency Analysis
    • Relationships Among Words
    • Sentiment Analysis
  • Build Word Cloud plots.

15.1.2 References:

15.1.2.1 Other References

15.2 Text Analysis with {tidytext}

Text Mining can be considered as a process for extracting insights from text.

  • Computer-based text mining has been around since the 1950s with automated translation, or the 1940s if you count computer-based code-breaking (see Trying to Break Codes).

The CRAN Task View: Natural Language Processing (NLP) lists over 50 packages focused on gathering, organizing, modeling, and analyzing text.

In addition to text mining or analysis, NLP has multiple areas of research and application.

  1. Machine Translation: translation without any human intervention.
  2. Speech Recognition: Alexa, Hey Google, Siri, … understanding your questions.
  3. Sentiment Analysis: also known as opinion mining or emotion AI.
  4. Question Answering: Alexa, Hey Google, Siri, … answering your questions so you can understand.
  5. Automatic Summarization: Reducing large volumes to meta-data or sensible summaries.
  6. Chat bots: Combinations of 2 and 4 with short-term memory and context for specific domains.
  7. Market Intelligence: Automated analysis of your searches, posts, tweets, ….
  8. Text Classification: Automatically analyze text and then assign a set of pre-defined tags or categories based on its content e.g., organizing and determining relevance of reference material
  9. Character Recognition.
  10. Spelling and Grammar Checking.

Text analysis/natural language processing is a basic technology behind generative AIs (see What is generative AI?).

15.3 Organizing Text for Analysis and Tidy Text Format

There are multiple ways to organize text for analysis:

  • strings: character data in atomic vectors or lists (data frames)
  • corpus: a library of documents structured as strings with associated metadata, e.g., the source book or document
  • Document-Term Matrix (DTM): a matrix with a row for each document and a column for every unique term or word across every document (i.e., across all rows).
    • The entries are generally counts or tf-idf (term frequency-inverse document frequency) scores for the column’s word in the row’s document (see the small sketch after this list).
    • With multiple rows, there are a lot of 0s, so usually stored as a sparse matrix.
    • The Term-Document Matrix (TDM) is the transpose of the DTM.
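
To make the DTM idea concrete, here is a minimal sketch (the two toy “documents” and their words are made up for illustration) that builds a tiny DTM-style table with tidyverse verbs; a real DTM would usually be stored as a sparse matrix.

library(tidyverse)

## two toy "documents" (made up for illustration)
toy <- tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2"),
  word = c("red", "moon", "moon", "red", "fire")
)

## rows = documents, columns = terms, entries = word counts (0 if absent)
toy |>
  count(doc, word) |>
  pivot_wider(names_from = word, values_from = n, values_fill = 0)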

We will focus on organizing bodies of text into Tidy Text Format (TTF).

  • Tidy Text Format requires organizing text into a tibble/data frame with the goal of speeding analysis by allowing use of familiar tidyverse constructs.

In a TTF tibble, the text is organized so as to have one token per row.

  • A token is a meaningful unit of text where you decide what is meaningful to your analysis.
  • A token can be a word, an n-gram (multiple words), a sentence, a paragraph, or even larger units up to whole chapters or books.

The simplest approach is analyzing single words or n-grams without any sense of syntax or order connecting them to each other.

  • This is often called a “Bag of Words” as each token is treated independently of the other tokens in the document; only the counts or tf-idfs matter.

More sophisticated methods now use neural word embeddings, where words are encoded as vectors that attempt to capture (through training) the context provided by other words in the document (usually based on physical or semantic distance).

15.3.1 General Concepts and Language Specifics

We will be only looking at text analysis for the English language.

  • The techniques may be similar for many other Indo-European languages that have a similar structure.

While the concepts we will use apply to other languages, applying them can be more complex.

Research is continuing with other languages, e.g., the release of a multi-lingual version of BERT.

15.4 The {tidytext} package

The {tidytext} package contains many functions to support text mining for word processing and sentiment analysis.

  • It is designed to work well with other tidyverse packages such as {dplyr} and {ggplot2}.
  • Use the console to install the package and then load {tidyverse} and {tidytext}.
library(tidyverse)
library(tidytext)

15.4.1 Let’s Organize Text into Tidy Text Format

Example 1: A famous love poem by Pablo Neruda.

Read in the following text from the first stanza.

text <- c(
  "If You Forget Me",
  "by Pablo Neruda",
  "I want you to know",
  "one thing.",
  "You know how this is:",
  "if I look",
  "at the crystal moon, at the red branch",
  "of the slow autumn at my window,",
  "if I touch",
  "near the fire",
  "the impalpable ash",
  "or the wrinkled body of the log,",
  "everything carries me to you,",
  "as if everything that exists,",
  "aromas, light, metals,",
  "were little boats",
  "that sail",
  "toward those isles of yours that wait for me."
)
text
 [1] "If You Forget Me"                             
 [2] "by Pablo Neruda"                              
 [3] "I want you to know"                           
 [4] "one thing."                                   
 [5] "You know how this is:"                        
 [6] "if I look"                                    
 [7] "at the crystal moon, at the red branch"       
 [8] "of the slow autumn at my window,"             
 [9] "if I touch"                                   
[10] "near the fire"                                
[11] "the impalpable ash"                           
[12] "or the wrinkled body of the log,"             
[13] "everything carries me to you,"                
[14] "as if everything that exists,"                
[15] "aromas, light, metals,"                       
[16] "were little boats"                            
[17] "that sail"                                    
[18] "toward those isles of yours that wait for me."

Let’s get some basic info about our text.

  • Check the length of the vector.
  • Use map_dbl() to check the number of characters in each element.
  • Use map_dbl() to count the number of words in each element.
  • Sum the word counts to get the total number of words.
length(text)
[1] 18
map_dbl(text, str_length)
 [1] 16 15 18 10 21  9 38 32 10 13 18 32 29 29 22 17  9 45
map_dbl(text, ~ str_count(., "\\w+"))
 [1] 4 3 5 2 5 3 8 7 3 3 3 7 5 5 3 3 2 9
sum(map_dbl(text, ~ str_count(., "\\w+")))
[1] 80
  • You get a character vector of length 18 with 80 words in total.
  • Each element has different numbers of words and letters.

This is not a tibble, so it is not in tidy text format with one token per row.

We’ll go through a number of steps to gradually transform the text vector to tidy text format and then clean it so we can analyze it.

15.4.1.1 Convert the text Vector into a Tibble

Convert text into a tibble with two columns:

  • Add a column line with the “line number” from the poem for each row based on the position in the vector.
  • Add a column text with each element of the vector in its own row.
    • Adding a column of indices for each token is a common technique to track the original structure.
text_df <- tibble(
  line = seq_len(length(text)),
  text = text
)
head(text_df, 10)
# A tibble: 10 × 2
    line text                                  
   <int> <chr>                                 
 1     1 If You Forget Me                      
 2     2 by Pablo Neruda                       
 3     3 I want you to know                    
 4     4 one thing.                            
 5     5 You know how this is:                 
 6     6 if I look                             
 7     7 at the crystal moon, at the red branch
 8     8 of the slow autumn at my window,      
 9     9 if I touch                            
10    10 near the fire                         

15.4.1.2 Convert the Tibble into Tidy Text Format with unnest_tokens()

The function unnest_tokens() converts a column of text in a data frame into tidy text format.

  • Look at help for unnest_tokens(), not the older unnest_tokens_().
  • The first argument, tbl, is the input tibble so piping works.
  • The argument order may be counter-intuitive: output comes next, followed by the input column.

Like unnesting list columns, unnest_tokens() splits each element (row) in the column into multiple rows with a single token.

  • The type of token is determined by the token = argument, which recognizes multiple options:
  • “words” (the default), “characters”, “character_shingles”, “ngrams”, “skip_ngrams”, “sentences”, “lines”, “paragraphs”, “regex”, “tweets” (tokenization by word that preserves usernames, hashtags, and URLs), and “ptb” (Penn Treebank).
unnest_tokens(
  tbl = text_df,
  output = word,
  input = text
) |>
  head(10)
# A tibble: 10 × 2
    line word  
   <int> <chr> 
 1     1 if    
 2     1 you   
 3     1 forget
 4     1 me    
 5     2 by    
 6     2 pablo 
 7     2 neruda
 8     3 i     
 9     3 want  
10     3 you   
 # or use the pipe
text_df |>
  unnest_tokens(
    output = word,
    input = text
  ) |>
  head(10)
# A tibble: 10 × 2
    line word  
   <int> <chr> 
 1     1 if    
 2     1 you   
 3     1 forget
 4     1 me    
 5     2 by    
 6     2 pablo 
 7     2 neruda
 8     3 i     
 9     3 want  
10     3 you   
  • This converts the data frame to 80 rows with a one-word token in each row.
  • Punctuation has been stripped.
  • By default, unnest_tokens() converts the tokens to lowercase.
    • Use the argument to_lower = FALSE to retain case.
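
The token = argument handles larger units too. As a minimal sketch, here is the same poem tokenized into bigrams (two-word tokens); the extra n = 2 argument is passed through to the underlying tokenizer.

text_df |>
  unnest_tokens(
    output = bigram,
    input = text,
    token = "ngrams",
    n = 2
  ) |>
  head(5)
## each row now holds a lower-cased two-word token, e.g., "if you", "you forget", ...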

15.4.2 Remove Stop Words with an anti_join() on stop_words

We can see a lot of common words in the text such as “I”, “the”, “and”, “or”, ….

These are called stop words: extremely common words not useful for some types of text analysis.

  • Use data() to load the {tidytext} package’s built-in data frame called stop_words.
  • stop_words draws on three different lexicons to identify 1,149 stop words (see help).

Use anti_join() to remove the stop words (a filtering join that removes all rows from x where there are matching values in y).
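
As a minimal illustration of that filtering-join behavior (the two toy tibbles below are made up for this example):

x <- tibble(word = c("crystal", "the", "moon", "at"))
y <- tibble(word = c("the", "at"))
anti_join(x, y, by = "word")
## keeps only "crystal" and "moon": the rows of x with no match in y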

Save to a new tibble.

  • How many rows are there now?
data(stop_words)
text_df |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |> ## get rid of uninteresting words
  count(word, sort = TRUE) -> ## count of each word left
text_word_count
text_word_count
# A tibble: 26 × 2
   word        n
   <chr>   <int>
 1 aromas      1
 2 ash         1
 3 autumn      1
 4 boats       1
 5 body        1
 6 branch      1
 7 carries     1
 8 crystal     1
 9 exists      1
10 fire        1
# ℹ 16 more rows
nrow(text_word_count) ## note: only 26 rows instead of 80
[1] 26

These are the basic steps to get your text ready for analysis:

  1. Convert text to a tibble, if not already in one, with a column for the text and an index column with row number or other location indicators.
  2. Convert the tibble to Tidy Text format using unnest_tokens() with the appropriate arguments.
  3. Remove stop words if appropriate (sometimes we need to keep them as we will see later).
  4. Save to a new tibble.

15.5 Tidytext Example 2: Jane Austen’s Books and the {janeaustenr} Package

Let’s look at a larger set of text, all six major novels written by Jane Austen in the early 19th century.

The {janeaustenr} package has this text already in a data frame based on the free content in the Project Gutenberg Library.

Use the console to install the package and then load it in your file.

library(janeaustenr)

15.5.1 Get the Data for the Corpus of Six Books and Add Metadata

Use the function austen_books() to access the data frame of the six books.

The data frame has two columns:

  • text contains the text of the novels divided into elements of up to about 70 characters each.
  • book contains the titles of the novels as a factor, with the levels in order of publication.

We want to track the chapters in the books.

Let’s use REGEX to see how the different books indicate their chapters.

austen_books() |>
  head(20)
# A tibble: 20 × 2
   text                                                                    book 
   <chr>                                                                   <fct>
 1 "SENSE AND SENSIBILITY"                                                 Sens…
 2 ""                                                                      Sens…
 3 "by Jane Austen"                                                        Sens…
 4 ""                                                                      Sens…
 5 "(1811)"                                                                Sens…
 6 ""                                                                      Sens…
 7 ""                                                                      Sens…
 8 ""                                                                      Sens…
 9 ""                                                                      Sens…
10 "CHAPTER 1"                                                             Sens…
11 ""                                                                      Sens…
12 ""                                                                      Sens…
13 "The family of Dashwood had long been settled in Sussex.  Their estate" Sens…
14 "was large, and their residence was at Norland Park, in the centre of"  Sens…
15 "their property, where, for many generations, they had lived in so"     Sens…
16 "respectable a manner as to engage the general good opinion of their"   Sens…
17 "surrounding acquaintance.  The late owner of this estate was a single" Sens…
18 "man, who lived to a very advanced age, and who for many years of his"  Sens…
19 "life, had a constant companion and housekeeper in his sister.  But he… Sens…
20 "death, which happened ten years before his own, produced a great"      Sens…

It appears chapters start on their own line.

austen_books() |>
  filter(str_detect(text, "(?i)^chapter")) |> # Case insensitive
  slice_sample(n = 10)
# A tibble: 10 × 2
   text          book               
   <chr>         <fct>              
 1 CHAPTER XVIII Emma               
 2 CHAPTER 22    Sense & Sensibility
 3 Chapter 26    Pride & Prejudice  
 4 CHAPTER 30    Northanger Abbey   
 5 CHAPTER XL    Mansfield Park     
 6 CHAPTER 23    Sense & Sensibility
 7 CHAPTER II    Emma               
 8 CHAPTER 30    Sense & Sensibility
 9 Chapter 15    Persuasion         
10 Chapter 19    Pride & Prejudice  
  • Chapters start with the word “chapter” in either upper or sentence case, followed by a space and then the chapter number in Arabic or Roman numerals.

Let’s add some metadata to keep track of things when we convert to tidy text format.

  • Group by book.
  • Add an index column with a row number for the rows from each book (they are grouped).
  • Add an index column with the number of the chapter.

Use stringr::regex() with argument ignore_case = TRUE.

  • regex() is a {stringr} modifier function with options for how to modify the regex pattern.
  • See help for modifiers. For information on line terminators see Regular-expression constructs.

Save to a new data frame with book, chapter, line number, and text.
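
Before running the full pipeline, here is a quick sketch of how the pattern behaves on a few sample headings (the sample strings are made up for illustration):

str_detect(
  c("CHAPTER 1", "Chapter XVIII", "chapter one"),
  regex("^chapter [\\divxlc]", ignore_case = TRUE)
)
## [1]  TRUE  TRUE FALSE ("one" starts with "o", which the character class excludes)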

austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]",
        ignore_case = TRUE
      )
    )),
    .before = text
  ) |>
  ungroup() |>
  select(book, chapter, linenumber, text) ->
orig_books
head(orig_books)
# A tibble: 6 × 4
  book                chapter linenumber text                   
  <fct>                 <int>      <int> <chr>                  
1 Sense & Sensibility       0          1 "SENSE AND SENSIBILITY"
2 Sense & Sensibility       0          2 ""                     
3 Sense & Sensibility       0          3 "by Jane Austen"       
4 Sense & Sensibility       0          4 ""                     
5 Sense & Sensibility       0          5 "(1811)"               
6 Sense & Sensibility       0          6 ""                     
nrow(orig_books)
[1] 73422
sum(map_dbl(orig_books$text, ~ str_count(., "\\w+")))
[1] 729533

We can now see the book, chapter, and line number for each of the 73,422 text elements with almost 730K (non-unique) individual words.

15.5.2 Convert to Tidy Text Format, Clean, and Sort the Counts

  1. Unnest the text with the tokens being each word.
  2. Clean the words to remove any formatting characters.
    • Project Gutenberg uses pairs of formatting characters, before and after a word, to denote bold or italics, e.g., “_myword_” means myword.
    • We want to extract just the words without any formatting symbols (see the short demo after this list).
  3. Remove stop words.
  4. Save to a new tibble.
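
A quick demonstration of how str_extract() with the pattern "[a-z']+" strips those formatting characters while keeping internal apostrophes (the example strings are made up):

str_extract(c("_fancy_", "don't", "plain"), "[a-z']+")
## [1] "fancy" "don't" "plain"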

Look at the number of rows and the counts for each unique word.

orig_books |>
  unnest_tokens(word, text) |> ## nrow() #725,055
  ## use str_extract to get just the words inside any format encoding
  mutate(word = str_extract(word, "[a-z']+")) |>
  anti_join(stop_words, by = "word") -> ## filter out words in stop_words
tidy_books

nrow(tidy_books)
[1] 216385
tidy_books |>
  count(word)
# A tibble: 13,464 × 2
   word          n
   <chr>     <int>
 1 a'n't         1
 2 abandoned     1
 3 abashed       1
 4 abate         2
 5 abatement     4
 6 abating       1
 7 abbey        71
 8 abbeyland     1
 9 abbeys        2
10 abbots        1
# ℹ 13,454 more rows
length(unique(tidy_books$word))
[1] 13464
tidy_books |>
  count(word, sort = TRUE)
# A tibble: 13,464 × 2
   word       n
   <chr>  <int>
 1 miss    1860
 2 time    1339
 3 fanny    862
 4 dear     822
 5 lady     819
 6 sir      807
 7 day      797
 8 emma     787
 9 sister   727
10 house    699
# ℹ 13,454 more rows
  • There are 216,385 instances of 13,464 unique (non-stop word) words across the six books.

The data are now in tidy text format and ready to analyze!

15.5.3 Plot the Most Common Words

Let’s plot the “most common” words (defined for now as more than 500 occurrences) in descending order by count.

tidy_books |>
  count(word, sort = TRUE) |>
  filter(n > 500) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

  1. Plot the most common words in descending order by count while using color to indicate the counts for each book.
Show code
tidy_books |>
  group_by(book) |>
  count(word, sort = TRUE) |>
  group_by(word) |>
  mutate(word_total = sum(n)) |>
  ungroup() |>
  filter(word_total > 500) |> ## 370
  mutate(word = fct_reorder(word, word_total)) |>
  ggplot(aes(word, n, fill = book)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

  1. Find the words that occur the most in each book but that do not occur in any other book.
    • Hint: Consider using pivot_wider() to create a temporary data frame with the counts for each book.
  • Then, check how many books each word is missing from, and filter to words missing from five of the six books (i.e., appearing in only one book).
    • Hint: consider using the magrittr pipe to be able to use the . pronoun.
  • Then, pivot_longer() to get back to one column with the book names.
Show code
tidy_books |>
  group_by(book) |>
  count(word, sort = TRUE) |>
  ungroup() |>
  pivot_wider(names_from = book, values_from = n) %>% # view()
  mutate(tot_books = is.na(.$`Mansfield Park`) +
    is.na(.$`Sense & Sensibility`) +
    is.na(.$`Pride & Prejudice`) +
    is.na(.$`Emma`) +
    is.na(.$`Northanger Abbey`) +
    is.na(.$`Persuasion`)) |>
  filter(tot_books == 5) |>
  select(-tot_books) |>
  pivot_longer(-word,
    names_to = "book", values_to = "count",
    values_drop_na = TRUE
  ) |>
  group_by(book) |>
  filter(count == max(count)) |>
  arrange(desc(count))
# A tibble: 6 × 3
# Groups:   book [6]
  word     book                count
  <chr>    <chr>               <int>
1 elinor   Sense & Sensibility   623
2 crawford Mansfield Park        493
3 weston   Emma                  389
4 darcy    Pride & Prejudice     374
5 elliot   Persuasion            254
6 tilney   Northanger Abbey      196
Show code
# note Emma occurs once in Persuasion
  • How would you change your code if you did not know how many books there were, or if there were many books?
Show code
## Without knowing how many books or titles
tidy_books |>
  group_by(book) |>
  count(word, sort = TRUE) |>
  ungroup() |>
  pivot_wider(names_from = book, values_from = n) |>
  mutate(across(where(is.numeric), is.na, .names = "na_{ .col}")) |>
  rowwise() |>
  mutate(tot_books = sum(c_across(starts_with("na")))) |>
  ungroup() |> ## have to ungroup after rowwise
  filter(tot_books == max(tot_books)) |>
  select(!(starts_with("na_") | starts_with("tot"))) |>
  pivot_longer(-word,
    names_to = "book", values_to = "count",
    values_drop_na = TRUE
  ) |>
  group_by(book) |>
  filter(count == max(count)) |>
  arrange(desc(count))
# A tibble: 6 × 3
# Groups:   book [6]
  word     book                count
  <chr>    <chr>               <int>
1 elinor   Sense & Sensibility   623
2 crawford Mansfield Park        493
3 weston   Emma                  389
4 darcy    Pride & Prejudice     374
5 elliot   Persuasion            254
6 tilney   Northanger Abbey      196

15.6 Compare Frequencies across Authors

Let’s compare Jane Austen to two other writers:

  • H.G. Wells, a science fiction writer (The Island of Doctor Moreau, The War of the Worlds, The Time Machine, and The Invisible Man).
  • The Bronte Sisters (Jane Eyre, Wuthering Heights, Agnes Grey, The Tenant of Wildfell Hall, and Villette), who are from Jane Austen’s era and genre.

Let’s compare Austen to the others based on how often each used specific words (non-stop words).

As a strategy, consider the following steps:

  1. Identify several books from the two new authors so we have a reasonable data set.
    • Use Project Gutenberg and the {gutenbergr} package.
  2. Download and transform each author’s books into their own tibble in tidy text format.
    • Remove formatting and stop words.
  3. Add author to each tibble and combine into one tibble.
  4. For each author get the relative frequencies of word usage.

Now we have to consider how to get the data into a form that is easy for comparison.

  • Consider using scatter plots to compare Austen against Bronte and then Austen against Wells.
  • That suggests reshaping the data frame so that Austen’s frequencies are in one column and the other authors’ frequencies are in a second column, with an author column identifying Bronte or Wells so we can facet on author.
  • To facilitate the comparison, we can add a geom_abline() where the frequencies are equal.

To complete our strategy:

  1. Reshape the Tibble.
    • Pivot wider to break out each author into three columns.
    • Pivot longer to combine Bronte and Wells into one author column.
  2. Plot the relative frequencies for Austen versus the other author
    • Use a scatter plot.
    • Add a default geom_abline().
    • Facet on author.
  3. Interpret the plots.
  4. Use cor.test() to test the correlations.

15.6.1 Identify works for each new author

15.6.1.1 Project Gutenberg and the {gutenbergr} package

We’ll use Project Gutenberg as our source.

The {gutenbergr} package includes metadata for 70K Project Gutenberg works, so they can be searched and retrieved.

  • These are works in the public domain (published over 95 years ago) that have been digitized and uploaded by volunteers.

Use the console to install the package if necessary and load the library in your file.

  • You will need to use devtools::install_github("ropensci/gutenbergr").
library(gutenbergr)

15.6.1.2 Find the gutenberg_ID for each work

Example: Frankenstein has gutenberg_ID = 84, so use gutenberg_download(84).

To find a work’s gutenberg_ID, use function gutenberg_works().

  • You can search on the “exact title” (as used in Project Gutenberg) or,
  • Look for the author in the gutenberg_authors metadata data frame and then use the gutenberg_author_id to find the work IDs for the author in gutenberg_works().
gutenberg_works() |>
  filter(title == "Wuthering Heights")
# A tibble: 1 × 8
  gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
         <int> <chr>     <chr>                <int> <chr>    <chr>              
1          768 Wutherin… Bront…                 405 en       Best Books Ever Li…
# ℹ 2 more variables: rights <chr>, has_text <lgl>
## or use str_detect
gutenberg_works() |>
  filter(str_detect(title, "Wuthering Heights")) |>
  head()
# A tibble: 2 × 8
  gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
         <int> <chr>     <chr>                <int> <chr>    <chr>              
1          768 "Wutheri… Bront…                 405 en       "Best Books Ever L…
2        40655 "The Key… Malha…               40751 en       ""                 
# ℹ 2 more variables: rights <chr>, has_text <lgl>

As an alternative, find the author’s ID and then the work IDs.

gutenberg_authors[(str_detect(gutenberg_authors$author, "Wells")), ]
# A tibble: 30 × 7
   gutenberg_author_id author        alias birthdate deathdate wikipedia aliases
                 <int> <chr>         <chr>     <int>     <int> <chr>     <chr>  
 1                  30 Wells, H. G.… Well…      1866      1946 https://… Wells,…
 2                 135 Brown, Willi… <NA>         NA      1884 https://… Brown,…
 3                1060 Wells, Carol… Houg…      1862      1942 https://… Hought…
 4                3499 Wells, Phili… <NA>       1868      1929 <NA>      Wells,…
 5                4952 Wells, J. (J… Well…      1855      1929 https://… Wells,…
 6                5122 Dall, Caroli… <NA>       1822      1912 https://… Healey…
 7                5765 Wells-Barnet… <NA>       1862      1931 https://… Wells,…
 8                6158 Hastings, We… Hast…      1879      1923 <NA>      Hastin…
 9                7102 Wells, Frede… <NA>       1874      1929 <NA>      <NA>   
10               32091 Reeder, Char… <NA>       1884        NA <NA>      <NA>   
# ℹ 20 more rows
gutenberg_works(gutenberg_author_id == 30) |>
  arrange(title) |>
  mutate(stitle = str_trunc(title, 40)) |> ## there are some very long titles.
  select(stitle, gutenberg_id) |>
  filter(str_detect(stitle, "Moreau")) ## if there are lots of titles
# A tibble: 1 × 2
  stitle                      gutenberg_id
  <chr>                              <int>
1 The island of Doctor Moreau          159

15.6.2 Download and Convert the texts for Wells and Bronte into TTF

Use gutenberg_download() to download one or more works from Project Gutenberg.

  • Wells’ IDs are: (35, 36, 159, 5230).
  • Bronte’s IDs are: (767, 768, 969, 1260, 9182).

Put each in their own tibble, in Tidy Text format, and remove formatting and stop words.

gutenberg_download(c(35, 36, 159, 5230)) |> ## hgwells
  unnest_tokens(word, text) |>
  mutate(word = str_extract(word, "[a-z']+")) |>
  anti_join(stop_words, by = "word") ->
tidy_hgwells

gutenberg_download(c(767, 768, 969, 1260, 9182)) |> ## brontes
  unnest_tokens(word, text) |>
  mutate(word = str_extract(word, "[a-z']+")) |>
  anti_join(stop_words, by = "word") ->
tidy_bronte

tidy_hgwells |>
  count(word, sort = TRUE)
# A tibble: 10,047 × 2
   word          n
   <chr>     <int>
 1 time        397
 2 people      250
 3 kemp        245
 4 door        226
 5 invisible   197
 6 black       178
 7 hall        176
 8 stood       174
 9 night       170
10 heard       167
# ℹ 10,037 more rows
tidy_bronte |>
  count(word, sort = TRUE)
# A tibble: 21,801 × 2
   word       n
   <chr>  <int>
 1 time    1066
 2 miss     858
 3 day      843
 4 don      786
 5 hand     768
 6 eyes     714
 7 night    661
 8 heart    654
 9 looked   602
10 door     591
# ℹ 21,791 more rows

15.6.3 Add author to each tibble and combine into one tibble

Add the author’s name as a new variable in each tibble.

  • Bind (combine) the three data frames of cleaned words into a single data frame.
  • Get the word counts by author.
  • Create a variable with the relative frequency with which each author uses each word.
  • Drop the count variable n.
  • Save to a new data frame.
bind_rows(
  mutate(tidy_bronte, author = "Bronte"),
  mutate(tidy_hgwells, author = "Wells"),
  mutate(tidy_books, author = "Austen")
) ->
  author_tibble

15.6.4 For each author get the relative frequencies of word usage

author_tibble |>
  count(author, word) |> ## head(20)
  group_by(author) |>
  mutate(proportion = n / sum(n)) |>
  select(-n) |>
  ungroup() ->
freq_by_author_by_word

arrange(freq_by_author_by_word, word) |> 
  slice_sample(n = 10)
# A tibble: 10 × 3
   author word          proportion
   <chr>  <chr>              <dbl>
 1 Bronte judiciously   0.0000279 
 2 Austen throws        0.00000462
 3 Bronte wheedle       0.0000120 
 4 Bronte footlights    0.00000399
 5 Bronte honeyed       0.0000120 
 6 Wells  command       0.0000583 
 7 Bronte adust         0.00000399
 8 Austen indisposition 0.0000693 
 9 Bronte platefuls     0.00000399
10 Wells  basket        0.000155  

We now have each author’s relative frequency for each (non-stop) word.

15.6.5 Reshape the Tibble

We want to compare Austen’s word frequency against each of the others.

Let’s reshape the tibble so Austen is in one column and the other two are in a combined column (so we can facet).

  • Use pivot_wider() to break out each author into their own column.
  • Use pivot_longer() to combine Bronte and Wells into an author column.
  • Save to a tibble called frequency.
  • This gives us two rows per word, one for Bronte and one for Wells.
freq_by_author_by_word |>
  pivot_wider(names_from = author, values_from = proportion) ->
frequency_by_word_across_authors

head(frequency_by_word_across_authors)
# A tibble: 6 × 4
  word          Austen      Bronte     Wells
  <chr>          <dbl>       <dbl>     <dbl>
1 a'n't     0.00000462 NA          NA       
2 abandoned 0.00000462  0.0000917   0.000194
3 abashed   0.00000462  0.0000159  NA       
4 abate     0.00000924  0.0000120  NA       
5 abatement 0.0000185  NA          NA       
6 abating   0.00000462  0.00000797 NA       
frequency_by_word_across_authors |>
  pivot_longer(Bronte:Wells,
    names_to = "author",
    names_ptypes = list(author = factor()),
    values_to = "proportion"
  ) ->
frequency

arrange(frequency, word) |> 
  slice_sample(n = 10)
# A tibble: 10 × 4
   word                 Austen author  proportion
   <chr>                 <dbl> <fct>        <dbl>
 1 forage          NA          Wells  NA         
 2 confined         0.000171   Wells   0.0000583 
 3 dispassionately NA          Wells  NA         
 4 auriculas       NA          Wells  NA         
 5 whistle         NA          Wells   0.0000194 
 6 crow's           0.00000462 Wells  NA         
 7 snows            0.00000462 Bronte  0.0000120 
 8 mistless        NA          Bronte  0.00000399
 9 babioles        NA          Wells  NA         
10 transmute       NA          Bronte  0.00000399

The tibble has each word and the relative frequency for Austen and the other two authors (if it was used by them).

Now we can compare each author to Austen and facet on author.

15.6.6 Plot the relative frequencies for Austen versus the other author

Plot Austen’s proportion on the y axis and the other author’s proportions on the x axis.

  • We’ll use log10 scales for both x and y.
  • Facet by author to break out Wells and Bronte compared to Jane Austen.
  • The {scales} package can help us customize the plot using percent_format().
library(scales)
frequency |>
  filter(!is.na(Austen)) |>
  ggplot(aes(
    x = proportion, y = Austen,
    color = abs(Austen - proportion)
  )) +
  geom_abline(color = "red", lty = 3) +
  geom_jitter(alpha = 0.03, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_colour_gradient(low = "mediumblue", high = "green", na.value = NA) +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL) +
  ggtitle("Relative Word Frequencies Compared to Jane Austen")

15.6.7 Interpret the plots

Words above the y = x abline are ones Austen used more frequently.

Words on the y = x abline are words the authors used with the same frequency.

Words below the y = x abline are words Bronte or Wells used more.

The more linear and narrow the plotting, the more similar the authors in terms of words and their frequency of usage.

It looks like Austen and Bronte are more similar (grouped closer to the line) than Austen and Wells.

It also appears Bronte used far more rare (low-frequency) words than Wells. Why might that be?

  • Consider how many words are in their books.

15.6.8 Compare using a Correlation Test

Extract the word frequencies for Bronte and Wells individually.

  • Use cor.test() to compare Austen to Bronte and then to Wells.

    • Create a helper function to clean up the code.

  • Also compare Wells and Bronte, which we did not plot.
  • Tidy and bind the rows with the estimate and confidence interval.

df_Bronte <- frequency[frequency$author == "Bronte", ]
df_Wells <- frequency[frequency$author == "Wells", ]

test_cor <- \(df) {
  cor.test(data = df, ~ proportion + `Austen`, method = "pearson") |> 
    broom::tidy(conf.int = TRUE)
}

bind_rows(
  test_cor(df_Bronte),
  test_cor(df_Wells),
  cor.test(
    frequency$proportion[frequency$author == "Bronte"],
    frequency$proportion[frequency$author == "Wells"]) |> 
    broom::tidy(conf.int = TRUE)
) |> 
  select(estimate, conf.low, conf.high) |> round(2)
# A tibble: 3 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1     0.76     0.75      0.77
2     0.42     0.4       0.44
3     0.62     0.61      0.64

All three correlations are far from 0. It is interesting that Bronte and Wells are closer than Austen and Wells.

We have just gone through how to organize text into Tidy Text Format with a single token (word) per row in a tibble.

We have also downloaded texts from The Gutenberg Project Library and used frequency analysis for non-stop words to compare multiple authors.

Now we will look at sentiment analysis of blocks of text.

15.7 Sentiment Analysis

15.7.1 Overview

When humans read text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.

  • This is especially true when authors are “showing, not telling” the emotional context.

Sentiment Analysis (also known as opinion mining) uses computer-based text analysis or other methods to identify, extract, quantify, and study affective states and subjective information from text.

  • Commonly used by businesses to analyze customer comments on products or services.

The simplest approach: get the sentiment of each word as an individual token and add the sentiments up across a given block of text.

  • This “bag of words” approach does not take into account word qualifiers or modifiers such as (in English) not, never, always, etc.
  • If we add up the total positive and negative words across many paragraphs, the positive and negative words will tend to cancel each other out.

We are usually better off using tokens at the sentence or paragraph level and adding up positive and negative words at that level of aggregation.

This provides more context than the “bag of words” approach.

15.7.2 Sentiment Lexicons Assign Sentiments to Words (based on “common” usage)

15.7.2.1 Why multiple lexicons?

There are several sentiment lexicons available for use in text analysis.

  • Some are specific to a domain or application.
  • Some focus on specific periods of time, since words change meaning over time due to semantic drift or semantic change, so comparing the sentiments of documents from two different eras may require different sentiment lexicons.
  • This is especially true for spoken or informal writing and over longer periods. See Semantic Changes in the English Language.

15.7.2.2 {tidytext} has functions to access three common lexicons in the {textdata} package

  • bing from Bing Liu and collaborators assigns words as positive or negative.
    • bing is also the sentiments data frame in tidytext.
  • AFINN from Finn Arup Nielsen assigns words values from -5 to +5.
  • nrc from Saif Mohammad and Peter Turney assigns words to one or more of ten categories (eight emotions plus positive and negative).
    • Note: a word may have more than one sentiment and many do …

We usually just pick one of the three for a given analysis.

15.7.2.3 Accessing Sentiments in {tidytext}

We can use get_sentiments() to load the sentiment of interest.

Install the {textdata} package using the console and then load it with library(textdata).

library(textdata)
sentiments |>
  arrange(word) |>
  slice_head(n = 10) # bing
# A tibble: 10 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
get_sentiments("bing") |> slice_sample(n = 10)
# A tibble: 10 × 2
   word        sentiment
   <chr>       <chr>    
 1 ideally     positive 
 2 disgruntled negative 
 3 goodness    positive 
 4 screw-up    negative 
 5 picturesque positive 
 6 boisterous  negative 
 7 retreated   negative 
 8 impassioned positive 
 9 abominably  negative 
10 achievement positive 
get_sentiments("afinn") |> slice_sample(n = 10)
# A tibble: 10 × 2
   word         value
   <chr>        <dbl>
 1 torturing       -4
 2 jeopardy        -2
 3 nervous         -2
 4 glorious         2
 5 repulse         -1
 6 pleasant         3
 7 harming         -2
 8 dehumanizing    -2
 9 fraudulence     -4
10 stunning         4
get_sentiments("nrc") |> slice_sample(n = 10)
# A tibble: 10 × 2
   word        sentiment   
   <chr>       <chr>       
 1 maternal    anticipation
 2 riot        fear        
 3 overpriced  negative    
 4 undoubted   anticipation
 5 university  positive    
 6 stealthy    anticipation
 7 ghetto      negative    
 8 skeptical   negative    
 9 winner      positive    
10 legislature trust       
unique(get_sentiments("nrc")$sentiment) |>
  sort()
 [1] "anger"        "anticipation" "disgust"      "fear"         "joy"         
 [6] "negative"     "positive"     "sadness"      "surprise"     "trust"       
get_sentiments("nrc") |>
  group_by(word) |>
  summarize(nums = n()) |>
  filter(nums > 1) |>
  nrow() / nrow(get_sentiments("nrc")) ## number of multi-sentiment words relative to total word-sentiment rows
[1] 0.2654268
## an extreme case
get_sentiments("nrc") |>
  filter(word == "feeling")
# A tibble: 10 × 2
   word    sentiment   
   <chr>   <chr>       
 1 feeling anger       
 2 feeling anticipation
 3 feeling disgust     
 4 feeling fear        
 5 feeling joy         
 6 feeling negative    
 7 feeling positive    
 8 feeling sadness     
 9 feeling surprise    
10 feeling trust       
nrow(get_sentiments("nrc"))
[1] 13872
get_sentiments("nrc") |>
  select(word) |>
  unique() |>
  nrow()
[1] 6453

15.7.3 Example: Using nrc “Fear” Words

Since the nrc lexicon gives us emotions, we can look at just words labeled as “fear” if we choose.

Let’s get the Jane Austen books into tidy text format.

  • No need to filter out the stop words, as we will be filtering to selected “fear” words, which do not include stop words.
austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]",
        ignore_case = TRUE
      )
    ))
  ) |>
  ungroup() |>
  ## use `word` as the output so the inner_join will match with the nrc lexicon
  unnest_tokens(output = word, input = text) ->
tidy_books

head(tidy_books)
# A tibble: 6 × 4
  book                linenumber chapter word       
  <fct>                    <int>   <int> <chr>      
1 Sense & Sensibility          1       0 sense      
2 Sense & Sensibility          1       0 and        
3 Sense & Sensibility          1       0 sensibility
4 Sense & Sensibility          3       0 by         
5 Sense & Sensibility          3       0 jane       
6 Sense & Sensibility          3       0 austen     

Save only the “fear” words from the nrc lexicon into a new data frame.

Let’s look at just Emma and use an inner_join() to select only those rows in both Emma and the nrc “fear” data frame.

Then let’s count the number of occurrences of the “fear” words in Emma.

get_sentiments("nrc") |>
  filter(sentiment == "fear") ->
nrcfear

tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrcfear, by = "word") |>
  count(word, sort = TRUE)
# A tibble: 364 × 2
   word         n
   <chr>    <int>
 1 doubt       98
 2 ill         72
 3 afraid      65
 4 marry       63
 5 change      61
 6 bad         60
 7 feeling     56
 8 bear        52
 9 creature    39
10 obliging    34
# ℹ 354 more rows

Looking at the words, it is not always clear why a word is a “fear” word; remember that words may have multiple sentiments associated with them in the lexicon.

How many words are associated with the other sentiments in nrc?

get_sentiments("nrc") |>
  group_by(sentiment) |>
  count()
# A tibble: 10 × 2
# Groups:   sentiment [10]
   sentiment        n
   <chr>        <int>
 1 anger         1245
 2 anticipation   837
 3 disgust       1056
 4 fear          1474
 5 joy            687
 6 negative      3316
 7 positive      2308
 8 sadness       1187
 9 surprise       532
10 trust         1230

Plot the number of fear words in each chapter for each Jane Austen book.

  • Consider using scales = "free_x" in facet_wrap().
Show code
tidy_books |>
  inner_join(nrcfear, by = "word") |>
  group_by(book, chapter) |>
  count() ->
fear_chapter

head(fear_chapter)
# A tibble: 6 × 3
# Groups:   book, chapter [6]
  book                chapter     n
  <fct>                 <int> <int>
1 Sense & Sensibility       1    21
2 Sense & Sensibility       2    12
3 Sense & Sensibility       3    21
4 Sense & Sensibility       4    27
5 Sense & Sensibility       5    14
6 Sense & Sensibility       6     8
Show code
fear_chapter |>
  ggplot(aes(chapter, n)) +
  geom_line() +
  facet_wrap(~book, scales = "free_x")

15.7.3.1 Looking at Larger Blocks of Text for Positive and Negative

Let’s break up tidy_books into larger blocks of text, say 80 lines long.

We can use the bing lexicon (either positive or negative) to categorize each word within a block.

  • Recall, the words in tidy_books are in sequential order by line number.

Steps

  • Use inner_join() to filter out words in tidy_books that are not in bing while adding the sentiment column from bing.
  • Use count(), and inside the call, create an index variable identifying the 80-line block of text each word came from, while keeping the book and sentiment variables.
    • Use index = linenumber %/% 80 (see the short demo after this list).
    • Note: most blocks will have far fewer than 80 words since we are only keeping the words that are in bing.
  • Use pivot_wider() on sentiment to get the positive and negative word counts in separate columns and set missing values to 0 with values_fill().
  • Add a column with the difference in overall block sentiment with net = positive - negative
  • Plot the net sentiment across each block and facet by book.
    • Use scales = "free_x" since the books are of different lengths.
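
A quick check of how the integer division creates the block index (the line numbers are made up for illustration):

c(1, 79, 80, 159, 160, 161) %/% 80
## [1] 0 0 1 1 2 2
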
tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment, values_from = n,
    values_fill = list(n = 0)
  ) |>
  mutate(net = positive - negative) ->
janeaustensentiment

janeaustensentiment |>
  ggplot(aes(index, net, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

We can see the books differ in the number and placement of positive versus negative blocks.

15.7.4 Adjusting Sentiment Lexicons

Consider the genre/context for the sentiment words. Do they mean what we think they mean?

  • These are modern lexicons applied to 200-year-old books.

We should probably look at which words contribute most to the positive and negative sentiment and be sure we want to include them as part of the sentiment.

Let’s get the count of the most common words and their sentiment.

tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) ->
bing_word_counts

bing_word_counts
# A tibble: 2,585 × 3
   word     sentiment     n
   <chr>    <chr>     <int>
 1 miss     negative   1855
 2 well     positive   1523
 3 good     positive   1380
 4 great    positive    981
 5 like     positive    725
 6 better   positive    639
 7 enough   positive    613
 8 happy    positive    534
 9 love     positive    495
10 pleasure positive    462
# ℹ 2,575 more rows

We can see “miss” might not be a good fit to consider as a negative word given the context/genre.

Let’s plot the top ten, in order, for each sentiment.

bing_word_counts |>
  group_by(sentiment) |>
  slice_max(order_by = n, n = 10) |>
  ungroup() |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

Something seems “amiss” for Jane Austen novels! “Miss” is probably not a negative word, but rather refers to a young woman.

15.7.4.1 Adjusting an Improper Sentiment: Two Approaches

  1. Take the word “miss” out of the data before doing the analysis (add to the stop words), or,
  2. Change the sentiment lexicon to no longer have “miss” as a negative.

15.7.4.1.1 Approach 1

Remove “miss” from the text by adding it to the stop words data frame and repeating the analysis.

head(stop_words, n = 2)
# A tibble: 2 × 2
  word  lexicon
  <chr> <chr>  
1 a     SMART  
2 a's   SMART  
custom_stop_words <- bind_rows(
  tibble(word = c("miss"), lexicon = c("custom")),
  stop_words
)
## SMART is another lexicon
head(custom_stop_words)
# A tibble: 6 × 2
  word  lexicon
  <chr> <chr>  
1 miss  custom 
2 a     SMART  
3 a's   SMART  
4 able  SMART  
5 about SMART  
6 above SMART  
  • Now let’s redo the analysis with the new stop words.
austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]",
        ignore_case = TRUE
      )
    ))
  ) |>
  ungroup() |>
  ## use word so the inner_join will match with the sentiment lexicons
  unnest_tokens(word, text) |>
  anti_join(custom_stop_words, by = "word") ->
tidy_books_no_miss

tidy_books_no_miss |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) ->
bing_word_counts

head(bing_word_counts)
# A tibble: 6 × 3
  word      sentiment     n
  <chr>     <chr>     <int>
1 happy     positive    534
2 love      positive    495
3 pleasure  positive    462
4 poor      negative    424
5 happiness positive    369
6 comfort   positive    292
bing_word_counts |>
  group_by(sentiment) |>
  slice_max(order_by = n, n = 10) |>
  ungroup() |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

15.7.4.1.2 Approach 2

Remove the word “miss” from the bing sentiment lexicon.

get_sentiments("bing") |>
  filter(word != "miss") ->
bing_no_miss

Redo the Analysis from the beginning:

tidy_books |>
  inner_join(bing_no_miss, by = "word") |>
  count(word, sentiment, sort = TRUE) ->
bing_word_counts

bing_word_counts
# A tibble: 2,584 × 3
   word     sentiment     n
   <chr>    <chr>     <int>
 1 well     positive   1523
 2 good     positive   1380
 3 great    positive    981
 4 like     positive    725
 5 better   positive    639
 6 enough   positive    613
 7 happy    positive    534
 8 love     positive    495
 9 pleasure positive    462
10 poor     negative    424
# ℹ 2,574 more rows
## visualize it
bing_word_counts |>
  group_by(sentiment) |>
  slice_max(order_by = n, n = 10) |>
  ungroup() |>
  mutate(word = fct_reorder(word, n)) |> #
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

15.7.4.2 Repeat the Plot by Chapter

Original and “No Miss” plots.

  • We’ll use the {patchwork} package to put side by side.
library(patchwork)
## Original
tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment, values_from = n,
    values_fill = list(n = 0)
  ) |>
  mutate(net = positive - negative) ->
janeaustensentiment

# No Miss
tidy_books |>
  inner_join(bing_no_miss, by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment, values_from = n,
    values_fill = list(n = 0)
  ) |>
  mutate(net = positive - negative) ->
janeaustensentiment2

janeaustensentiment |>
  ggplot(aes(index, net, fill = book)) +
  geom_col(show.legend = FALSE) +
  ggtitle("With Miss as Negative") +
  facet_wrap(~book, ncol = 2, scales = "free_x") -> p1


janeaustensentiment2 |>
  ggplot(aes(index, net, fill = book)) +
  geom_col(show.legend = FALSE) +
  ggtitle("Without Miss as Negative") +
  facet_wrap(~book, ncol = 2, scales = "free_x") -> p2


p1 + p2

Compare the average net difference in sentiment in the two cases.

janeaustensentiment |>
  summarize(means = mean(net, na.rm = TRUE)) |>
  bind_rows(
    (janeaustensentiment2 |>
      summarize(means = mean(net, na.rm = TRUE))
    )
  )
# A tibble: 2 × 1
  means
  <dbl>
1  9.71
2 11.7 
  • Notice some minor variations in several places (e.g., Emma, block 110), and the average net sentiment is about 2 points more positive.

We have used a bag of words sentiment analysis and a larger block of text (80 lines) to characterize Jane Austen’s books.

We have also adjusted the lexicon to remove words that appeared inappropriate for the context/genre.

15.7.5 {tidytext} Plotting Functions for Ordering within Facets

We were able to reorder the words above when we were just faceting by sentiment.

If we wanted to see the top five words by book and sentiment, instead of just overall across books, we could count by book and sentiment and then facet on both.

tidy_books |>
  inner_join(bing_no_miss, by = "word") |>
  count(word, sentiment, book, sort = TRUE) ->
bing_word_counts
head(bing_word_counts)
# A tibble: 6 × 4
  word  sentiment book                    n
  <chr> <chr>     <fct>               <int>
1 well  positive  Emma                  401
2 good  positive  Emma                  359
3 good  positive  Mansfield Park        326
4 well  positive  Mansfield Park        324
5 great positive  Emma                  264
6 well  positive  Sense & Sensibility   240
## visualize it
bing_word_counts |>
  group_by(book, sentiment) |>
  slice_max(order_by = n, n = 5) |>
  ungroup() |>
  mutate(word = fct_reorder(parse_factor(word), n)) |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(sentiment ~ book, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

Notice the words are now different for each book but are all in the same order, without regard to how often they appear in each book.

  • All the scales are the same so the negative words are compressed compared to the more common positive words.

There are two new functions in the {tidytext} package to create a different look.

  • reorder_within(), inside a mutate, allows you to reorder each word by the faceted book and sentiment based on the count.
  • scale_x_reordered() will then update the x axis to accommodate the new orders.

Use scales = "free" inside the facet_wrap() to allow both x and y scales to vary for each part of the facet.

bing_word_counts |>
  group_by(book, sentiment) |>
  slice_max(order_by = n, n = 5) |>
  ungroup() |>
  mutate(word = reorder_within(word, n, book)) |>
  ungroup() |>
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(sentiment ~ book, scales = "free") +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

15.7.6 Analyzing Sentences and Chapters

The sentiment analysis we just did was based on single words and so did not consider the presence of modifiers such as “not” which tend to flip the context.

15.7.6.1 Example: Sentence Level

Consider the data set prideprejudice which has the complete text divided into elements of up to about 70 characters each.

If the unit for tokenizing is n-grams, skip_ngrams, sentences, lines, paragraphs, or regex, unnest_tokens() will collapse the entire input together before tokenizing unless collapse = FALSE.
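
A minimal sketch of that collapsing behavior (the two-row example is made up): a sentence split across rows is reassembled before sentence tokenization.

tibble(text = c("It is a truth universally", "acknowledged. Or is it?")) |>
  unnest_tokens(sentence, text, token = "sentences")
## returns two sentences, even though the first was split across two rows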

Let’s add a chapter variable and also add a period after the chapter number (so each chapter heading becomes its own sentence).

  • unnest_tokens() separates sentences at periods, so as a small clean-up we will remove the periods after Mr., Mrs., and Dr., in addition to separating the chapter headings.
tibble(text = prideprejudice) |>
  mutate(
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    )),
    text = str_replace(text, "(Chapter \\d+)", "\\1\\."),
    text = str_replace_all(text, "((Mr)|(Mrs)|(Dr))\\.", "\\1")
  ) |>
  unnest_tokens(sentence, text, token = "sentences") ->
PandP_sentences

Now we have our tokens as “complete” sentences. We have more cleaning and reshaping to do.

  • Let’s add sentence numbers and unnest at the “word” level.
  • Add sentiments using bing.
  • We can get rid of the cover page (Chapter 0).
  • We’ll count() the number of positive and negative words per sentence.
  • As before, we will pivot_wider() to break out the sentiments.
  • Now we can use case_when() to create a score for each sentence (see the quick check after this list):
    • 1 for more positive words than negative,
    • 0 for same numbers of positive and negative, and,
    • -1 for more negative words than positive in the sentence.
  • Finally, let’s summarize by chapter as an average score per sentence (the total score divided by the number of sentences in the chapter).
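
A quick check of the case_when() scoring above: it is equivalent to taking the sign of the positive-minus-negative difference (the net counts here are made up for illustration).

sign(c(3, 0, -2))
## [1]  1  0 -1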

Now we can create a line plot of sentiment score by chapter to see a view of the story arc.

PandP_sentences |>
  mutate(sentence_number = row_number()) |>
  unnest_tokens(word, sentence) |>
  inner_join(get_sentiments("bing"), by = "word") |>
  filter(chapter > 0) |>
  count(chapter, sentence_number, sentiment) |> ## view()
  pivot_wider(
    names_from = sentiment, values_from = n,
    values_fill = list(n = 0)
  ) |> # view()
  mutate(sentence_sent = positive - negative) |>
  mutate(sentence_sent = case_when(
    sentence_sent > 0 ~ 1,
    sentence_sent == 0 ~ 0,
    sentence_sent < 0 ~ -1
  )) |>
  group_by(chapter) |>
  summarize(
    chap_sent_per = sum(sentence_sent) / n(),
    .groups = "keep"
  ) |> # view()
  ggplot(aes(chapter, chap_sent_per)) +
  geom_line() +
  ggtitle("Sentence Sentiment Score per Chapter") +
  ylab("Score / Total Sentences in a Chapter") +
  xlab("Chapter") +
  geom_hline(yintercept = 0, color = "red", alpha = .4, lty = 2) +
  scale_x_continuous(limits = c(1, 61)) +
  geom_rug(sides = "b")

15.7.6.2 Example: Chapter Level

Consider all the Austen books.

  • Look for the most negative chapters based on number of words in the chapter.
  • Take out the word “miss”.
get_sentiments("bing") |>
  filter(sentiment == "negative") |>
  filter(word != "miss") ->
bingnegative

tidy_books |>
  group_by(book, chapter) |>
  summarize(words = n(), .groups = "drop") ->
wordcounts

tidy_books |>
  semi_join(bingnegative, by = "word") |>
  group_by(book, chapter) |>
  summarize(negativewords = n(), .groups = "drop") |>
  left_join(wordcounts, by = c("book", "chapter")) |>
  mutate(ratio = negativewords / words) |>
  filter(chapter != 0) |>
  ungroup() |>
  group_by(book) |>
  slice_max(order_by = ratio) |>
  ungroup()
# A tibble: 6 × 5
  book                chapter negativewords words  ratio
  <fct>                 <int>         <int> <int>  <dbl>
1 Sense & Sensibility      43           156  3405 0.0458
2 Pride & Prejudice        34           111  2104 0.0528
3 Mansfield Park           46           161  3685 0.0437
4 Emma                     16            81  1894 0.0428
5 Northanger Abbey         21           143  2982 0.0480
6 Persuasion                4            62  1807 0.0343

These are the chapters with the most negative words in each book, normalized for the number of words in the chapter.

What is happening in these chapters?

  • In Chapter 43 of Sense and Sensibility Marianne is seriously ill, near death.
  • In Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!).
  • In Chapter 46 of Mansfield Park, almost the end, everyone learns of Henry’s scandalous adultery.
  • In Chapter 16 of Emma, she is back at Hartfield after her ride with Mr. Elton, and Emma plunges into self-recrimination as she looks back over the past weeks.
  • In Chapter 21 of Northanger Abbey, Catherine is deep in her Gothic faux-fantasy of murder, etc.
  • In Chapter 4 of Persuasion, the reader gets the full flashback of Anne refusing Captain Wentworth and sees how sad she was and now realizes it was a terrible mistake.

We have seen multiple ways to use sentiment analysis in single words and large blocks of text to analyze the flow of sentiment within and across large works of text.

The same concepts and techniques can work for analyzing Reddit comments, tweets, Yelp reviews, etc.

15.8 Word Cloud Plots

Word clouds are a popular graphical method for displaying word frequency in a non-statistical way; they can be useful for identifying the most frequent words in a document.

The {wordcloud} package (Fellows (2018)) uses base R graphics to create Word Clouds.

  • It includes functions to create “commonality clouds” or “comparison clouds” for comparing words across multiple documents.

Install the package using the console and load into your file.

Let’s create a word cloud of tidy_books without the stop words.

library(wordcloud)
tidy_books |>
  anti_join(stop_words, by = "word") |>
  count(word) |>
  with(wordcloud(word, n, max.words = 30))

## Custom Stop Words - no miss
tidy_books |>
  anti_join(custom_stop_words, by = "word") |>
  count(word) |>
  with(wordcloud(word, n, max.words = 30))

Word Clouds are popular and you can make them, but should you?

As an alternative, consider the ChatterPlot.

A chatterplot conveys more information about word frequency than font size alone can.

Let’s try it with the top 50 Jane Austen words by sentiment and book.

library(ggrepel) ## to help words "repel" each other
tidy_books |>
  inner_join(bing_no_miss, by = "word") |>
  count(book, word, sentiment, sort = TRUE) |>
  mutate(proportion = n / sum(n)) |>
  group_by(sentiment) |>
  slice_max(order_by = n, n = 50) |>
  ungroup() ->
tempp
tempp |>
  ggplot(aes(book, proportion, label = word)) +
  ## ggrepel geom, make arrows transparent, color by rank, size by n
  geom_text_repel(
    segment.alpha = 0,
    aes(
      color = sentiment, size = proportion,
      ## fontface = as.numeric(as.factor(book))
    ),
    max.overlaps = 50
  ) +
  ## set word size range & turn off legend
  scale_size_continuous(range = c(3, 6), guide = "none") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 50 Words by Sentiment in Each Book")

At times, you may be asked to create a Word Cloud and it is straightforward to do so. However, it really only provides a visual display where the top few words can be seen and comparisons may be difficult.