library(readtext)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(quanteda)
Cleaning texts is tedious but unavoidable work. The most basic pre-processing steps are removing capitalization, numbers, and other kinds of noise introduced during the data acquisition process (e.g., HTML tags left over after scraping).
The first example showcases how to use the stringr package for this purpose. In the example below, we have a \n newline character and a font size HTML tag that we want to get rid of.
text1 <- c("Something is wrong \nBut I don't know what ", " The <font size='6'> bridge is too far")
text1
#> [1] "Something is wrong \nBut I don't know what "
#> [2] " The <font size='6'> bridge is too far"
In str_replace_all(pattern = "<.*?>|\n", replacement = "") we specify a regular expression (regex) in the pattern argument. It tells the function to match either everything enclosed in < > or the newline character \n. Each match is then replaced with an empty string.
The str_to_lower() function converts everything to lower case, and finally str_trim() removes the excess white space from the start and end of the text.
text1 %>%
  str_replace_all(pattern = "<.*?>|\n", replacement = "") %>%
  str_to_lower() %>%
  str_trim()
#> [1] "something is wrong but i don't know what"
#> [2] "the bridge is too far"
This is a very basic example of text pre-processing with the stringr package. For more essentials, and for a quick tutorial on regular expressions in R, see Chapter 14 of R for Data Science.
We use the readtext package to import texts into R. The data consists of the first UN General Assembly speech given by each US president after his inauguration. The readtext() function can read every text document in a given folder using the *.txt wildcard expression. It is a versatile package that can also read texts from URLs, zip archives, and files with unusual encodings.
unga_texts <- readtext("data/unga/*.txt")
glimpse(unga_texts)
#> Rows: 8
#> Columns: 2
#> $ doc_id <chr> "clinton93.txt", "clinton97.txt", "hwbush90.txt", "obama09.t...
#> $ text <chr> "Thank you very much. Mr. President, let me first congratula...
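As noted above, readtext() is not limited to local folders. A hypothetical sketch, assuming a remote zip archive of plain text files exists at the placeholder URL below:
# the URL is a placeholder for illustration; readtext() also accepts URLs and zip archives,
# and an explicit encoding can be supplied via the encoding argument
remote_texts <- readtext("https://example.com/speeches.zip", encoding = "UTF-8")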
Using some string manipulation with the stringr package, we can parse the doc_id to obtain additional document attributes. We clean up the doc_id, then extract the name of the president and the year. The str_extract() function takes all the characters before the first dot, using the regular expression "[^\\.]*". The str_sub() function subsets the string by position: end = -3 drops the last two characters (leaving the president's name), while start = -2 keeps only those two digits. For the year, we take the last two digits, extend them to a full date by appending a month and day, and then extract the year from that date. For this we use the lubridate::ymd() and lubridate::year() functions, as well as str_c() to combine strings.
unga_texts$doc_id <- str_extract(unga_texts$doc_id, "[^\\.]*")
unga_texts$potus <- str_sub(unga_texts$doc_id, end = -3)
unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>%
  str_c("-01-01") %>%
  lubridate::ymd() %>%
  lubridate::year()
glimpse(unga_texts)
#> Rows: 8
#> Columns: 4
#> $ doc_id <chr> "clinton93", "clinton97", "hwbush90", "obama09", "obama13", ...
#> $ text <chr> "Thank you very much. Mr. President, let me first congratula...
#> $ potus <chr> "clinton", "clinton", "hwbush", "obama", "obama", "trump", "...
#> $ year <dbl> 1993, 1997, 1990, 2009, 2013, 2017, 2001, 2005
First we create a corpus from our data frame.
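The corpus itself is a one-liner; a minimal sketch, relying on quanteda's corpus() treating the extra columns (potus, year) as document variables, which is its default behaviour for data frames with doc_id and text columns:
# create a quanteda corpus from the data frame; potus and year become docvars
unga_corpus <- corpus(unga_texts)
# check the document variables attached to the corpus
docvars(unga_corpus)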
At this point we still have all the noise and clutter in our data. Let's clean it! We can pass our corpus object to the tokens function, which will tokenize it. Tokens are going to be our unit of analysis. They can be single words (unigrams) or n-word combinations (n-grams) for more refined analysis; tokens can even be whole sentences. Which tokens we choose should be informed by our research question and by the method appropriate for answering it.
During this step we can remove common words of no interest (referred to as stopwords), numbers, and special characters, transform the text to lowercase, and stem the words. Remember to remove stopwords before stemming! The stopword list itself is not stemmed, so removing stopwords after stemming would miss the stemmed forms in the text.
Example of stopwords:
head(stopwords(language = "english"), 15)
#> [1] "i" "me" "my" "myself" "we"
#> [6] "our" "ours" "ourselves" "you" "your"
#> [11] "yours" "yourself" "yourselves" "he" "him"
Let’s tokenize our corpus.
unga_tok <- tokens(unga_corpus, what = "word", remove_symbols = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem()
# first 20 tokens in the first document
head(unga_tok[[1]], 20)
#> [1] "thank" "much" "mr" "presid"
#> [5] "let" "first" "congratul" "elect"
#> [9] "presid" "general" "assembl" "mr"
#> [13] "secretary-gener" "distinguish" "deleg" "guest"
#> [17] "great" "honor" "address" "stand"
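As mentioned earlier, tokens do not have to be single words. A brief sketch of building bigrams from the same token object (the choice of n = 2 is purely illustrative):
# combine adjacent tokens into bigrams, joined with an underscore
unga_bigrams <- tokens_ngrams(unga_tok, n = 2)
# first few bigrams of the first document
head(unga_bigrams[[1]], 5)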
Most of our analysis will require a document-feature matrix (DFM), in which our tokens are arranged in an \(n \times m\) sparse matrix, where \(n\) is the number of documents and \(m\) is the number of features (tokens). We can do all the pre-processing and normalizing procedures in one step, skipping the tokens function, or simply pass our token object to the dfm function.
unga_dfm <- dfm(unga_corpus, tolower = TRUE, remove = stopwords("english"), stem = TRUE,
                remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
# which is the same as:
dfm(unga_tok)
#> Document-feature matrix of: 8 documents, 2,765 features (68.8% sparse) and 2 docvars.
#> features
#> docs thank much mr presid let first congratul elect general assembl
#> clinton93 2 4 2 10 18 8 1 5 5 4
#> clinton97 1 3 1 3 1 6 0 1 1 6
#> hwbush90 2 6 2 3 3 4 2 3 5 5
#> obama09 1 1 2 4 2 5 0 0 0 3
#> obama13 3 2 2 10 1 1 0 2 1 2
#> trump17 8 3 2 5 7 6 0 3 2 1
#> [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 2,755 more features ]
unga_dfm
#> Document-feature matrix of: 8 documents, 2,765 features (68.8% sparse) and 2 docvars.
#> features
#> docs thank much mr presid let first congratul elect general assembl
#> clinton93 2 4 2 10 18 8 1 5 5 4
#> clinton97 1 3 1 3 1 6 0 1 1 6
#> hwbush90 2 6 2 3 3 4 2 3 5 5
#> obama09 1 1 2 4 2 5 0 0 0 3
#> obama13 3 2 2 10 1 1 0 2 1 2
#> trump17 8 3 2 5 7 6 0 3 2 1
#> [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 2,755 more features ]
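We can confirm the \(n \times m\) dimensions directly on the object; per the summary printed above, this should report 8 documents and 2,765 features:
# number of documents (n) and number of features (m) in the dfm
ndoc(unga_dfm)
nfeat(unga_dfm)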
What are the most frequent features?
topfeatures(unga_dfm, 15)
#> nation world unit peopl must state can peac work new
#> 370 235 223 186 152 146 122 119 103 100
#> countri us secur everi america
#> 95 93 86 84 79
We can get more detailed, grouped information with the textstat_frequency function.
freq <- textstat_frequency(unga_dfm, n = 5, groups = docvars(unga_dfm, "potus"))
freq
#> feature frequency rank docfreq group
#> 1 nation 105 1 2 clinton
#> 2 unit 65 2 2 clinton
#> 3 world 63 3 2 clinton
#> 4 u.n 44 4 2 clinton
#> 5 must 42 5 2 clinton
#> 6 nation 35 1 1 hwbush
#> 7 world 32 2 1 hwbush
#> 8 unit 31 3 1 hwbush
#> 9 new 27 4 1 hwbush
#> 10 year 14 5 1 hwbush
#> 11 nation 80 1 2 obama
#> 12 peopl 68 2 2 obama
#> 13 world 61 3 2 obama
#> 14 peac 56 4 2 obama
#> 15 can 51 5 2 obama
#> 16 nation 66 1 1 trump
#> 17 peopl 50 2 1 trump
#> 18 unit 41 3 1 trump
#> 19 countri 29 4 1 trump
#> 20 world 28 5 1 trump
#> 21 nation 84 1 2 wbush
#> 22 world 51 2 2 wbush
#> 23 unit 41 3 2 wbush
#> 24 must 40 4 2 wbush
#> 25 terrorist 39 5 2 wbush
We can plot this, as it is a nice data frame at this point. A neat little trick here is to use tidytext::reorder_within and tidytext::scale_x_reordered to make sure each faceted panel displays its terms in the correct order.
ggplot(freq, aes(x = tidytext::reorder_within(feature, frequency, group), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency") +
  facet_wrap(~group, scales = "free") +
  tidytext::scale_x_reordered()
We can of course perform all of the above with a trimmed dfm (based on either term frequency or document frequency) and add weights to our features. Trimming the dfm is done with the dfm_trim function, while weighting the features is carried out with dfm_weight and dfm_tfidf.
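A minimal sketch of trimming and an alternative weighting; the min_termfreq and min_docfreq thresholds below are arbitrary choices for illustration:
# keep features occurring at least 5 times overall and in at least 2 documents
unga_dfm_trimmed <- dfm_trim(unga_dfm, min_termfreq = 5, min_docfreq = 2)
# relative (proportional) weighting as an alternative to tf-idf
unga_dfm_prop <- dfm_weight(unga_dfm_trimmed, scheme = "prop")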
unga_tfidf <- dfm_tfidf(unga_dfm)
textstat_frequency(unga_tfidf, groups = "potus", force = TRUE, n = 5)
#> feature frequency rank docfreq group
#> 1 s 4.685656 1 2 clinton
#> 2 peacekeep 4.286520 2 2 clinton
#> 3 adequ 3.612360 3 1 clinton
#> 4 strengthen 3.061800 4 2 clinton
#> 5 half 3.010300 5 1 clinton
#> 6 kuwait 4.515450 1 1 hwbush
#> 7 central 3.612360 2 1 hwbush
#> 8 membership 2.709270 3 1 hwbush
#> 9 bridg 2.408240 4 1 hwbush
#> 10 less 2.408240 4 1 hwbush
#> 11 palestinian 8.519375 1 2 obama
#> 12 syria 7.241468 2 2 obama
#> 13 iranian 7.224720 3 1 obama
#> 14 issu 6.815500 4 2 obama
#> 15 iran 6.815500 4 2 obama
#> 16 sovereignti 6.020600 1 1 trump
#> 17 sovereign 4.685656 2 1 trump
#> 18 applaus 4.515450 3 1 trump
#> 19 venezuela 4.515450 3 1 trump
#> 20 patriot 4.515450 3 1 trump
#> 21 afghan 5.418540 1 1 wbush
#> 22 monterrey 5.418540 1 1 wbush
#> 23 doha 5.418540 1 1 wbush
#> 24 septemb 4.816480 4 2 wbush
#> 25 subsidi 4.515450 5 1 wbush