library(readtext)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(quanteda)
Cleaning texts is tedious but unavoidable work. The most basic pre-processing steps are removing capitalization, numbers, and other kinds of noise introduced during the data acquisition process (e.g., HTML tags left over after scraping).
The first example showcases how to use the stringr package for this purpose. In the example below, we have a \n newline character and a font size HTML tag that we want to get rid of.
text1 <- c("Something is wrong \nBut I don't know what ", " The <font size='6'> bridge is too far")
text1
#> [1] "Something is wrong \nBut I don't know what "
#> [2] " The <font size='6'> bridge is too far"
In str_replace_all(pattern = "<.*?>|\n", replacement = "") we specify a regular expression (regex) in the pattern argument. It tells the function to match either everything enclosed in < > or the newline character \n. Each match is then replaced with an empty string.
The str_to_lower() function converts everything to lower case, and finally str_trim() removes the excess white space from the start and end of the text.
text1 %>%
  str_replace_all(pattern = "<.*?>|\n", replacement = "") %>%
  str_to_lower() %>%
  str_trim()
#> [1] "something is wrong but i don't know what"
#> [2] "the bridge is too far"
This is a very basic example of text pre-processing with the stringr package. For more essentials, and for a quick tutorial on regular expressions in R, see Chapter 14 of R for Data Science.
We use the readtext package to import texts into R. The data consists of the first UN General Assembly speech given by each US president after his inauguration. The readtext() function can read every text document in a given folder using the *.txt wildcard expression. It is a versatile package that can also read texts from URLs, zip archives, and files with unusual encodings.
unga_texts <- readtext("data/unga/*.txt")
glimpse(unga_texts)
#> Rows: 8
#> Columns: 2
#> $ doc_id <chr> "clinton93.txt", "clinton97.txt", "hwbush90.txt", "obama09.t...
#> $ text <chr> "Thank you very much. Mr. President, let me first congratula...
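As noted above, readtext() is not limited to local folders. A hypothetical sketch, assuming a remote zip archive of plain text files exists at the placeholder URL below:
# the URL is a placeholder for illustration; readtext() also accepts URLs and zip archives,
# and an explicit encoding can be supplied via the encoding argument
remote_texts <- readtext("https://example.com/speeches.zip", encoding = "UTF-8")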
Using some string manipulation with the stringr package, we can parse the doc_id to obtain additional document attributes. We clean up the doc_id, then extract the name of the president and the year. The str_extract() function takes all the characters before the first dot, using the regular expression "[^\\.]*". The str_sub() function subsets the string by position: end = -3 drops the last two characters (leaving the president's name), while start = -2 keeps only those two digits. For the year, we take the last two digits, extend them to a full date by appending a month and day, and then extract the year from that date. For this we use the lubridate::ymd() and lubridate::year() functions, as well as str_c() to combine strings.
unga_texts$doc_id <- str_extract(unga_texts$doc_id, "[^\\.]*")
unga_texts$potus <- str_sub(unga_texts$doc_id, end = -3)
unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>%
  str_c("-01-01") %>%
  lubridate::ymd() %>%
  lubridate::year()
glimpse(unga_texts)
#> Rows: 8
#> Columns: 4
#> $ doc_id <chr> "clinton93", "clinton97", "hwbush90", "obama09", "obama13", ...
#> $ text <chr> "Thank you very much. Mr. President, let me first congratula...
#> $ potus <chr> "clinton", "clinton", "hwbush", "obama", "obama", "trump", "...
#> $ year <dbl> 1993, 1997, 1990, 2009, 2013, 2017, 2001, 2005
First we create a corpus from our data frame.
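The corpus itself is a one-liner; a minimal sketch, relying on quanteda's corpus() treating the extra columns (potus, year) as document variables, which is its default behaviour for data frames with doc_id and text columns:
# create a quanteda corpus from the data frame; potus and year become docvars
unga_corpus <- corpus(unga_texts)
# check the document variables attached to the corpus
docvars(unga_corpus)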
At this point we still have all the noise and clutter in our data. Let's clean it! We can pass our corpus object to the tokens function, which will tokenize it. Tokens are going to be our unit of analysis. They can be single words (unigrams) or n-word combinations (n-grams) for more refined analysis; tokens can even be whole sentences. Which tokens we choose should be informed by our research question and by the method appropriate for answering it.
During this step we can remove common words of no interest (referred to as stopwords), numbers, and special characters, transform the text to lowercase, and stem the words. Remember to remove stopwords before stemming! The stopword list itself is not stemmed, so removing stopwords after stemming would miss the stemmed forms in the text.
Example of stopwords:
head(stopwords(language = "english"), 15)
#> [1] "i" "me" "my" "myself" "we"
#> [6] "our" "ours" "ourselves" "you" "your"
#> [11] "yours" "yourself" "yourselves" "he" "him"
Let’s tokenize our corpus.
unga_tok <- tokens(unga_corpus, what = "word", remove_symbols = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem()
# first 20 tokens in the first document
head(unga_tok[[1]], 20)
#> [1] "thank" "much" "mr" "presid"
#> [5] "let" "first" "congratul" "elect"
#> [9] "presid" "general" "assembl" "mr"
#> [13] "secretary-gener" "distinguish" "deleg" "guest"
#> [17] "great" "honor" "address" "stand"
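As mentioned earlier, tokens do not have to be single words. A brief sketch of building bigrams from the same token object (the choice of n = 2 is purely illustrative):
# combine adjacent tokens into bigrams, joined with an underscore
unga_bigrams <- tokens_ngrams(unga_tok, n = 2)
# first few bigrams of the first document
head(unga_bigrams[[1]], 5)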
Most of our analysis will require a document-feature matrix (DFM), in which our tokens are arranged in an \(n \times m\) sparse matrix, where \(n\) is the number of documents and \(m\) is the number of features (tokens). We can do all the pre-processing and normalizing procedures in one step, skipping the tokens function, or simply pass our token object to the dfm function.
unga_dfm <- dfm(unga_corpus, tolower = TRUE, remove = stopwords("english"), stem = TRUE,
                remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
# which is the same as:
dfm(unga_tok)
#> Document-feature matrix of: 8 documents, 2,765 features (68.8% sparse) and 2 docvars.
#> features
#> docs thank much mr presid let first congratul elect general assembl
#> clinton93 2 4 2 10 18 8 1 5 5 4
#> clinton97 1 3 1 3 1 6 0 1 1 6
#> hwbush90 2 6 2 3 3 4 2 3 5 5
#> obama09 1 1 2 4 2 5 0 0 0 3
#> obama13 3 2 2 10 1 1 0 2 1 2
#> trump17 8 3 2 5 7 6 0 3 2 1
#> [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 2,755 more features ]
unga_dfm
#> Document-feature matrix of: 8 documents, 2,765 features (68.8% sparse) and 2 docvars.
#> features
#> docs thank much mr presid let first congratul elect general assembl
#> clinton93 2 4 2 10 18 8 1 5 5 4
#> clinton97 1 3 1 3 1 6 0 1 1 6
#> hwbush90 2 6 2 3 3 4 2 3 5 5
#> obama09 1 1 2 4 2 5 0 0 0 3
#> obama13 3 2 2 10 1 1 0 2 1 2
#> trump17 8 3 2 5 7 6 0 3 2 1
#> [ reached max_ndoc ... 2 more documents, reached max_nfeat ... 2,755 more features ]
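We can confirm the \(n \times m\) dimensions directly on the object; per the summary printed above, this should report 8 documents and 2,765 features:
# number of documents (n) and number of features (m) in the dfm
ndoc(unga_dfm)
nfeat(unga_dfm)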
What are the most frequent features?
topfeatures(unga_dfm, 15)
#> nation world unit peopl must state can peac work new
#> 370 235 223 186 152 146 122 119 103 100
#> countri us secur everi america
#> 95 93 86 84 79
We can get more detailed, grouped information with the textstat_frequency function.
freq <- textstat_frequency(unga_dfm, n = 5, groups = docvars(unga_dfm, "potus"))
freq
#> feature frequency rank docfreq group
#> 1 nation 105 1 2 clinton
#> 2 unit 65 2 2 clinton
#> 3 world 63 3 2 clinton
#> 4 u.n 44 4 2 clinton
#> 5 must 42 5 2 clinton
#> 6 nation 35 1 1 hwbush
#> 7 world 32 2 1 hwbush
#> 8 unit 31 3 1 hwbush
#> 9 new 27 4 1 hwbush
#> 10 year 14 5 1 hwbush
#> 11 nation 80 1 2 obama
#> 12 peopl 68 2 2 obama
#> 13 world 61 3 2 obama
#> 14 peac 56 4 2 obama
#> 15 can 51 5 2 obama
#> 16 nation 66 1 1 trump
#> 17 peopl 50 2 1 trump
#> 18 unit 41 3 1 trump
#> 19 countri 29 4 1 trump
#> 20 world 28 5 1 trump
#> 21 nation 84 1 2 wbush
#> 22 world 51 2 2 wbush
#> 23 unit 41 3 2 wbush
#> 24 must 40 4 2 wbush
#> 25 terrorist 39 5 2 wbush
We can plot this, as it is a nice data frame at this point. A neat little trick here is to use tidytext::reorder_within and tidytext::scale_x_reordered to make sure each faceted panel displays its terms in the correct order.
ggplot(freq, aes(x = tidytext::reorder_within(feature, frequency, group), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL,
       y = "Frequency") +
  facet_wrap(~group, scales = "free") +
  tidytext::scale_x_reordered()
We can of course perform all of the above with a trimmed dfm (based on either term frequency or document frequency) and add weights to our features. Trimming the dfm is done with the dfm_trim function, while weighting the features is carried out with dfm_weight and dfm_tfidf.
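A minimal sketch of trimming and an alternative weighting; the min_termfreq and min_docfreq thresholds below are arbitrary choices for illustration:
# keep features occurring at least 5 times overall and in at least 2 documents
unga_dfm_trimmed <- dfm_trim(unga_dfm, min_termfreq = 5, min_docfreq = 2)
# relative (proportional) weighting as an alternative to tf-idf
unga_dfm_prop <- dfm_weight(unga_dfm_trimmed, scheme = "prop")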
unga_tfidf <- dfm_tfidf(unga_dfm)
textstat_frequency(unga_tfidf, groups = "potus", force = TRUE, n = 5)
#> feature frequency rank docfreq group
#> 1 s 4.685656 1 2 clinton
#> 2 peacekeep 4.286520 2 2 clinton
#> 3 adequ 3.612360 3 1 clinton
#> 4 strengthen 3.061800 4 2 clinton
#> 5 half 3.010300 5 1 clinton
#> 6 kuwait 4.515450 1 1 hwbush
#> 7 central 3.612360 2 1 hwbush
#> 8 membership 2.709270 3 1 hwbush
#> 9 bridg 2.408240 4 1 hwbush
#> 10 less 2.408240 4 1 hwbush
#> 11 palestinian 8.519375 1 2 obama
#> 12 syria 7.241468 2 2 obama
#> 13 iranian 7.224720 3 1 obama
#> 14 issu 6.815500 4 2 obama
#> 15 iran 6.815500 4 2 obama
#> 16 sovereignti 6.020600 1 1 trump
#> 17 sovereign 4.685656 2 1 trump
#> 18 applaus 4.515450 3 1 trump
#> 19 venezuela 4.515450 3 1 trump
#> 20 patriot 4.515450 3 1 trump
#> 21 afghan 5.418540 1 1 wbush
#> 22 monterrey 5.418540 1 1 wbush
#> 23 doha 5.418540 1 1 wbush
#> 24 septemb 4.816480 4 2 wbush
#> 25 subsidi 4.515450 5 1 wbush