library(readtext)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(GGally)
library(quanteda)
Let’s work with the UNGA texts again for this session. Load the data and let’s describe it! We’ll work with the packages that we are already familiar with from our intro to R session.
unga_texts <- readtext("data/unga/*.txt")
head(unga_texts)
#> readtext object consisting of 6 documents and 0 docvars.
#> # Description: df[,2] [6 x 2]
#> doc_id text
#> * <chr> <chr>
#> 1 clinton93.txt "\"Thank you \"..."
#> 2 clinton97.txt "\"Mr. Presid\"..."
#> 3 hwbush90.txt "\"Mr. Presid\"..."
#> 4 obama09.txt "\"Mr. Presid\"..."
#> 5 obama13.txt "\"Mr. Presid\"..."
#> 6 trump17.txt "\"Mr. Secret\"..."
For practice, we carry out the same few modifications as we did in the last hands-on session. We also add a party dummy to our data. As opposed to the approach we took in the previous session, we’ll use the mutate() function instead of the $ operator.
unga_texts <- unga_texts %>%
mutate(doc_id = str_extract(doc_id, "[^\\.]*"),
potus = str_sub(doc_id, end = -3),
party = if_else(potus %in% c("obama", "clinton"), "dem", "rep"))
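# the last two characters of doc_id give the year; padding to a "YY-01-01" date
# lets lubridate::ymd() expand it (e.g. "93" becomes 1993, "09" becomes 2009)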
unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>%
str_c("-01-01") %>%
lubridate::ymd() %>%
lubridate::year()
glimpse(unga_texts)
#> Rows: 8
#> Columns: 5
#> $ doc_id <chr> "clinton93", "clinton97", "hwbush90", "obama09", "obama13", ...
#> $ text <chr> "Thank you very much. Mr. President, let me first congratula...
#> $ potus <chr> "clinton", "clinton", "hwbush", "obama", "obama", "trump", "...
#> $ party <chr> "dem", "dem", "rep", "dem", "dem", "rep", "rep", "rep"
#> $ year <dbl> 1993, 1997, 1990, 2009, 2013, 2017, 2001, 2005
Now we will create our corpus object in quanteda and prepare some summary statistics. Tokens are individual words, while types are the unique words in our corpus.
unga_corpus <- corpus(unga_texts)
summary(unga_corpus)
#> Corpus consisting of 8 documents, showing 8 documents:
#>
#> Text Types Tokens Sentences potus party year
#> clinton93 1438 5291 227 clinton dem 1993
#> clinton97 1061 3215 120 clinton dem 1997
#> hwbush90 952 3064 123 hwbush rep 1990
#> obama09 1432 5533 246 obama dem 2009
#> obama13 1566 6128 236 obama dem 2013
#> trump17 1364 5136 220 trump rep 2017
#> wbush01 914 2812 173 wbush rep 2001
#> wbush05 1072 3575 160 wbush rep 2005
If we want to subset our corpus, we should use the corpus_subset() function.
unga_dem <- corpus_subset(unga_corpus, potus %in% c("clinton", "obama"))
summary(unga_dem)
#> Corpus consisting of 4 documents, showing 4 documents:
#>
#> Text Types Tokens Sentences potus party year
#> clinton93 1438 5291 227 clinton dem 1993
#> clinton97 1061 3215 120 clinton dem 1997
#> obama09 1432 5533 246 obama dem 2009
#> obama13 1566 6128 236 obama dem 2013
We can add more descriptive statistics as well. The trick is that the summary() function creates a data frame object for us, which we can then treat as such and use our data manipulation tools.
summary(unga_corpus) %>%
group_by(party) %>%
summarise(mean_wordcount = mean(Tokens), std_dev = sd(Tokens), min_wordc = min(Tokens), max_wordc = max(Tokens))
#> # A tibble: 2 x 5
#> party mean_wordcount std_dev min_wordc max_wordc
#> <chr> <dbl> <dbl> <int> <int>
#> 1 dem 5042. 1268. 3215 6128
#> 2 rep 3647. 1042. 2812 5136
With the textstat_collocations() function we have two general approaches: feed the function the whole corpus, or the tokens that we created (with stopwords removed). With the whole corpus, without any pre-processing, we get the following result.
unga_corpus %>%
textstat_collocations() %>%
head(n = 10)
#> collocation count count_nested length lambda z
#> 1 united nations 115 0 2 5.205054 33.35593
#> 2 united states 96 0 2 6.643355 28.66361
#> 3 we must 76 0 2 4.062952 23.81878
#> 4 it is 52 0 2 3.962046 22.08843
#> 5 those who 30 0 2 5.945595 21.65530
#> 6 we will 73 0 2 3.086902 21.21462
#> 7 human rights 27 0 2 6.589469 20.99931
#> 8 the world 120 0 2 3.088370 20.52294
#> 9 we have 58 0 2 3.140151 19.24971
#> 10 a new 47 0 2 3.818535 18.63462
After tokenization and removing our stopwords:
unga_corpus %>%
tokens() %>%
tokens_select(pattern = stopwords("en"), selection = "remove") %>%
textstat_collocations(size = c(2:3)) %>%
head(n = 10)
#> collocation count count_nested length lambda z
#> 1 united nations 115 109 2 4.525644 28.61105
#> 2 united states 96 91 2 5.992541 24.89221
#> 3 human rights 27 27 2 5.971306 18.53176
#> 4 nuclear weapons 17 17 2 4.975837 15.10074
#> 5 international community 13 13 2 5.524524 13.49280
#> 6 let us 14 14 2 4.891058 13.44072
#> 7 every nation 15 14 2 4.337865 13.34576
#> 8 years ago 17 17 2 8.337951 13.06237
#> 9 general assembly 14 13 2 8.876684 12.70576
#> 10 work together 12 12 2 4.589589 12.63272
A third approach is to see whether a weighted dfm contains particularly highly weighted bigrams or trigrams. This approach seems to give us cruder and less refined collocations.
unga_corpus %>%
tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE) %>%
tokens_select(pattern = stopwords("en"), selection = "remove") %>%
tokens_ngrams(n = 2:3) %>%
dfm() %>%
dfm_tfidf(scheme_tf = "prop") %>%
textstat_frequency(n = 10, force = TRUE)
#> feature frequency rank docfreq group
#> 1 new_partnership 0.001581594 1 1 all
#> 2 every_nation 0.001524471 2 4 all
#> 3 u.n_s 0.001352867 3 3 all
#> 4 let_us 0.001341727 4 3 all
#> 5 monterrey_consensus 0.001280979 5 1 all
#> 6 partnership_nations 0.001265275 6 1 all
#> 7 see_world 0.001265275 6 1 all
#> 8 new_partnership_nations 0.001265275 6 1 all
#> 9 u.n_peacekeeping 0.001243192 9 1 all
#> 10 september_11th 0.001115575 10 2 all
Now we are wondering how lexical diversity differs between some of our documents. To check the lexical diversity measures implemented in the quanteda package, let’s examine the textstat_lexdiv() function. The input needs to be a dfm, so we prepare that as the first step. Again, the result is a data frame, which we can treat accordingly and sort in descending order.
unga_dfm <- unga_corpus %>%
tokens(remove_punct = TRUE, remove_separators = TRUE, remove_hyphens = TRUE) %>%
dfm(remove = stopwords("en"))
#> Warning: 'remove_hyphens' is deprecated, use 'split_hyphens' instead.
unga_dfm %>%
textstat_lexdiv(measure = "CTTR") %>%
arrange(desc(CTTR))
#> document CTTR
#> 1 obama13 17.79937
#> 2 obama09 17.34122
#> 3 clinton93 17.06607
#> 4 trump17 16.93697
#> 5 clinton97 15.89893
#> 6 wbush05 15.00893
#> 7 wbush01 14.90223
#> 8 hwbush90 14.69553
If we are still undecided on which diversity measure we should use, measure = "all" will give us the results for all the implemented measures.
unga_dfm %>%
textstat_lexdiv(measure = "all")
#> document TTR C R CTTR U S K
#> 1 clinton93 0.4778518 0.9058603 24.13507 17.06607 36.18781 0.9193388 27.87049
#> 2 clinton97 0.5642317 0.9223511 22.48448 15.89893 41.22210 0.9305242 28.71822
#> 3 hwbush90 0.5488145 0.9174496 20.78261 14.69553 38.23784 0.9250464 36.19023
#> 4 obama09 0.4913689 0.9091406 24.52418 17.34122 37.38056 0.9220947 23.94006
#> 5 obama13 0.4697075 0.9051027 25.17212 17.79937 36.44132 0.9196394 19.96512
#> 6 trump17 0.4903604 0.9083734 23.95249 16.93697 36.86341 0.9210476 30.07204
#> 7 wbush01 0.5890625 0.9260305 21.07494 14.90223 42.00666 0.9322163 29.50439
#> 8 wbush05 0.5045198 0.9085209 21.22584 15.00893 35.50507 0.9185612 31.35114
#> I D Vm Maas lgV0 lgeV0
#> 1 87.83314 0.002395985 0.04434754 0.1662336 7.285586 16.77568
#> 2 126.50741 0.002243512 0.04190168 0.1557525 7.641484 17.59517
#> 3 93.06822 0.002923712 0.04846004 0.1617161 7.279024 16.76057
#> 4 109.90947 0.001993361 0.03971161 0.1635600 7.413833 17.07098
#> 5 120.36517 0.001648897 0.03542912 0.1656544 7.361389 16.95023
#> 6 85.82445 0.002589178 0.04639508 0.1647033 7.337348 16.89487
#> 7 139.34216 0.002170885 0.04030111 0.1542911 7.623214 17.55310
#> 8 89.31000 0.002571596 0.04489202 0.1678243 7.062165 16.26124
How strongly are these measures correlated with each other?
div_df <- unga_dfm %>%
textstat_lexdiv(measure = "all")
cor(div_df[,2:13])
#> TTR C R CTTR U S
#> TTR 1.0000000 0.99084932 -0.78388824 -0.78388824 0.9035902 0.934078544
#> C 0.9908493 1.00000000 -0.69305774 -0.69305774 0.9523527 0.973628513
#> R -0.7838882 -0.69305774 1.00000000 1.00000000 -0.4441618 -0.510610645
#> CTTR -0.7838882 -0.69305774 1.00000000 1.00000000 -0.4441618 -0.510610645
#> U 0.9035902 0.95235273 -0.44416183 -0.44416183 1.0000000 0.996123924
#> S 0.9340785 0.97362851 -0.51061064 -0.51061064 0.9961239 1.000000000
#> K 0.5285648 0.43768785 -0.82677590 -0.82677590 0.1834886 0.259769880
#> I 0.5730763 0.64516910 -0.07538090 -0.07538090 0.7851674 0.746803970
#> D 0.2588002 0.16627627 -0.63870930 -0.63870930 -0.0791790 -0.004908985
#> Vm 0.1563947 0.07335163 -0.52170702 -0.52170702 -0.1467519 -0.078662406
#> Maas -0.9006288 -0.95073229 0.43657828 0.43657828 -0.9995399 -0.996308267
#> lgV0 0.6046049 0.70596093 0.01960454 0.01960454 0.8869502 0.848760632
#> K I D Vm Maas lgV0
#> TTR 0.5285648 0.5730763 0.258800178 0.15639469 -0.90062881 0.60460492
#> C 0.4376878 0.6451691 0.166276272 0.07335163 -0.95073229 0.70596093
#> R -0.8267759 -0.0753809 -0.638709303 -0.52170702 0.43657828 0.01960454
#> CTTR -0.8267759 -0.0753809 -0.638709303 -0.52170702 0.43657828 0.01960454
#> U 0.1834886 0.7851674 -0.079179004 -0.14675186 -0.99953989 0.88695019
#> S 0.2597699 0.7468040 -0.004908985 -0.07866241 -0.99630827 0.84876063
#> K 1.0000000 -0.3876328 0.953404869 0.90358783 -0.18557369 -0.21560361
#> I -0.3876328 1.0000000 -0.640859479 -0.71366061 -0.77965243 0.83088176
#> D 0.9534049 -0.6408595 1.000000000 0.98788292 0.07446776 -0.41006639
#> Vm 0.9035878 -0.7136606 0.987882916 1.00000000 0.14075816 -0.42393863
#> Maas -0.1855737 -0.7796524 0.074467756 0.14075816 1.00000000 -0.89052779
#> lgV0 -0.2156036 0.8308818 -0.410066390 -0.42393863 -0.89052779 1.00000000
Or in a visual form, using GGally::ggcorr.
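A minimal sketch of that plot, reusing the numeric columns from the correlation call above (the label and label_round arguments simply print the rounded coefficients on the tiles):
# heatmap of the pairwise correlations between the diversity measures
ggcorr(div_df[, 2:13], label = TRUE, label_round = 2)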
We can add the document-level results to our corpus with the docvars() function. The unlist() function creates a vector from the data frame that textstat_lexdiv() returns. After that, we can merge it into our dfm as another covariate.
unga_dfm_lexdiv <- unga_dfm
cttr_score <- unlist(textstat_lexdiv(unga_dfm_lexdiv, measure = "CTTR")[,2])
docvars(unga_dfm_lexdiv, "cttr") <- cttr_score
docvars(unga_dfm_lexdiv)
#> potus party year cttr
#> 1 clinton dem 1993 17.06607
#> 2 clinton dem 1997 15.89893
#> 3 hwbush rep 1990 14.69553
#> 4 obama dem 2009 17.34122
#> 5 obama dem 2013 17.79937
#> 6 trump rep 2017 16.93697
#> 7 wbush rep 2001 14.90223
#> 8 wbush rep 2005 15.00893
Following up on this, let’s check whether lexical diversity translates into complexity as well. We’ll use the textstat_readability() function, which implements (among many others) the Flesch reading ease score and the Flesch-Kincaid readability score (for all of the implemented variations, see the function documentation). We will use the corpus as input for the function.
unga_corpus %>%
textstat_readability(measure = "Flesch.Kincaid")
#> document Flesch.Kincaid
#> 1 clinton93 11.771735
#> 2 clinton97 13.161281
#> 3 hwbush90 11.439626
#> 4 obama09 10.923339
#> 5 obama13 12.965902
#> 6 trump17 11.621825
#> 7 wbush01 8.924193
#> 8 wbush05 11.586582
Let’s add the readability scores to our corpus as well. As the corpus is not a matrix object, we don’t need the unlist() trick to add our result as an additional document-level variable.
docvars(unga_corpus, "f_k") <- textstat_readability(unga_corpus, measure = "Flesch.Kincaid")[,2]
docvars(unga_corpus)
#> potus party year f_k
#> 1 clinton dem 1993 11.771735
#> 2 clinton dem 1997 13.161281
#> 3 hwbush rep 1990 11.439626
#> 4 obama dem 2009 10.923339
#> 5 obama dem 2013 12.965902
#> 6 trump rep 2017 11.621825
#> 7 wbush rep 2001 8.924193
#> 8 wbush rep 2005 11.586582
As a bonus, this allows for easier visualization.
unga_corpus_df <- docvars(unga_corpus)
ggplot(unga_corpus_df, aes(year, f_k, color = party)) +
geom_point(size = 2) +
geom_line(aes(linetype = party), size = 1) +
geom_text(aes(label = potus), color = "black", nudge_y = 0.15) +
scale_x_continuous(breaks = unga_corpus_df$year) +
theme_minimal()
We see considerable variation within parties and even within presidents in terms of readability. Let’s dig deeper and see how similar these speeches actually are. For this we’ll use the textstat_dist() and textstat_simil() functions. As these methods require a matrix input, we’ll plug in our dfm.
unga_dfm %>%
dfm_weight("prop") %>%
textstat_simil(margin = "documents", method = "jaccard")
#> textstat_simil object; method = "jaccard"
#> clinton93 clinton97 hwbush90 obama09 obama13 trump17 wbush01 wbush05
#> clinton93 1.000 0.230 0.193 0.224 0.222 0.225 0.179 0.212
#> clinton97 0.230 1.000 0.209 0.217 0.213 0.209 0.182 0.227
#> hwbush90 0.193 0.209 1.000 0.200 0.188 0.191 0.171 0.198
#> obama09 0.224 0.217 0.200 1.000 0.275 0.220 0.194 0.224
#> obama13 0.222 0.213 0.188 0.275 1.000 0.241 0.197 0.211
#> trump17 0.225 0.209 0.191 0.220 0.241 1.000 0.192 0.223
#> wbush01 0.179 0.182 0.171 0.194 0.197 0.192 1.000 0.221
#> wbush05 0.212 0.227 0.198 0.224 0.211 0.223 0.221 1.000
For distance-based measures, we can experiment with the Euclidean distance.
unga_dfm %>%
textstat_dist(margin = "documents", method = "euclidean")
#> textstat_dist object; method = "euclidean"
#> clinton93 clinton97 hwbush90 obama09 obama13 trump17 wbush01 wbush05
#> clinton93 0 92.2 94.4 101.3 110.9 105.8 110.6 98.2
#> clinton97 92.2 0 66.2 90.0 102.9 98.5 77.8 74.5
#> hwbush90 94.4 66.2 0 98.6 103.6 102.1 79.9 81.0
#> obama09 101.3 90.0 98.6 0 97.4 100.1 100.4 93.0
#> obama13 110.9 102.9 103.6 97.4 0 104.2 109.9 109.4
#> trump17 105.8 98.5 102.1 100.1 104.2 0 104.1 92.4
#> wbush01 110.6 77.8 79.9 100.4 109.9 104.1 0 78.5
#> wbush05 98.2 74.5 81.0 93.0 109.4 92.4 78.5 0
To visualize the distances between the documents, we can draw a dendrogram that shows the various possible pairings (more on this when we look at classification). To get this plot, we convert our output to a distance class, then perform hierarchical clustering, and plot the output of this whole chain.
unga_dist <- unga_dfm %>%
textstat_dist(margin = "documents", method = "euclidean")
plot(hclust(as.dist(unga_dist)))
By modifying the margin = option in both similarity and distance measures, we can look at feature similarity as well. To restrict the comparison to a select few features, we define that selection in the y = option.
unga_dfm %>%
textstat_simil(y = unga_dfm[, c("peace")], margin = "features", method = "correlation") %>%
head(n = 10)
#> peace
#> thank -0.33562905
#> much -0.26351734
#> mr 0.09932929
#> president 0.35195425
#> let 0.10812531
#> first 0.27233231
#> congratulate -0.19841354
#> election -0.23600191
#> general -0.58679602
#> assembly -0.29286832
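If we would rather have these feature correlations ranked than printed as a matrix, one option is to coerce the result to a data frame and sort it. A sketch, assuming the as.data.frame() method for textstat_simil objects available in recent quanteda versions:
# rank features by their correlation with "peace"; the value column is named
# after the method ("correlation" here)
unga_dfm %>%
  textstat_simil(y = unga_dfm[, c("peace")], margin = "features", method = "correlation") %>%
  as.data.frame() %>%
  arrange(desc(correlation)) %>%
  head(n = 10)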
Finally, let’s check how the US presidents speak about terror in the UN.
kwic(unga_corpus, pattern = "terror*", valuetype = "glob", window = 5, case_insensitive = TRUE) %>%
head(20)
#>
#> [clinton93, 798] too many nations. And | terrorism |
#> [clinton93, 864] determined to see that such | terrorists |
#> [clinton93, 1923] oppose everywhere extremism that produces | terrorism |
#> [clinton93, 2603] raising the danger of nuclear | terrorism |
#> [clinton97, 443] to an unholy axis of | terrorists |
#> [clinton97, 681] coalitions with zero tolerance for | terrorism |
#> [clinton97, 1437] interconnected groups that traffic in | terror |
#> [clinton97, 1557] the emerging international consensus that | terrorism |
#> [clinton97, 1575] more countries sign on, | terrorists |
#> [hwbush90, 1370] plundered Kuwait. It has | terrorized |
#> [hwbush90, 2478] to the environment, on | terrorism |
#> [obama09, 1336] status quo. Extremists sowing | terror |
#> [obama09, 2166] of wars and acts of | terror |
#> [obama09, 2835] forge lasting partnerships to target | terrorists |
#> [obama13, 561] to other countries and trying | terrorists |
#> [obama13, 677] In Kenya, we've seen | terrorists |
#> [obama13, 1889] become a safe haven for | terrorists |
#> [obama13, 2351] economy. We will dismantle | terrorist |
#> [obama13, 2384] address the root causes of | terror |
#> [obama13, 2396] defend the United States against | terrorist |
#>
#> , which has taken so
#> are brought to justice.
#> and hate. And we
#> for all nations. We
#> , drug traffickers, and
#> , corruption, crime,
#> , organized crime, and
#> is always a crime and
#> will have fewer places to
#> innocent civilians. It has
#> , on managing the debt
#> in pockets of the world
#> on a scale that we
#> , share intelligence, coordinate
#> in courts of law,
#> target innocent civilians in a
#> . I welcome the influence
#> networks that threaten our people
#> . But when it's necessary
#> attack, we will take
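As a bonus, the same keyword pattern can also be shown as a lexical dispersion plot with textplot_xray(). A minimal sketch (depending on your quanteda version, the function is either bundled with quanteda itself or lives in the quanteda.textplots package):
# lexical dispersion ("x-ray") plot of terror-related terms across the speeches
kwic(unga_corpus, pattern = "terror*", valuetype = "glob", case_insensitive = TRUE) %>%
  textplot_xray()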