library(readtext)
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(GGally)
library(quanteda)

Let’s work with the UNGA texts again for this session. Load the data and let’s describe it! We’ll work with the packages that we are already familiar with from our intro to R session.

1 Data prep

unga_texts <- readtext("data/unga/*.txt")

head(unga_texts)
#> readtext object consisting of 6 documents and 0 docvars.
#> # Description: df[,2] [6 x 2]
#>   doc_id        text               
#> * <chr>         <chr>              
#> 1 clinton93.txt "\"Thank you \"..."
#> 2 clinton97.txt "\"Mr. Presid\"..."
#> 3 hwbush90.txt  "\"Mr. Presid\"..."
#> 4 obama09.txt   "\"Mr. Presid\"..."
#> 5 obama13.txt   "\"Mr. Presid\"..."
#> 6 trump17.txt   "\"Mr. Secret\"..."

For practice, we carry out the same few modifications as we did in the last hands-on session. We also add a party dummy to our data. As opposed to the approach we took in the previous session, we’ll use the mutate function instead of the $ method.


unga_texts <- unga_texts %>% 
    mutate(doc_id = str_extract(doc_id, "[^\\.]*"),
           potus = str_sub(doc_id, end = -3),
           party = if_else(potus %in% c("obama", "clinton"), "dem", "rep"))



unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>% 
    str_c("-01-01") %>% 
    lubridate::ymd() %>% 
    lubridate::year()

glimpse(unga_texts)
#> Rows: 8
#> Columns: 5
#> $ doc_id <chr> "clinton93", "clinton97", "hwbush90", "obama09", "obama13", ...
#> $ text   <chr> "Thank you very much. Mr. President, let me first congratula...
#> $ potus  <chr> "clinton", "clinton", "hwbush", "obama", "obama", "trump", "...
#> $ party  <chr> "dem", "dem", "rep", "dem", "dem", "rep", "rep", "rep"
#> $ year   <dbl> 1993, 1997, 1990, 2009, 2013, 2017, 2001, 2005
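Note that we still created the year variable with the $ method above; just for illustration, the same step could also live inside the mutate() call. A sketch that is equivalent to the $ assignment, assuming (as before) that the two-digit suffix of doc_id encodes the year:

unga_texts <- unga_texts %>% 
    mutate(year = str_sub(doc_id, start = -2) %>% 
               str_c("-01-01") %>% 
               lubridate::ymd() %>% 
               lubridate::year())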

2 Simple descriptives

Now we will create our corpus object in quanteda and prepare some summary statistics. Tokens are individual words; types are unique words in our corpus.

unga_corpus <- corpus(unga_texts)

summary(unga_corpus)
#> Corpus consisting of 8 documents, showing 8 documents:
#> 
#>       Text Types Tokens Sentences   potus party year
#>  clinton93  1438   5291       227 clinton   dem 1993
#>  clinton97  1061   3215       120 clinton   dem 1997
#>   hwbush90   952   3064       123  hwbush   rep 1990
#>    obama09  1432   5533       246   obama   dem 2009
#>    obama13  1566   6128       236   obama   dem 2013
#>    trump17  1364   5136       220   trump   rep 2017
#>    wbush01   914   2812       173   wbush   rep 2001
#>    wbush05  1072   3575       160   wbush   rep 2005

If we want to subset our corpus, we should use the corpus_subset function.

unga_dem <- corpus_subset(unga_corpus, potus %in% c("clinton", "obama"))

summary(unga_dem)
#> Corpus consisting of 4 documents, showing 4 documents:
#> 
#>       Text Types Tokens Sentences   potus party year
#>  clinton93  1438   5291       227 clinton   dem 1993
#>  clinton97  1061   3215       120 clinton   dem 1997
#>    obama09  1432   5533       246   obama   dem 2009
#>    obama13  1566   6128       236   obama   dem 2013

We can add more descriptive statistics as well. The trick is that the summary function creates a data frame for us, which we can then treat as such and apply our usual data manipulation tools.

summary(unga_corpus) %>% 
    group_by(party) %>% 
    summarise(mean_wordcount = mean(Tokens), std_dev = sd(Tokens), min_wordc = min(Tokens), max_wordc = max(Tokens))
#> # A tibble: 2 x 5
#>   party mean_wordcount std_dev min_wordc max_wordc
#>   <chr>          <dbl>   <dbl>     <int>     <int>
#> 1 dem            5042.   1268.      3215      6128
#> 2 rep            3647.   1042.      2812      5136
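Since the summary output also contains the number of sentences, we can use the same data frame to compute, for example, the average sentence length per party (a quick sketch building on the Tokens and Sentences columns shown above):

summary(unga_corpus) %>% 
    mutate(avg_sentence_length = Tokens / Sentences) %>% 
    group_by(party) %>% 
    summarise(mean_sentence_length = mean(avg_sentence_length))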

3 Collocations with quanteda

With the textstat_collocations function we have two general approaches: feed the function the corpus directly, or the tokens that we created (with stopwords removed). With the whole corpus, without any pre-processing, we get the following result.

unga_corpus %>% 
    textstat_collocations() %>% 
    head(n = 10)
#>       collocation count count_nested length   lambda        z
#> 1  united nations   115            0      2 5.205054 33.35593
#> 2   united states    96            0      2 6.643355 28.66361
#> 3         we must    76            0      2 4.062952 23.81878
#> 4           it is    52            0      2 3.962046 22.08843
#> 5       those who    30            0      2 5.945595 21.65530
#> 6         we will    73            0      2 3.086902 21.21462
#> 7    human rights    27            0      2 6.589469 20.99931
#> 8       the world   120            0      2 3.088370 20.52294
#> 9         we have    58            0      2 3.140151 19.24971
#> 10          a new    47            0      2 3.818535 18.63462

After tokenization and removing our stopwords:

unga_corpus %>% 
    tokens() %>% 
    tokens_select(pattern = stopwords("en"), selection = "remove") %>%
    textstat_collocations(size = c(2:3)) %>% 
    head(n = 10)
#>                collocation count count_nested length   lambda        z
#> 1           united nations   115          109      2 4.525644 28.61105
#> 2            united states    96           91      2 5.992541 24.89221
#> 3             human rights    27           27      2 5.971306 18.53176
#> 4          nuclear weapons    17           17      2 4.975837 15.10074
#> 5  international community    13           13      2 5.524524 13.49280
#> 6                   let us    14           14      2 4.891058 13.44072
#> 7             every nation    15           14      2 4.337865 13.34576
#> 8                years ago    17           17      2 8.337951 13.06237
#> 9         general assembly    14           13      2 8.876684 12.70576
#> 10           work together    12           12      2 4.589589 12.63272

A third approach is to see whether a weighted dfm contains particularly highly weighted bigrams or trigrams. This approach seems to give us cruder and less refined collocations.

unga_corpus %>% 
    tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE) %>% 
    tokens_select(pattern = stopwords("en"), selection = "remove") %>%
    tokens_ngrams(n = 2:3) %>% 
    dfm() %>% 
    dfm_tfidf(scheme_tf = "prop") %>% 
    textstat_frequency(n = 10, force = TRUE)
#>                    feature   frequency rank docfreq group
#> 1          new_partnership 0.001581594    1       1   all
#> 2             every_nation 0.001524471    2       4   all
#> 3                    u.n_s 0.001352867    3       3   all
#> 4                   let_us 0.001341727    4       3   all
#> 5      monterrey_consensus 0.001280979    5       1   all
#> 6      partnership_nations 0.001265275    6       1   all
#> 7                see_world 0.001265275    6       1   all
#> 8  new_partnership_nations 0.001265275    6       1   all
#> 9         u.n_peacekeeping 0.001243192    9       1   all
#> 10          september_11th 0.001115575   10       2   all

4 Lexical diversity

Now we are wondering how lexical diversity differs between some of our documents. To check the lexical diversity measures implemented in the quanteda package, let’s examine the textstat_lexdiv function. The input needs to be a dfm, so we prepare that as the first step. Again, the result is a data frame, which we can treat accordingly and sort in decreasing order.

unga_dfm <- unga_corpus %>% 
    tokens(remove_punct = TRUE, remove_separators = TRUE, split_hyphens = TRUE) %>% 
    tokens_remove(stopwords("en")) %>% 
    dfm()

unga_dfm %>% 
    textstat_lexdiv(measure = "CTTR") %>% 
    arrange(desc(CTTR))
#>    document     CTTR
#> 1   obama13 17.79937
#> 2   obama09 17.34122
#> 3 clinton93 17.06607
#> 4   trump17 16.93697
#> 5 clinton97 15.89893
#> 6   wbush05 15.00893
#> 7   wbush01 14.90223
#> 8  hwbush90 14.69553

If we are still undecided on which diversity measure to use, setting measure = "all" will give us the results for all the implemented measures.

unga_dfm %>% 
    textstat_lexdiv(measure = "all")
#>    document       TTR         C        R     CTTR        U         S        K
#> 1 clinton93 0.4778518 0.9058603 24.13507 17.06607 36.18781 0.9193388 27.87049
#> 2 clinton97 0.5642317 0.9223511 22.48448 15.89893 41.22210 0.9305242 28.71822
#> 3  hwbush90 0.5488145 0.9174496 20.78261 14.69553 38.23784 0.9250464 36.19023
#> 4   obama09 0.4913689 0.9091406 24.52418 17.34122 37.38056 0.9220947 23.94006
#> 5   obama13 0.4697075 0.9051027 25.17212 17.79937 36.44132 0.9196394 19.96512
#> 6   trump17 0.4903604 0.9083734 23.95249 16.93697 36.86341 0.9210476 30.07204
#> 7   wbush01 0.5890625 0.9260305 21.07494 14.90223 42.00666 0.9322163 29.50439
#> 8   wbush05 0.5045198 0.9085209 21.22584 15.00893 35.50507 0.9185612 31.35114
#>           I           D         Vm      Maas     lgV0    lgeV0
#> 1  87.83314 0.002395985 0.04434754 0.1662336 7.285586 16.77568
#> 2 126.50741 0.002243512 0.04190168 0.1557525 7.641484 17.59517
#> 3  93.06822 0.002923712 0.04846004 0.1617161 7.279024 16.76057
#> 4 109.90947 0.001993361 0.03971161 0.1635600 7.413833 17.07098
#> 5 120.36517 0.001648897 0.03542912 0.1656544 7.361389 16.95023
#> 6  85.82445 0.002589178 0.04639508 0.1647033 7.337348 16.89487
#> 7 139.34216 0.002170885 0.04030111 0.1542911 7.623214 17.55310
#> 8  89.31000 0.002571596 0.04489202 0.1678243 7.062165 16.26124

How much are these measures correlated with each other?

div_df <- unga_dfm %>% 
    textstat_lexdiv(measure = "all")


cor(div_df[,2:13])
#>             TTR           C           R        CTTR          U            S
#> TTR   1.0000000  0.99084932 -0.78388824 -0.78388824  0.9035902  0.934078544
#> C     0.9908493  1.00000000 -0.69305774 -0.69305774  0.9523527  0.973628513
#> R    -0.7838882 -0.69305774  1.00000000  1.00000000 -0.4441618 -0.510610645
#> CTTR -0.7838882 -0.69305774  1.00000000  1.00000000 -0.4441618 -0.510610645
#> U     0.9035902  0.95235273 -0.44416183 -0.44416183  1.0000000  0.996123924
#> S     0.9340785  0.97362851 -0.51061064 -0.51061064  0.9961239  1.000000000
#> K     0.5285648  0.43768785 -0.82677590 -0.82677590  0.1834886  0.259769880
#> I     0.5730763  0.64516910 -0.07538090 -0.07538090  0.7851674  0.746803970
#> D     0.2588002  0.16627627 -0.63870930 -0.63870930 -0.0791790 -0.004908985
#> Vm    0.1563947  0.07335163 -0.52170702 -0.52170702 -0.1467519 -0.078662406
#> Maas -0.9006288 -0.95073229  0.43657828  0.43657828 -0.9995399 -0.996308267
#> lgV0  0.6046049  0.70596093  0.01960454  0.01960454  0.8869502  0.848760632
#>               K          I            D          Vm        Maas        lgV0
#> TTR   0.5285648  0.5730763  0.258800178  0.15639469 -0.90062881  0.60460492
#> C     0.4376878  0.6451691  0.166276272  0.07335163 -0.95073229  0.70596093
#> R    -0.8267759 -0.0753809 -0.638709303 -0.52170702  0.43657828  0.01960454
#> CTTR -0.8267759 -0.0753809 -0.638709303 -0.52170702  0.43657828  0.01960454
#> U     0.1834886  0.7851674 -0.079179004 -0.14675186 -0.99953989  0.88695019
#> S     0.2597699  0.7468040 -0.004908985 -0.07866241 -0.99630827  0.84876063
#> K     1.0000000 -0.3876328  0.953404869  0.90358783 -0.18557369 -0.21560361
#> I    -0.3876328  1.0000000 -0.640859479 -0.71366061 -0.77965243  0.83088176
#> D     0.9534049 -0.6408595  1.000000000  0.98788292  0.07446776 -0.41006639
#> Vm    0.9035878 -0.7136606  0.987882916  1.00000000  0.14075816 -0.42393863
#> Maas -0.1855737 -0.7796524  0.074467756  0.14075816  1.00000000 -0.89052779
#> lgV0 -0.2156036  0.8308818 -0.410066390 -0.42393863 -0.89052779  1.00000000

Or, in visual form, using GGally::ggcorr:

ggcorr(div_df[,2:13], label = TRUE)

We can add the document-level results to our dfm with the docvars function. The unlist function creates a vector from the data frame that textstat_lexdiv returns; after that, we can merge it into our dfm as another covariate.

unga_dfm_lexdiv <- unga_dfm

cttr_score <- unlist(textstat_lexdiv(unga_dfm_lexdiv, measure = "CTTR")[,2])

docvars(unga_dfm_lexdiv, "cttr") <- cttr_score

docvars(unga_dfm_lexdiv)
#>     potus party year     cttr
#> 1 clinton   dem 1993 17.06607
#> 2 clinton   dem 1997 15.89893
#> 3  hwbush   rep 1990 14.69553
#> 4   obama   dem 2009 17.34122
#> 5   obama   dem 2013 17.79937
#> 6   trump   rep 2017 16.93697
#> 7   wbush   rep 2001 14.90223
#> 8   wbush   rep 2005 15.00893
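Since docvars() returns a regular data frame, the new covariate plugs straight back into our dplyr workflow, for example (a quick sketch):

docvars(unga_dfm_lexdiv) %>% 
    group_by(party) %>% 
    summarise(mean_cttr = mean(cttr))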

5 Readability and complexity

Following up on this, let’s check whether lexical diversity translates into complexity as well. We’ll use the textstat_readability function, which implements (amongst many others) the Flesch reading ease score and the Flesch-Kincaid readability score. (For all of the implemented variations, see the function documentation.)

We will use the corpus as input for the function.

unga_corpus %>% 
    textstat_readability(measure = "Flesch.Kincaid")
#>    document Flesch.Kincaid
#> 1 clinton93      11.771735
#> 2 clinton97      13.161281
#> 3  hwbush90      11.439626
#> 4   obama09      10.923339
#> 5   obama13      12.965902
#> 6   trump17      11.621825
#> 7   wbush01       8.924193
#> 8   wbush05      11.586582

Let’s add the readability scores to our corpus as well. As the corpus is not a matrix object, we don’t need the unlist trick to add our result as an additional document-level variable.

docvars(unga_corpus, "f_k") <- textstat_readability(unga_corpus, measure = "Flesch.Kincaid")[,2]

docvars(unga_corpus)
#>     potus party year       f_k
#> 1 clinton   dem 1993 11.771735
#> 2 clinton   dem 1997 13.161281
#> 3  hwbush   rep 1990 11.439626
#> 4   obama   dem 2009 10.923339
#> 5   obama   dem 2013 12.965902
#> 6   trump   rep 2017 11.621825
#> 7   wbush   rep 2001  8.924193
#> 8   wbush   rep 2005 11.586582

As a bonus this allows for easier visualization.

unga_corpus_df <- docvars(unga_corpus)

ggplot(unga_corpus_df, aes(year, f_k, color = party)) +
    geom_point(size = 2) +
    geom_line(aes(linetype = party), size = 1) +
    geom_text(aes(label = potus), color = "black", nudge_y = 0.15) +
    scale_x_continuous(breaks = unga_corpus_df$year) +
    theme_minimal()

6 Similarity measures

We see considerable variation in readability within parties and even within presidents. Let’s dig deeper and see how similar these speeches actually are. For this we’ll use the textstat_dist and textstat_simil functions. As these methods require a matrix input, we’ll plug in our dfm.

unga_dfm %>% 
    dfm_weight("prop") %>% 
    textstat_simil(margin = "documents", method = "jaccard") 
#> textstat_simil object; method = "jaccard"
#>           clinton93 clinton97 hwbush90 obama09 obama13 trump17 wbush01 wbush05
#> clinton93     1.000     0.230    0.193   0.224   0.222   0.225   0.179   0.212
#> clinton97     0.230     1.000    0.209   0.217   0.213   0.209   0.182   0.227
#> hwbush90      0.193     0.209    1.000   0.200   0.188   0.191   0.171   0.198
#> obama09       0.224     0.217    0.200   1.000   0.275   0.220   0.194   0.224
#> obama13       0.222     0.213    0.188   0.275   1.000   0.241   0.197   0.211
#> trump17       0.225     0.209    0.191   0.220   0.241   1.000   0.192   0.223
#> wbush01       0.179     0.182    0.171   0.194   0.197   0.192   1.000   0.221
#> wbush05       0.212     0.227    0.198   0.224   0.211   0.223   0.221   1.000
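If the matrix printout is awkward to work with, the similarity object can also be coerced into a long data frame of document pairs. A sketch, assuming the as.data.frame method available in recent quanteda versions, where the value column is named after the method:

unga_dfm %>% 
    dfm_weight("prop") %>% 
    textstat_simil(margin = "documents", method = "jaccard") %>% 
    as.data.frame() %>% 
    arrange(desc(jaccard)) %>%   # assumes the value column is called "jaccard"
    head()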

For distance-based measures, we can experiment with the Euclidean distance.

unga_dfm %>%
    textstat_dist(margin = "documents", method = "euclidean")
#> textstat_dist object; method = "euclidean"
#>           clinton93 clinton97 hwbush90 obama09 obama13 trump17 wbush01 wbush05
#> clinton93         0      92.2     94.4   101.3   110.9   105.8   110.6    98.2
#> clinton97      92.2         0     66.2    90.0   102.9    98.5    77.8    74.5
#> hwbush90       94.4      66.2        0    98.6   103.6   102.1    79.9    81.0
#> obama09       101.3      90.0     98.6       0    97.4   100.1   100.4    93.0
#> obama13       110.9     102.9    103.6    97.4       0   104.2   109.9   109.4
#> trump17       105.8      98.5    102.1   100.1   104.2       0   104.1    92.4
#> wbush01       110.6      77.8     79.9   100.4   109.9   104.1       0    78.5
#> wbush05        98.2      74.5     81.0    93.0   109.4    92.4    78.5       0

To visualize the distances between the documents, we can draw a dendrogram that shows various possible pairings (more on this when we look at classification). To get this plot, we convert our output to a distance class, perform hierarchical clustering, and plot the result of this whole chain.

unga_dist <- unga_dfm %>%
    textstat_dist(margin = "documents", method = "euclidean")

plot(hclust(as.dist(unga_dist)))

By modifying the margin = option in both the similarity and distance measures, we can look at feature similarity as well. To focus on a select few features, we define that selection in the y = option.

unga_dfm %>% 
    textstat_simil(y = unga_dfm[, c("peace")], margin = "features", method = "correlation") %>% 
    head(n = 10)
#>                    peace
#> thank        -0.33562905
#> much         -0.26351734
#> mr            0.09932929
#> president     0.35195425
#> let           0.10812531
#> first         0.27233231
#> congratulate -0.19841354
#> election     -0.23600191
#> general      -0.58679602
#> assembly     -0.29286832

7 KWIC exploration

Finally, let’s check how the US presidents speak about terror in the UN.

kwic(unga_corpus, pattern = "terror*", valuetype = "glob", window = 5, case_insensitive = TRUE) %>% 
    head(20)
#>                                                                            
#>   [clinton93, 798]                     too many nations. And | terrorism  |
#>   [clinton93, 864]               determined to see that such | terrorists |
#>  [clinton93, 1923] oppose everywhere extremism that produces | terrorism  |
#>  [clinton93, 2603]             raising the danger of nuclear | terrorism  |
#>   [clinton97, 443]                      to an unholy axis of | terrorists |
#>   [clinton97, 681]        coalitions with zero tolerance for | terrorism  |
#>  [clinton97, 1437]     interconnected groups that traffic in |   terror   |
#>  [clinton97, 1557] the emerging international consensus that | terrorism  |
#>  [clinton97, 1575]                   more countries sign on, | terrorists |
#>   [hwbush90, 1370]                  plundered Kuwait. It has | terrorized |
#>   [hwbush90, 2478]                    to the environment, on | terrorism  |
#>    [obama09, 1336]             status quo. Extremists sowing |   terror   |
#>    [obama09, 2166]                       of wars and acts of |   terror   |
#>    [obama09, 2835]      forge lasting partnerships to target | terrorists |
#>     [obama13, 561]             to other countries and trying | terrorists |
#>     [obama13, 677]                      In Kenya, we've seen | terrorists |
#>    [obama13, 1889]                   become a safe haven for | terrorists |
#>    [obama13, 2351]                economy. We will dismantle | terrorist  |
#>    [obama13, 2384]                address the root causes of |   terror   |
#>    [obama13, 2396]          defend the United States against | terrorist  |
#>                                   
#>  , which has taken so             
#>  are brought to justice.          
#>  and hate. And we                 
#>  for all nations. We              
#>  , drug traffickers, and          
#>  , corruption, crime,             
#>  , organized crime, and           
#>  is always a crime and            
#>  will have fewer places to        
#>  innocent civilians. It has       
#>  , on managing the debt           
#>  in pockets of the world          
#>  on a scale that we               
#>  , share intelligence, coordinate 
#>  in courts of law,                
#>  target innocent civilians in a   
#>  . I welcome the influence        
#>  networks that threaten our people
#>  . But when it's necessary        
#>  attack, we will take
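A kwic object behaves like a data frame, so we can, for instance, count how often each speech mentions a terror-related term (a sketch assuming the docname column of the kwic output):

kwic(unga_corpus, pattern = "terror*", valuetype = "glob", window = 5, case_insensitive = TRUE) %>% 
    as.data.frame() %>% 
    count(docname, sort = TRUE)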