library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)   # textplot_scale1d() lives here in recent quanteda versions
library(readtext)
library(dplyr)
library(stringr)

We are using the UNGA data again.

We carry out the same pre-processing steps as before: read in the speeches, clean up the document IDs, and extract the president and year from the file names.

# read in the UNGA speeches
unga_texts <- readtext("data/unga/*.txt")

# drop the file extension from the document id (keep everything before the dot)
unga_texts$doc_id <- str_extract(unga_texts$doc_id, "[^\\.]*")

# the id has the form "<president><2-digit year>": everything but the last two
# characters is the president...
unga_texts$potus <- str_sub(unga_texts$doc_id, end = -3)

# ...and the last two characters are the year, which we expand to a full year
unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>% 
    str_c("-01-01") %>% 
    lubridate::ymd() %>% 
    lubridate::year()

1 Wordfish

The wordfish model is implemented in the quanteda.textmodels package as textmodel_wordfish(). It takes a document-feature matrix (dfm) as input, so we build one first.

unga_dfm <- corpus(unga_texts) %>% 
    tokens(remove_symbols = TRUE,
           remove_numbers = TRUE,
           remove_punct = TRUE) %>% 
    tokens_tolower() %>% 
    tokens_remove(stopwords("english")) %>%
    dfm()
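
Before fitting, a quick sanity check of the dfm does no harm; ndoc(), nfeat() and topfeatures() are standard quanteda helpers for this.

ndoc(unga_dfm)    # number of documents
nfeat(unga_dfm)   # number of features after stopword removal
topfeatures(unga_dfm, 10)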

Then we fit our wordfish model. The dir argument pins down the direction of the latent scale: it names a pair of documents whose estimated positions must be ordered, with the first placed to the left of the second. We want to anchor the scale with Obama's and Trump's first-term speeches, so we first check where those documents sit in the dfm.

unga_dfm@docvars
#>    docname_    docid_ segid_   potus year
#> 1 clinton93 clinton93      1 clinton 1993
#> 2 clinton97 clinton97      1 clinton 1997
#> 3  hwbush90  hwbush90      1  hwbush 1990
#> 4   obama09   obama09      1   obama 2009
#> 5   obama13   obama13      1   obama 2013
#> 6   trump17   trump17      1   trump 2017
#> 7   wbush01   wbush01      1   wbush 2001
#> 8   wbush05   wbush05      1   wbush 2005
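
Instead of counting rows in the table above, the anchor positions can also be looked up programmatically; a small sketch using quanteda's docnames():

# positions of the two anchor documents (rows 4 and 6 in the table above)
which(docnames(unga_dfm) %in% c("obama09", "trump17"))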

unga_wf <- textmodel_wordfish(unga_dfm, dir = c(4, 6))  # place obama09 (4) to the left of trump17 (6)
summary(unga_wf)
#> 
#> Call:
#> textmodel_wordfish.dfm(x = unga_dfm, dir = c(4, 6))
#> 
#> Estimated Document Positions:
#>              theta      se
#> clinton93 -1.10003 0.02500
#> clinton97 -1.04103 0.03215
#> hwbush90  -1.17473 0.03290
#> obama09   -0.04328 0.02957
#> obama13    0.66015 0.02287
#> trump17    0.96070 0.02114
#> wbush01    1.34601 0.02235
#> wbush05    0.39220 0.03251
#> 
#> Estimated Feature Scores:
#>       thank    much      mr president     let   first congratulate election
#> beta 0.3811 -0.8840 -0.1289   -0.4142 -0.8387 -0.8058       -2.437  -0.7821
#> psi  0.7574  0.6827  0.5391    1.4066  1.2329  1.1856       -2.841  -0.6615
#>      general assembly secretary-general distinguished delegates  guests  great
#> beta -0.3119  -0.8518           -0.9805       -0.7391   -0.0721 -0.3539 -0.100
#> psi   0.8770   0.7921            0.1260       -0.1757   -0.3822 -0.7842  1.492
#>        honor  address  stand chamber symbolizes    20th century darkest crises
#> beta  0.6090 -0.09688 0.1621 -0.2343     -1.885 -1.4412 -1.1927  -2.535 -1.013
#> psi  -0.7775  0.60108 0.9184 -0.5558     -3.012 -0.8568  0.8885  -2.534 -1.683
#>      brightest aspirations    come american    born founding
#> beta    -2.437     -0.4204 -0.4238   0.1184 -0.7860 0.048803
#> psi     -2.841     -0.3859  1.1177   1.4311 -0.8861 0.005328
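
If you want the estimated positions for your own tables or plots, they can be pulled out of the fitted object. A minimal sketch, assuming the object stores them in the theta, se.theta and docs components (as the quanteda.textmodels wordfish object does):

wf_positions <- data.frame(
    doc   = unga_wf$docs,      # document names
    theta = unga_wf$theta,     # estimated positions
    se    = unga_wf$se.theta   # standard errors (assumed component name)
)

wf_positions %>% arrange(theta)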

Let’s plot the estimated document positions.

textplot_scale1d(unga_wf)
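
textplot_scale1d() also takes a groups argument, so we can group the documents by president, for example:

textplot_scale1d(unga_wf, groups = docvars(unga_dfm, "potus"))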

We can also plot the features and highlight a selected sample of them. The margin argument switches from the aggregated document-level plot above to the feature level. In the plot below, the estimated beta is the weight of a given word (how strongly it discriminates on the latent dimension), while the estimated psi is the word’s fixed effect, which captures how frequent the word is overall. The interpretation is that very frequent words tend to appear in every text precisely because they carry no political meaning (prepositions, for example): a common word can therefore have a high fixed effect but a weight close to zero because of its lack of political meaning. (The same estimates can also be inspected numerically, as sketched after the plot.)

textplot_scale1d(unga_wf, margin = "features",
                 highlighted = c("terror", "sovereignty", "islam", "war", "nuclear", "iran"),
                 highlighted_color = "orangered2")
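
To look at the same estimates numerically, we can pull the feature-level results out of the fitted object and sort by the absolute weight; again a sketch, assuming the features, beta and psi components of the wordfish object:

wf_features <- data.frame(
    feature = unga_wf$features,
    beta    = unga_wf$beta,
    psi     = unga_wf$psi
)

wf_features %>% 
    arrange(desc(abs(beta))) %>%   # most discriminating words first
    head(15)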

2 Wordscores

Wordscores needs reference texts with known positions. We set the reference score to NA for every document except Obama’s 2009 speech (scored +1) and Trump’s 2017 speech (scored -1), which serve as our reference texts.

docvars(unga_dfm, "reference_score") <- NA

docvars(unga_dfm, "reference_score")[4] <- 1   # obama09
docvars(unga_dfm, "reference_score")[6] <- -1  # trump17
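
Hard-coded row positions break easily if the corpus changes, so an equivalent, name-based version of the same assignment (a sketch using docnames()) may be safer:

docvars(unga_dfm, "reference_score")[docnames(unga_dfm) == "obama09"] <- 1
docvars(unga_dfm, "reference_score")[docnames(unga_dfm) == "trump17"] <- -1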

Then we fit the model, passing the reference scores as y.

unga_ws <- textmodel_wordscores(unga_dfm, y = docvars(unga_dfm, "reference_score"),
                                scale = c("linear"), smooth = 0)


summary(unga_ws, 10)
#> 
#> Call:
#> textmodel_wordscores.dfm(x = unga_dfm, y = docvars(unga_dfm, 
#>     "reference_score"), scale = c("linear"), smooth = 0)
#> 
#> Reference Document Statistics:
#>           score total min max   mean median
#> clinton93    NA  2543   0  45 0.6359      0
#> clinton97    NA  1572   0  35 0.3931      0
#> hwbush90     NA  1429   0  34 0.3573      0
#> obama09       1  2475   0  40 0.6189      0
#> obama13      NA  2860   0  31 0.7152      0
#> trump17      -1  2377   0  55 0.5944      0
#> wbush01      NA  1276   0  23 0.3191      0
#> wbush05      NA  1759   0  42 0.4399      0
#> 
#> Wordscores:
#> (showing first 10 elements)
#>             thank              much                mr         president 
#>           -0.7856           -0.5150           -0.0202           -0.1626 
#>               let             first          election           general 
#>           -0.5694           -0.1109           -1.0000           -1.0000 
#>          assembly secretary-general 
#>            0.3153            1.0000
unga_ws_pred <- predict(unga_ws, newdata = unga_dfm)
#> Warning: 2034 features in newdata not used in prediction.

textplot_scale1d(unga_ws_pred)
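
The warning above simply tells us that words which never occur in the two reference texts receive no wordscore and are ignored when the remaining speeches are scored. The predictions can also be requested with standard errors, which textplot_scale1d() then draws as confidence intervals; a sketch along the lines of the quanteda documentation, assuming the se.fit argument of predict() for wordscores models:

unga_ws_pred_se <- predict(unga_ws, newdata = unga_dfm, se.fit = TRUE)
textplot_scale1d(unga_ws_pred_se)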