We are using the UNGA data again.
We carry out the same pre-processing steps that we did previously.
unga_texts <- readtext("data/unga/*.txt")
unga_texts$doc_id <- str_extract(unga_texts$doc_id, "[^\\.]*")
unga_texts$potus <- str_sub(unga_texts$doc_id, end = -3)
unga_texts$year <- str_sub(unga_texts$doc_id, start = -2) %>%
str_c("-01-01") %>%
lubridate::ymd() %>%
lubridate::year()
The wordfish model is implemented in quanteda as textmodel_wordfish()
. We need a dfm for it.
unga_dfm <- corpus(unga_texts) %>%
tokens(remove_symbols = TRUE,
remove_numbers = TRUE,
remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_remove(stopwords("english")) %>%
dfm()
Then we fit our wordfish. Let’s suppose that this is Trump and Obama in their first term.
unga_dfm@docvars
#> docname_ docid_ segid_ potus year
#> 1 clinton93 clinton93 1 clinton 1993
#> 2 clinton97 clinton97 1 clinton 1997
#> 3 hwbush90 hwbush90 1 hwbush 1990
#> 4 obama09 obama09 1 obama 2009
#> 5 obama13 obama13 1 obama 2013
#> 6 trump17 trump17 1 trump 2017
#> 7 wbush01 wbush01 1 wbush 2001
#> 8 wbush05 wbush05 1 wbush 2005
unga_wf <- textmodel_wordfish(unga_dfm, dir = c(4,6))
summary(unga_wf)
#>
#> Call:
#> textmodel_wordfish.dfm(x = unga_dfm, dir = c(4, 6))
#>
#> Estimated Document Positions:
#> theta se
#> clinton93 -1.10003 0.02500
#> clinton97 -1.04103 0.03215
#> hwbush90 -1.17473 0.03290
#> obama09 -0.04328 0.02957
#> obama13 0.66015 0.02287
#> trump17 0.96070 0.02114
#> wbush01 1.34601 0.02235
#> wbush05 0.39220 0.03251
#>
#> Estimated Feature Scores:
#> thank much mr president let first congratulate election
#> beta 0.3811 -0.8840 -0.1289 -0.4142 -0.8387 -0.8058 -2.437 -0.7821
#> psi 0.7574 0.6827 0.5391 1.4066 1.2329 1.1856 -2.841 -0.6615
#> general assembly secretary-general distinguished delegates guests great
#> beta -0.3119 -0.8518 -0.9805 -0.7391 -0.0721 -0.3539 -0.100
#> psi 0.8770 0.7921 0.1260 -0.1757 -0.3822 -0.7842 1.492
#> honor address stand chamber symbolizes 20th century darkest crises
#> beta 0.6090 -0.09688 0.1621 -0.2343 -1.885 -1.4412 -1.1927 -2.535 -1.013
#> psi -0.7775 0.60108 0.9184 -0.5558 -3.012 -0.8568 0.8885 -2.534 -1.683
#> brightest aspirations come american born founding
#> beta -2.437 -0.4204 -0.4238 0.1184 -0.7860 0.048803
#> psi -2.841 -0.3859 1.1177 1.4311 -0.8861 0.005328
Let’s plot it
We can also plot the features and highlight a selected sample of them. We use the margin argument to switch to features instead of the aggregated document level plot, like above. In the below plot, the Estimated beta is the weight of the given word, while the Estimated psi is the word fixed effects. For the interpretation it means that more frequent words should appear in each text because they do not have a political meaning (such as prepositions). So while a common word might have a high fixed effect, it has a weight of zero because of it’s lack of political meaning.
textplot_scale1d(unga_wf, margin = "features",
highlighted = c("terror","sovereignity", "islam", "war", "nuclear", "iran"),
highlighted_color = "orangered2")
We set the reference scores to NA
except of Trump and Obama.
docvars(unga_dfm, "reference_score") <- NA
docvars(unga_dfm, "reference_score")[4] <- 1
docvars(unga_dfm, "reference_score")[6] <- -1
And we fit the model
unga_ws <- textmodel_wordscores(unga_dfm, y = docvars(unga_dfm, "reference_score"), scale = c("linear"),
smooth = 0)
summary(unga_ws, 10)
#>
#> Call:
#> textmodel_wordscores.dfm(x = unga_dfm, y = docvars(unga_dfm,
#> "reference_score"), scale = c("linear"), smooth = 0)
#>
#> Reference Document Statistics:
#> score total min max mean median
#> clinton93 NA 2543 0 45 0.6359 0
#> clinton97 NA 1572 0 35 0.3931 0
#> hwbush90 NA 1429 0 34 0.3573 0
#> obama09 1 2475 0 40 0.6189 0
#> obama13 NA 2860 0 31 0.7152 0
#> trump17 -1 2377 0 55 0.5944 0
#> wbush01 NA 1276 0 23 0.3191 0
#> wbush05 NA 1759 0 42 0.4399 0
#>
#> Wordscores:
#> (showing first 10 elements)
#> thank much mr president
#> -0.7856 -0.5150 -0.0202 -0.1626
#> let first election general
#> -0.5694 -0.1109 -1.0000 -1.0000
#> assembly secretary-general
#> 0.3153 1.0000
unga_ws_pred <- predict(unga_ws, newdata = unga_dfm)
#> Warning: 2034 features in newdata not used in prediction.
textplot_scale1d(unga_ws_pred)