The sample contains 7 randomly sampled speeches by Hungarian prime ministers from the period after the 1989 democratization. This version contains only the doc_id and text variables, without any preprocessing, so it is well suited to showing how to extract metadata from filenames and how to clean the text variable. The cleaned version is data_miniszterelnokok. The dataset is used in Chapter 5 of the book (https://tankonyv.poltextlab.com/leiro-stat.html).

data_miniszterelnokok_raw

Format

A data.frame with 7 observations and 2 variables:

doc_id

A unique document id (here, the filename), with '_' as the separator. Pattern: lastname_firstname_year.txt

text

The unprocessed speech text
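
Because doc_id follows the lastname_firstname_year.txt pattern, the metadata can be recovered by splitting on the '_' separator. The following is a minimal base R sketch, not necessarily the approach used in the book; the doc_id values and the meta data frame are hypothetical and only illustrate the documented pattern, and the sketch assumes that neither name part contains an underscore.

```r
# Hypothetical doc_id values that follow the documented pattern
# (illustrative only, not taken from the dataset).
doc_id <- c("antall_jozsef_1990.txt", "horn_gyula_1994.txt")

# Drop the ".txt" extension, then split each id on the '_' separator.
parts <- strsplit(sub("\\.txt$", "", doc_id), "_", fixed = TRUE)

# Collect the three components into separate metadata columns.
meta <- data.frame(
  doc_id    = doc_id,
  lastname  = vapply(parts, `[`, character(1), 1),
  firstname = vapply(parts, `[`, character(1), 2),
  year      = as.integer(vapply(parts, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)

print(meta)
```

The same steps should apply to data_miniszterelnokok_raw$doc_id in place of the hypothetical vector.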

Source

https://cap.tk.hu/en/dataoverview