The dataset contains the output of a part-of-speech analysis run with the magyarlanc tool on a sample of 25 Hungarian parliamentary speeches. It is used in Chapter 11 of the textbook (https://tankonyv.poltextlab.com/nlp-ch.html).

data_parlspeech_magyarlanc

Format

A data.frame with 17,870 observations and 4 variables:

token

The token created by magyarlanc.

lemma

The lemma derived from the token by magyarlanc.

POS_tag

The part-of-speech tag assigned to the token by magyarlanc, indicating its grammatical category (for example, noun or verb).

morfologic_features

The morphological features of the token.
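
Examples

The sketch below shows one way a data.frame with these four columns might be inspected. It does not load the real dataset: `toy` is a hypothetical stand-in, and its rows, tag labels, and feature strings are invented for illustration and may not match the tag set magyarlanc actually emits. With the real data loaded, `data_parlspeech_magyarlanc` could be substituted for `toy`.

```r
# Hypothetical stand-in with the same column names as
# data_parlspeech_magyarlanc. All values are invented for illustration.
toy <- data.frame(
  token = c("A", "parlament", "szavazott", "."),
  lemma = c("a", "parlament", "szavaz", "."),
  POS_tag = c("DET", "NOUN", "VERB", "PUNCT"),        # assumed tag labels
  morfologic_features = c("Definite=Def", "Case=Nom|Number=Sing",
                          "Number=Sing|Person=3|Tense=Past", "_"),
  stringsAsFactors = FALSE
)

# Overview of the columns and their types
str(toy)

# Frequency of each part-of-speech tag
pos_counts <- table(toy$POS_tag)
print(pos_counts)

# Tokens and lemmas for a single tag
nouns <- toy[toy$POS_tag == "NOUN", c("token", "lemma")]
print(nouns)
```

Counting tags with `table()` and subsetting by `POS_tag` needs only base R; the same steps could equally be written with dplyr.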

Source

https://cap.tk.hu/en/dataoverview

References

Zsibrita, János, Veronika Vincze, and Richárd Farkas (2013). Magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2013), 763–771.