====== Part of speech tagging ======

[[https://en.wikipedia.org/wiki/Part-of-speech_tagging|Wikipedia]] defines PoS tagging as follows:

"In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context, i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph."

In this corpus, we applied PoS tagging to the German, French and Italian parts using Helmut Schmid's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/|TreeTagger]]. For both varieties of German (i.e. dialectal and non-dialectal), there is also a sub-corpus available that was annotated with the [[https://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]]. For Romansh, unfortunately, there is no parameter file available for TreeTagger, and no other tools are available for this language either.

===== German (both dialectal and non-dialectal) =====

==== TreeTagger ====

  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used with a tailor-made German parameter file (courtesy of Helmut Schmid).
  * The [[http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf|STTS]] tagset was used.
  * The tagger's lexicon was systematically supplemented with borrowings and proper nouns (thanks to Andrea Suter).
  * The tag PTKINF was added for the infinitive particle (go, goge etc.) in the German dialect.
  * The resulting sub-corpora are deu-tagged and gsw-tagged.

==== RFTagger ====

  * The same varieties of German were also tagged with the RFTagger, resulting in the sub-corpora deu-rftagged and gsw-rftagged.

===== French =====

  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used out of the box.
  * Achim Stein's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/french-tagset.html|TagSet]] for French was used.
  * The tags DET:DEM and DET:IND were added.

===== Italian =====

  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used out of the box.
  * Achim Stein's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|TagSet]] for Italian was used.
  * The tag ADJ:poss was added.

===== Precision =====

Our test gave the following precision (share of checked tokens with a correct tag) for the respective sub-corpora:

  * gsw: 2'734 tokens checked, 96.3% correct
  * deu: 2'922 tokens checked, 93.1% correct
  * fra: 3'133 tokens checked, 94.6% correct
  * ita: 2'527 tokens checked, 90.5% correct
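
The figures above report, per sub-corpus, how many tokens were checked and what percentage of them carried the correct tag. As an illustration only, a figure of this kind could be computed as in the following minimal sketch. The actual evaluation procedure is not documented here, so the function name ''tagging_precision'', the triple layout (token, predicted tag, manually verified tag) and the five-token sample are all assumptions made for this example, not data from the corpus.

```python
from typing import Iterable, Tuple


def tagging_precision(checked: Iterable[Tuple[str, str, str]]) -> Tuple[int, float]:
    """Return (tokens checked, percentage tagged correctly).

    Each item is (token, predicted_tag, gold_tag), where gold_tag is the
    tag a human checker considers correct. Hypothetical helper, not part
    of TreeTagger or RFTagger.
    """
    total = 0
    correct = 0
    for _token, predicted, gold in checked:
        total += 1
        if predicted == gold:
            correct += 1
    if total == 0:
        raise ValueError("no tokens were checked")
    return total, 100.0 * correct / total


# Invented sample using STTS tags; the last token is deliberately mistagged.
sample = [
    ("ich", "PPER", "PPER"),
    ("gehe", "VVFIN", "VVFIN"),
    ("nach", "APPR", "APPR"),
    ("Hause", "NN", "NN"),
    ("morgen", "NN", "ADV"),
]

n, pct = tagging_precision(sample)
print(f"{n} tokens checked: {pct:.1f}% correct")  # 5 tokens checked: 80.0% correct
```

With samples of roughly 2'500 to 3'100 tokens per sub-corpus, a single percentage such as these remains an estimate for the sub-corpus as a whole.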