====== Part of speech tagging ======
[[https://en.wikipedia.org/wiki/Part-of-speech_tagging|Wikipedia]] defines PoS tagging as follows: "In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context, i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph."

In this corpus, we applied PoS tagging to the German, French and Italian parts using Helmut Schmid's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/|TreeTagger]]. For both varieties of German (i.e. dialectal and non-dialectal), a sub-corpus annotated with the [[https://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/|RFTagger]] is also available. The Romansh part was not tagged: there is no TreeTagger parameter file for Romansh, and no other tagging tools are available for this language either.

===== German (both dialectal and non-dialectal) =====
==== TreeTagger ====

  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used with a tailor-made German parameter file (courtesy of Helmut Schmid).
  * The [[http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf|STTS]] tagset was used.
  * The tagger's lexicon was systematically supplemented with borrowings and proper nouns (thanks to Andrea Suter).
  * The tag PTKINF was added for the infinitive particle (go, goge etc.) in the German dialect data.
  * The resulting sub-corpora are deu-tagged and gsw-tagged.
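
To illustrate what a tagged SMS looks like in practice, here is a minimal sketch that reads TreeTagger output for one message. It assumes the three-column, tab-separated format (token, tag, lemma) that TreeTagger prints when called with the ''-token'' and ''-lemma'' options. The function name, the ''TaggedToken'' type and the sample sentence are hypothetical and are not taken from the corpus pipeline.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One line of TreeTagger output (hypothetical helper type)."""
    token: str
    tag: str
    lemma: str


def parse_treetagger_output(text: str) -> List[TaggedToken]:
    """Parse tab-separated TreeTagger output into a list of tokens.

    Assumes three columns per line: token, tag, lemma.
    Lines with a different number of columns (for example blank lines
    or markup passed through unchanged) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(TaggedToken(*parts))
    return tokens


# Illustrative sample only: one normalised SMS tagged as a single unit
# with STTS tags.
sample = "Ich\tPPER\tich\nkomme\tVVFIN\tkommen\nmorgen\tADV\tmorgen\n"

for tok in parse_treetagger_output(sample):
    print(tok.token, tok.tag, tok.lemma)
```

Skipping malformed lines instead of raising an error is a design choice for this sketch; a stricter reader might prefer to report them.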

==== RFTagger ====
  * The same varieties of German were also tagged with the RFTagger, resulting in the sub-corpora deu-rftagged and gsw-rftagged.

===== French =====


  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used out of the box.
  * Achim Stein's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/french-tagset.html|TagSet]] for French was used.
  * The tags DET:DEM and DET:IND were added.
===== Italian =====


  * PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  * TreeTagger was used out of the box.
  * Achim Stein's [[https://www.cis.lmu.de/~schmid/tools/TreeTagger/data/italian-tagset.txt|TagSet]] for Italian was used.
  * The tag ADJ:poss was added.
===== Precision =====


Our test gave the following precision for the respective sub-corpora:
  * gsw: 2'734 tokens checked: 96.3% correct
  * deu: 2'922 tokens checked: 93.1% correct
  * fra: 3'133 tokens checked: 94.6% correct
  * ita: 2'527 tokens checked: 90.5% correct
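
Figures of this kind are the share of checked tokens whose automatic tag matches the manually verified tag. The sketch below shows that calculation. It assumes the check produces one (automatic tag, verified tag) pair per token. The function name and the four sample pairs are hypothetical and are not data from the corpus.

```python
from typing import Iterable, Tuple


def tagging_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of tokens whose automatic tag equals the verified tag."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no tokens checked")
    correct = sum(1 for automatic, verified in pairs if automatic == verified)
    return correct / len(pairs)


# Illustrative pairs only (automatic tag, manually verified tag).
checked = [
    ("PPER", "PPER"),
    ("VVFIN", "VVFIN"),
    ("ADV", "ADJD"),  # one tagging error
    ("NN", "NN"),
]

print(f"{tagging_accuracy(checked):.1%}")  # prints "75.0%"
```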