
Part of speech tagging

Wikipedia defines PoS tagging as follows: "In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph." In this corpus, we applied PoS tagging to the German, French and Italian parts using Helmut Schmid's TreeTagger. For Romansh, unfortunately, no TreeTagger parameter file is available, and no other tagging tools are available for this language either.

In our corpus, the PoS annotation is stored in the layer pos in ANNIS.
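As a rough illustration of what tagger output looks like before it is loaded into an annotation layer, the sketch below parses TreeTagger-style output. It assumes the tagger was run so that each output line holds token, tag and lemma separated by tabs (TreeTagger's format with the -token and -lemma options). The names TaggedToken and parse_tagger_output and the sample sentence are hypothetical and are not taken from the corpus or its processing scripts.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One line of tagger output (hypothetical helper type)."""
    token: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> List[TaggedToken]:
    """Parse tab-separated token/tag/lemma lines into TaggedToken records."""
    tagged = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        tagged.append(TaggedToken(*parts))
    return tagged


# Made-up sample using STTS tags; not an SMS from the corpus.
SAMPLE = "Ich\tPPER\tich\nkomme\tVVFIN\tkommen\nmorgen\tADV\tmorgen\n.\t$.\t.\n"

if __name__ == "__main__":
    for tok in parse_tagger_output(SAMPLE):
        print(f"{tok.token:10} {tok.pos}")
```

Only the pos field corresponds to what the page describes as the pos layer; the lemma column is included here because the assumed output format carries it.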


German (both dialectal and non-dialectal)

  • PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  • TreeTagger was used with a tailor-made German parameter file (courtesy of Helmut Schmid).
  • The STTS tagset was used.
  • The tagger's lexicon was systematically supplemented with borrowings and proper nouns (thanks to Andrea Suter).
  • The tag PTKINF was added for the infinitive particle (go, goge etc.) in the German dialect data.


French

  • PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  • TreeTagger was used out of the box.
  • Achim Stein's TagSet for French was used.
  • The tags DET:DEM and DET:IND were added.


Italian

  • PoS tagging was applied to the normalised level of each SMS, and each SMS was tagged as one unit.
  • TreeTagger was used out of the box.
  • Achim Stein's TagSet for Italian was used.
  • The tag ADJ:poss was added.



Precision

Our test gave the following precision for the respective sub-corpora:
  • gsw: 2'734 tokens checked: 96.3% correct
  • deu: 2'922 tokens checked: 93.1% correct
  • fra: 3'133 tokens checked: 94.6% correct
  • ita: 2'527 tokens checked: 90.5% correct
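To compare the sub-corpora or to combine the samples, the reported figures can be turned back into approximate counts. The sketch below is derived arithmetic only: it assumes each percentage is the share of checked tokens that carried a correct tag, and it weights each sub-corpus by its number of checked tokens. The page itself reports no pooled figure, and the function names are hypothetical.

```python
from typing import Dict, Tuple

# (tokens checked, percent correct) as reported in the list above.
REPORTED: Dict[str, Tuple[int, float]] = {
    "gsw": (2734, 96.3),
    "deu": (2922, 93.1),
    "fra": (3133, 94.6),
    "ita": (2527, 90.5),
}


def approx_correct(tokens: int, percent: float) -> int:
    """Approximate number of correctly tagged tokens (rounded, since the
    source gives percentages to one decimal place only)."""
    return round(tokens * percent / 100)


def pooled_percent(reported: Dict[str, Tuple[int, float]]) -> float:
    """Token-weighted average of the reported percentages."""
    total = sum(n for n, _ in reported.values())
    weighted = sum(n * p for n, p in reported.values())
    return weighted / total


if __name__ == "__main__":
    for code, (n, p) in REPORTED.items():
        print(f"{code}: about {approx_correct(n, p)} of {n} tokens correct")
    print(f"pooled: {pooled_percent(REPORTED):.1f}% (derived, not reported)")
```

Weighting by checked tokens rather than averaging the four percentages keeps a larger sample such as fra from counting the same as a smaller one such as ita.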

Please don't forget to cite the corpus in your work.
Topic revision: r3 - 08 Aug 2016, SimoneUeberwasser
This site is powered by Foswiki. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.


The corpora and documentation are licensed under the Creative Commons license: Attribution + Noncommercial:
- Licensees may copy, distribute, display, and perform the work and make derivative works based on it only for noncommercial purposes.
- Licensees may copy, distribute, display and publish the work and make derivative works based on it only if they give the author or licensor the credits as follows:
Stark, Elisabeth; Ueberwasser, Simone; Ruef, Beni (2009-2014). Swiss SMS Corpus. University of Zurich. https://sms.linguistik.uzh.ch
