03_processing:05_normalization

Differences

This shows you the differences between two versions of the page.

03_processing:05_normalization [2022/01/04 13:30]
Simone Ueberwasser
03_processing:05_normalization [2022/06/27 09:21] (current)
Line 56:
  
 In German, nouns are spelled with an initial upper-case letter. Independent of the capitalization in the SMS layer, nouns are capitalized in the normalized layer in an attempt to support a PoS tagger in recognizing nouns.
-//Spelling//
+====Spelling====
  
 Assuming that spelling in SMS is unorthodox throughout, we decided to adjust spelling to the dictionary form of a lemma in the corresponding syntactic position. In German, for example, there is a definite neuter article //das// and a conjunction //dass// (e.g. //er sagte, dass er komme// ('he said **that** he would come')). Irrespective of the spelling used in the SMS, we applied //das// for an article and //dass// for a conjunction. The same rule was applied to other homophonous words, too.
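 The //das// / //dass// rule can be illustrated with a minimal sketch. It assumes that the syntactic role of the token is already known; the role labels ''ART'' and ''CONJ'' and the function name are hypothetical and are not taken from the project's tooling.

```python
# Hypothetical role labels; the documentation does not name a tagset.
SPELLING_BY_ROLE = {
    "ART": "das",    # definite neuter article
    "CONJ": "dass",  # conjunction
}


def normalize_das_dass(token: str, role: str) -> str:
    """Return the dictionary spelling of das/dass for the given role.

    Tokens other than das/dass, and roles not covered by the table,
    are returned unchanged.
    """
    spelling = SPELLING_BY_ROLE.get(role)
    if spelling is None or token.lower() not in {"das", "dass"}:
        return token
    # Keep a sentence-initial capital if the SMS token had one.
    return spelling.capitalize() if token[:1].isupper() else spelling


print(normalize_das_dass("das", "CONJ"))  # conjunction spelled as article in the SMS
```

 Other homophone pairs could be handled the same way with one table per pair; which pairs were covered is not listed in the text.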
Line 70:
  
 Digits were not modified, i.e. //3// remained //3// and //three// remained //three//. There is, however, one exception to this rule: where digits were combined with letters, they were written out in the normalization. Thus, //4tel// became //Viertel// ('quarter').
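 A minimal sketch of the digit rule follows. The lookup table is hypothetical: only the //4tel// entry comes from the text, and mixed forms missing from the table are left unchanged here rather than guessed.

```python
# Hypothetical lookup table; only "4tel" is documented above.
MIXED_FORMS = {"4tel": "Viertel"}


def normalize_digits(token: str) -> str:
    """Write out tokens that mix digits and letters; leave all others alone."""
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    if has_digit and has_letter:
        # Unknown mixed forms are returned unchanged.
        return MIXED_FORMS.get(token.lower(), token)
    return token


for sms_token in ["3", "three", "4tel"]:
    print(sms_token, "->", normalize_digits(sms_token))
```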
-==== Special rules for Swiss German dialect ====
+===== Special rules for Swiss German dialect =====
  
  
-=== Helvetisms ===
+==== Helvetisms ====
  
  
 Helvetisms, i.e. lemmas that belong to Standard German in Switzerland according to the [[https://www.degruyter.com/document/doi/10.1515/9783110905816/html|Variantenwörterbuch]], were taken over into the normalized layer in their standardized spelling, even though they might not be understandable to a reader from Northern Germany.
  
-===No equivalent in standard German===
+====No equivalent in standard German====
  
 Some words in Swiss German dialect do not have equivalents in Standard German, e.g. //luege// ('to look') or //gumpe// ('to jump'). Wherever possible, we replaced these expressions in the normalized layer with lemmas that sound similar, provided the semantics are reasonably close. Following this idea, //luege// was normalized to //lugen// (according to the Duden: "rural for //to pry//"). Where this approach was not possible, we normalized to the Standard German lemma that is closest in meaning, e.g. //gumpe// became //springen//.
  
 A special case in this context is a verbal particle that can be realized as //go//, //ga//, //goge// and similar forms. This particle is syntactically compulsory in the dialect but has no equivalent in Standard German and is semantically empty. We decided to normalize this particle to //go// and to take it over into the normalized layer in this form.
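 The particle rule amounts to collapsing a set of variants onto one form. The sketch below assumes the caller already knows that a token is the verbal particle (the ''is_verbal_particle'' flag is hypothetical); the variant set contains only the three forms named above, although the text mentions further similar forms.

```python
# Only the variants named in the text; "similar forms" would extend this set.
GO_PARTICLE_VARIANTS = frozenset({"go", "ga", "goge"})


def normalize_go_particle(token: str, is_verbal_particle: bool) -> str:
    """Map every variant of the verbal particle to the single form 'go'."""
    if is_verbal_particle and token.lower() in GO_PARTICLE_VARIANTS:
        return "go"
    return token


print(normalize_go_particle("goge", True))
```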
-=== Prepositions ===
+==== Prepositions ====
  
  
 Quite regularly, the Swiss German dialect does not use the same prepositions as Standard German. In this case, we used the same preposition in the normalized layer as in the SMS layer (albeit adjusted in spelling where needed). For example, //i gane uf Bärn// ('I go to Bern'), which would be //ich gehe nach Bern// in Standard German, became //ich gehe auf Bern//.
-=== Diminutives ===
+==== Diminutives ====
  
  
 In Standard German, a diminutive is normally realized as //-chen//, while the dialect only knows a diminutive in //-li//. For some lemmas and in some (older) variants of German, a //-lein// diminutive exist(ed). Accordingly, we decided to apply this //-lein// form whenever a diminutive was used in the SMS. For example, //s'chindli// ('the little child') became //das Kindlein//, even though it sounds slightly archaic.
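 The diminutive rule can be sketched as a suffix replacement on top of a stem lookup. The stem lexicon is hypothetical (only //chind// → //Kind// is implied by the example above), and the sketch assumes the clitic article //s'// has already been split off as a separate token.

```python
# Hypothetical lexicon mapping dialect stems to Standard German stems.
STEM_LEXICON = {"chind": "Kind"}


def normalize_diminutive(token: str) -> str:
    """Turn a dialect -li diminutive into the -lein form when the stem is known."""
    lowered = token.lower()
    if not lowered.endswith("li"):
        return token
    stem = STEM_LEXICON.get(lowered[:-2])
    if stem is None:
        # Not a known diminutive stem: leave the token untouched.
        return token
    return stem + "lein"


print(normalize_diminutive("chindli"))
```

 Looking the stem up first avoids rewriting words that merely end in //-li// without being diminutives of a known lemma.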
-=== Imperatives ===
+==== Imperatives ====
  
  
 In Standard German, the verb of an imperative can take a short or a long form: //schlaf gut// vs. //schlafe gut//. This is not the case in the dialect, which only has a short form. Accordingly, the normalization always uses the short form.
  
-==== Special rules for other languages ====
+===== Special rules for other languages =====
  
  
 More information on languages other than German can be found in the documentation written in the original language for {{ :03_processing:normalization_fra.pdf |French}} and {{ :03_processing:normalization_ita.pdf |Italian}}, and in German for {{ :03_processing:normalization_roh.pdf |Romansh}}.
03_processing/05_normalization.1641299431.txt.gz · Last modified: 2022/06/27 09:21 (external edit)