03_processing:05_normalization
Differences
This shows you the differences between two versions of the page.
03_processing:04_normalization [2022/01/04 13:14] – Simone Ueberwasser | 03_processing:05_normalization [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 4: | Line 4: | ||
This normalised corpus, however, is not only interesting for automated processing but also for lexical research. The simple first person pronoun //ich// (' | This normalised corpus, however, is not only interesting for automated processing but also for lexical research. The simple first person pronoun //ich// (' | ||
- | The path from a non-standard SMS corpus to a normalised corpus and a PoS tagging is steep and involves many steps. Some of them were performed automatically, | + | The path from a non-standard SMS corpus to a normalised corpus and a PoS tagging is steep and involves many steps. Some of them were performed automatically, |
When applying the normalization, | When applying the normalization, | ||
Line 30: | Line 30: | ||
in a special layer. This annotation was performed automatically wherever possible and then corrected by the student assistants. | in a special layer. This annotation was performed automatically wherever possible and then corrected by the student assistants. | ||
- | Elements that cannot be identified | + | ====Elements that cannot be identified==== |
+ | |||
+ | In SMS texts, you find lots of words that are not understandable, | ||
+ | ==== Foreign material ==== | ||
- | In SMS texts, you find lots of words that are not understandable, | ||
- | Foreign material | ||
If a token comes from a foreign language that can be expected to be known (i.e. a Western European language), its spelling was adjusted to the original spelling where possible. If the token appeared in an inflected form, the inflection was added to the corrected spelling. | If a token comes from a foreign language that can be expected to be known (i.e. a Western European language), its spelling was adjusted to the original spelling where possible. If the token appeared in an inflected form, the inflection was added to the corrected spelling. | ||
- | kuul --> cool | + | |
- | en kuuli idee --> eine coole Idee | + | en kuuli idee --> eine coole Idee |
- | One to many or many to one | + | |
- | As has been said above, some tokens from the SMS layer had to be pulled together to present only one token in the normalised layer (e.g. [pomme] [de] [terre] --> [pomme de terre]), while others had to be taken apart ([hani] --> [han] [i]). These steps were noted down by our student assistants and executed by the computational linguists. | + | ==== One to many or many to one ==== |
- | Reconstructions and interpretations | + | |
+ | |||
+ | As has been said above, some tokens from the SMS layer had to be pulled together to present only one token in the normalised layer (e.g. //[pomme] [de] [terre]// --> | ||
+ | ==== Reconstructions and interpretations ==== | ||
+ | |||
+ | |||
+ | We decided not to reconstruct or interpret language where it is not clearly recognizable. This goes especially for the case system in the Swiss German dialect and thus we use an according example here. In Standard German, at least in the masculine form, a nominative can be distinguished from an accusative based on the morphology (cf. //den Mann// vs //der Mann// in the following example). In the Swiss German dialect, on the other hand, this distinction is not visible (cf. //dr Maa// in both situations in the following example). Now, do the Swiss in fact use a nominative as an object or is this just an accusative that is homophone to the nominative? We don't know and - more importantly - we do not want to interpret. Accordingly, | ||
+ | |||
+ | {{: | ||
- | We decided not to reconstruct or interpret language where it is not clearly recognizable. This goes especially for the case system in the Swiss German dialect and thus we use an according example here. In Standard German, at least in the masculine form, a nominative can be distinguished from an accusative based on the morphology (cf. den Mann vs der Mann in the following example). In the Swiss German dialect, on the other hand, this distinction is not visible (cf. dr Maa in both situations in the following example). Now, do the Swiss in fact use a nominative as an object or is this just an accusative that is homophone to the nominative? We don't know and - more importantly - we do not want to interpret. Accordingly, | ||
- | Case.png | ||
In a parallel approach, we do not reconstruct or interpret language in other situations or languages either. | In a parallel approach, we do not reconstruct or interpret language in other situations or languages either. | ||
- | Capitalisation | + | ==== Capitalisation ==== |
In German, nouns are spelled with an initial upper-case letter. Independent of the capitalization in the SMS layer, nouns are capitalized in the normalized layer in an attempt to support a PoS tagger in recognizing nouns. | In German, nouns are spelled with an initial upper-case letter. Independent of the capitalization in the SMS layer, nouns are capitalized in the normalized layer in an attempt to support a PoS tagger in recognizing nouns. | ||
- | Spelling | + | ====Spelling==== |
+ | |||
+ | Assuming that spelling is unorthodox in SMS all over, we decided to adjust spelling to what is found in a dictionary for a lemma on an according syntactical position. In German, e.g., there is an definite neutral article das and the conjunction dass (e.g. //er sagte, dass er komme// ('he said **that** he would come' | ||
+ | ==== Abbreviations ==== | ||
- | Assuming that spelling is unorthodox in SMS all over, we decided to adjust spelling to what is found in a dictionary for a lemma on an according syntactical position. In German, e.g., there is an definite neutral article das and the conjunction dass (e.g. er sagte, dass er komme ('he said that he would come' | ||
- | Abbreviations | ||
When abbreviations are used in SMS, three different situations can be distinguished: | When abbreviations are used in SMS, three different situations can be distinguished: | ||
- | The abbreviation cannot be decoded, eg. tkdn. In this case, the abbreviation is taken over from the SMS layer into the normalized layer as is and it is marked as an abbreviation. | + | * The abbreviation cannot be decoded, eg. //tkdn//. In this case, the abbreviation is taken over from the SMS layer into the normalized layer as is and it is marked as an abbreviation. |
- | The abbreviation can be decoded, e.g. cu for see you. In this case, the abbreviation is decomposed in the normalized layer, e.g cu becomes see you. This type of abbreviation is kept in the language, in which it is abbreviated, | + | |
- | Abbreviations that stand for a brand or other type of name (e.g. IBM) were kept as they are. | + | |
- | Digits | + | ==== Digits ==== |
+ | |||
+ | |||
+ | Digits were not modified, i.e. //3// remained //3// and //three// remained //three//. There is, however, one exception to this rule. Where digits were combined with letters, they were written out in the normalization, | ||
+ | ===== Special rules for Swiss German dialect ===== | ||
+ | |||
+ | |||
+ | ==== Helvetisms ==== | ||
+ | |||
+ | |||
+ | Helvetisms, i.e. lemmas that belong to Standard German in Switzerland according to the [[https:// | ||
+ | |||
+ | ====No equivalent in standard German==== | ||
+ | |||
+ | Some words in Swiss German dialect do not have equivalents in Standard German, e.g //luege// ('to look') or //gumpe// ('to jump' | ||
+ | |||
+ | A special situation in this context is a verbal particle that can be realized as //go, ga, goge// and similar forms. This particle is syntactically compulsory in the dialect but has no equivalent in Standard German and is semantically empty. We decided to normalize this particle to //go// and to take it over into the normalized layer in this form. | ||
+ | ==== Prepositions ==== | ||
- | Digits were not modified, i.e. 3 remained 3 and three remained three. There is, however, one exception to this rule. Where digits were combined with letters, they were written out in the normalization, | + | Quite regularly, the Swiss German dialect does not use the same prepositions as Standard German. In this case, we used the same preposition |
- | Special rules for Swiss German dialect | + | ==== Diminutives ==== |
- | Helvetisms | ||
- | Helvetisms, i.e. lemmas | + | In Standard German a diminutive is normally realized as //-chen//, while the dialect only know a diminutive in //-li//. For some lemmas |
- | No equivalent in standard German | + | ==== Imperatives ==== |
- | Some words in Swiss German dialect do not have equivalents in Standard German, e.g luege ('to look') or gumpe ('to jump' | ||
- | A special situation in this context is a verbal particle that can be realized as go, ga, goge and similar forms. This particle is syntactically compulsory in the dialect but has no equivalent in Standard German and is semantically empty. We decided to normalize this particle to go and to take it over into the normalized layer in this form. | ||
- | Prepositions | ||
- | Quite regularly, the Swiss German | + | In Standard German, the verb of an imperative can take a short or a long form: //schlaf gut// vs. //schlafe gut//. For the dialect, this is not the case, there is only a short form. Accordingly, |
- | Diminutives | + | |
- | In Standard German a diminutive is normally realized as -chen, while the dialect only know a diminutive in -li. For some lemmas and in some (older) variants of German, a -lein diminutive exist(ed). Accordingly, | + | ===== Special rules for other languages ===== |
- | Imperatives | + | |
- | In Standard German, the verb of an imperative can take a short or a long form: schlaf gut vs. schlafe gut. For the dialect, this is not the case, there is only a short form. Accordingly, | ||
- | Special rules for other languages | ||
- | You find more information for languages other than German in the documaentations | + | You find more information for languages other than German in the documentations |
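
The three abbreviation cases described in the Abbreviations section can be sketched in Python as follows. This is only an illustrative sketch, not the project's actual tooling: the lookup tables `EXPANSIONS` and `NAMES`, the `Normalized` record, and the function name are all hypothetical. The page only states that undecodable abbreviations are marked as abbreviations; whether expanded forms carry a mark as well is not visible here, so the sketch assumes they do not.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical lookup tables; the project's real lists are not shown on this page.
EXPANSIONS = {"cu": ["see", "you"]}  # decodable abbreviations, expanded in their own language
NAMES = {"IBM"}                      # brands and other names, kept as they are


@dataclass
class Normalized:
    tokens: list[str]       # token(s) written to the normalized layer
    marked_undecoded: bool  # True only when the abbreviation could not be decoded


def normalize_abbreviation(token: str) -> Normalized:
    """Normalize a token that is already known to be an abbreviation."""
    # Case 3: names such as brands are kept unchanged.
    if token in NAMES:
        return Normalized([token], False)
    # Case 2: a decodable abbreviation is decomposed into its full form.
    expansion = EXPANSIONS.get(token.lower())
    if expansion is not None:
        return Normalized(list(expansion), False)
    # Case 1: an undecodable abbreviation is taken over as is and marked.
    return Normalized([token], True)


if __name__ == "__main__":
    for sms_token in ["tkdn", "cu", "IBM"]:
        print(sms_token, "->", normalize_abbreviation(sms_token))
```

Checking names before expansions keeps an upper-case name from being lower-cased and accidentally matched against the expansion table.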
03_processing/05_normalization.1641298489.txt.gz · Last modified: 2022/06/27 09:21 (external edit)