03_processing:03_languages
Differences
This shows you the differences between two versions of the page.
03_processing:03_languages [2022/01/04 10:46] – Simone Ueberwasser | 03_processing:03_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 66: | Line 66: | ||
|bjs|beijinhos|nonce borrowing Portuguese| | |bjs|beijinhos|nonce borrowing Portuguese| | ||
|btw|by the way|nonce borrowing English| | |btw|by the way|nonce borrowing English| | ||
- | |bmmw|bist mir mega wichtig|| | + | |bmmw| |bist mir mega wichtig| |
|bx|bisous|nonce borrowing French| | |bx|bisous|nonce borrowing French| | ||
|cu|see you|nonce borrowing English| | |cu|see you|nonce borrowing English| | ||
Line 73: | Line 73: | ||
|glg, GlG| |Ganz liebe Grüsse / Ganz liebi Grüessli| | |glg, GlG| |Ganz liebe Grüsse / Ganz liebi Grüessli| | ||
|Grz|Greetings (> greetz)|nonce borrowing English| | |Grz|Greetings (> greetz)|nonce borrowing English| | ||
- | |hdl||Hab' | + | |hdl| |Hab' dich lieb / Ha di lieb| |
- | |IldvgHunvvvm buöIugsnz vdbddadddkukdggf, | + | |IldvgHunvvvm buöIugsnz vdbddadddkukdggf, |
- | |ka||keine Ahnung / Kei Ahnig| | + | |ka| |keine Ahnung / Kei Ahnig| |
|Kikoo|coucou|nonce borrowing French| | |Kikoo|coucou|nonce borrowing French| | ||
- | |ld||liebe dich / lieb di| | + | |ld| |liebe dich / lieb di| |
|ldsmf4iue|liebe dich so mega fest 4 immer und ewig|nonce borrowing English| | |ldsmf4iue|liebe dich so mega fest 4 immer und ewig|nonce borrowing English| | ||
|lymtwcs|love you more than words can say|nonce borrowing English| | |lymtwcs|love you more than words can say|nonce borrowing English| | ||
Line 85: | Line 85: | ||
|t.b.c.|to be continued / confirmed|nonce borrowing English| | |t.b.c.|to be continued / confirmed|nonce borrowing English| | ||
|tgif|thank god it's friday|nonce borrowing English| | |tgif|thank god it's friday|nonce borrowing English| | ||
- | |tk(...)||tausend Küsse (dein/e XY) / tuusig Küss| | + | |tk(...)| |tausend Küsse (dein/e XY) / tuusig Küss| |
|tqm|Te quiero mucho|nonce borrowing Spanish| | |tqm|Te quiero mucho|nonce borrowing Spanish| | ||
|tvtb|ti voglio tanto bene|nonce borrowing Italian| | |tvtb|ti voglio tanto bene|nonce borrowing Italian| | ||
- | |wdnv||will dich nicht verlieren / will di nöd verlüre| | + | |wdnv| |will dich nicht verlieren / will di nöd verlüre| |
|we|Wochenende / week-end|No tagging, since it can be either language. | |we|Wochenende / week-end|No tagging, since it can be either language. | ||
+ | |||
Information used: | Information used: | ||
- | http:// | + | * http:// |
- | http:// | + | |
- | http:// | + | |
- | Phonetic writing | + | ====Phonetic writing==== |
- | Phonetic writing, whether based on letters or on digits, occurs often in SMS. For the language tagging, we take their phonetic value and then tag them as normal words. E.g. 4 in an English SMS is treated as ' | + | Phonetic writing, whether based on letters or on digits, occurs often in SMS. For the language tagging, we take their phonetic value and then tag them as normal words. E.g. //4// in an English SMS is treated as ' |
- | Unorthodox spelling | + | ====Unorthodox spelling==== |
Spelling that deviates from prescribed spelling is not considered to make decisions about which language/ | Spelling that deviates from prescribed spelling is not considered to make decisions about which language/ | ||
- | Proper names | + | ====Proper names==== |
All kinds of proper names were ignored when counting the tokens to decide on the main language. This includes brand names, people' | All kinds of proper names were ignored when counting the tokens to decide on the main language. This includes brand names, people' | ||
- | Homophony vs. homography | + | ====Homophony vs. homography==== |
- | If a word's spelling implies a difference in pronunciation, | + | If a word's spelling implies a difference in pronunciation, |
- | Compounds | + | ====Compounds==== |
- | To define the language of a compound consisting of elements from two different languages, the compound was taken apart and each part was treated individually. E.g. Delegiertenmeeting in an English SMS would have been tagged as a German nonce borrowing, because of Delegierten-. If the same compound had appeared in a French SMS, it would have been marked as containing both an English and a German nonce borrowing. | + | To define the language of a compound consisting of elements from two different languages, the compound was taken apart and each part was treated individually. E.g. //Delegiertenmeeting// in an English SMS would have been tagged as a German nonce borrowing, because of //Delegierten//-. If the same compound had appeared in a French SMS, it would have been marked as containing both an English and a German nonce borrowing. |
- | Main language | + | ====Main language==== |
Defining the main language as the language with the most words sounds simple enough. However, it is not always that simple. The following problems occurred: | Defining the main language as the language with the most words sounds simple enough. However, it is not always that simple. The following problems occurred: | ||
- | Abbreviations | + | ===Abbreviations=== |
- | SMS is a text type with lots of abbreviations, | + | SMS is a text type with lots of abbreviations, |
- | Equal number of words | + | ===Equal number of words=== |
In case of a tie (e.g. two English words and two French words), we tagged both languages as main language, because we want those SMS to be found by researchers of either language. | In case of a tie (e.g. two English words and two French words), we tagged both languages as main language, because we want those SMS to be found by researchers of either language. | ||
Where, for example, one word was clearly dialect, one word clearly Standard and all other words were homographs, we tagged the SMS as Standard. | Where, for example, one word was clearly dialect, one word clearly Standard and all other words were homographs, we tagged the SMS as Standard. | ||
- | Compounds | + | ===Compounds=== |
- | If a compound deriving from two languages is needed to define the main language of an SMS, the compound is considered to be one word and the head is used to define the language of the whole word. E.g. in Delegiertenmeeting there is a German and an English component. However, since meeting is the head, the whole word is considered to be English. | + | If a compound deriving from two languages is needed to define the main language of an SMS, the compound is considered to be one word and the head is used to define the language of the whole word. E.g. in //Delegiertenmeeting// there is a German and an English component. However, since meeting is the head, the whole word is considered to be English. |
- | No recognizable words | + | ===No recognizable words=== |
Some SMS come with only punctuation or emoticons. Their main language was tagged as ' | Some SMS come with only punctuation or emoticons. Their main language was tagged as ' |
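The counting rules described under "Compounds" and "Main language" can be illustrated with a minimal Python sketch. It is not the tooling used for the corpus: the token shape, the language codes, and the function names `compound_language`, `nonce_borrowings` and `main_languages` are all hypothetical, and the sketch assumes that the head of a compound is its last element, as in //Delegiertenmeeting//. The dialect versus Standard special case is left out, and an SMS without recognizable words simply yields an empty list here, since the tag used for that case is not reproduced.

```python
from collections import Counter

# Hypothetical token shape: (word, language_code, is_proper_name).
# language_code is None when no language could be assigned
# (for example punctuation or emoticons).


def compound_language(parts):
    """Language of a mixed compound when it counts towards the main language.

    parts is a list of (element, language_code) pairs.
    Assumption: the head is the last element of the compound.
    """
    return parts[-1][1]


def nonce_borrowings(parts, sms_language):
    """Languages of compound elements that differ from the SMS language."""
    return sorted({lang for _element, lang in parts if lang != sms_language})


def main_languages(tokens):
    """Return every language that has the highest token count.

    Proper names and tokens without a language are ignored.
    A tie returns all tied languages; no countable tokens returns [].
    """
    counts = Counter(
        lang
        for _word, lang, is_proper_name in tokens
        if lang is not None and not is_proper_name
    )
    if not counts:
        return []
    top = max(counts.values())
    return sorted(lang for lang, n in counts.items() if n == top)


parts = [("Delegierten", "de"), ("meeting", "en")]
print(compound_language(parts))        # head "meeting" decides: en
print(nonce_borrowings(parts, "en"))   # ['de']
print(nonce_borrowings(parts, "fr"))   # ['de', 'en']
```

Returning a list rather than a single value keeps the tie rule explicit: an SMS with two English and two French words comes back as `['en', 'fr']`, so it can be found under either language.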
03_processing/03_languages.1641289614.txt.gz · Last modified: 2022/06/27 09:21 (external edit)