Differences

This shows you the differences between two versions of the page.

--- 03_processing:03_languages [2022/01/04 10:39] – Simone Ueberwasser
+++ 03_processing:03_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1
@@ Line 61: / Line 61: @@
 On the other hand, abbreviations that could not be resolved such as //iLSi// were ignored.
 The following abbreviations were considered:
+^Token^Interpretation^Language tagging based on:^
+|amt|Amo te|nonce borrowing: Portuguese|
+|bb|bébé|only considered in French SMS|
+|bjs|beijinhos|nonce borrowing: Portuguese|
+|btw|by the way|nonce borrowing English|
+|bmmw| |bist mir mega wichtig|
+|bx|bisous|nonce borrowing French|
+|cu|see you|nonce borrowing English|
+|fb|Facebook|ignored, since it's a proper name|
+|Ga li gr| |Ganz liebe Grüsse / Ganz liebi Grüessli|
+|glg, GlG| |Ganz liebe Grüsse / Ganz liebi Grüessli|
+|Grz|Greetings (> greetz)|nonce borrowing English|
+|hdl| |Hab' dich lieb / Ha di lieb|
+|IldvgHunvvvm buöIugsnz vdbddadddkukdggf, KK| |I lieb di vo ganzem Herze un no vill vill vill meh. [???] ...knuddel und küss di ganz ganz fescht, Kuss Kuss.|
+|ka| |keine Ahnung / Kei Ahnig|
+|Kikoo|coucou|nonce borrowing French|
+|ld| |liebe dich / lieb di|
+|ldsmf4iue|liebe dich so mega fest 4 immer und ewig|nonce borrowing English|
+|lymtwcs|love you more than words can say|nonce borrowing English|
+|lysm|Love you so much|nonce borrowing English|
+|mdr|mort de rire|nonce borrowing French|
+|piz|Peace|nonce borrowing English|
+|t.b.c.|to be continued / confirmed|nonce borrowing English|
+|tgif|thank god it's friday|nonce borrowing English|
+|tk(...)| |tausend Küsse (dein/e XY) / tuusig Küss|
+|tqm|Te quiero mucho|nonce borrowing Spanish|
+|tvtb|ti voglio tanto bene|nonce borrowing Italian|
+|wdnv| |will dich nicht verlieren / will di nöd verlüre|
+|we|Wochenende / week-end|No tagging, since it can be either language.|
-Token	Interpretation	Language tagging based on:
-amt	Amo te	nonce borrowing: Portuguese
-bb	bébé	only considered in French SMS
-bjs	beijinhos	nonce borrowing: Portuguese
-btw	by the way	nonce borrowing English
-bmmw	bist mir mega wichtig
-bx	bisous	nonce borrowing French
-cu	see you	nonce borrowing English
-fb	Facebook	ignored, since it's a proper name
-Ga li gr		Ganz liebe Grüsse / Ganz liebi Grüessli
-glg, GlG		Ganz liebe Grüsse / Ganz liebi Grüessli
-Grz	Greetings (> greetz)	nonce borrowing English
-hdl		Hab' dich lieb / Ha di lieb
-IldvgHunvvvm buöIugsnz vdbddadddkukdggf, KK		I lieb di vo ganzem Herze un no vill vill vill meh. [???] ...knuddel und küss di ganz ganz fescht, Kuss Kuss.
-ka		keine Ahnung / Kei Ahnig
-Kikoo	coucou	nonce borrowing French
-ld		liebe dich / lieb di
-ldsmf4iue	liebe dich so mega fest 4 immer und ewig	nonce borrowing English
-lymtwcs	love you more than words can say	nonce borrowing English
-lysm	Love you so much	nonce borrowing English
-mdr	mot de rire	nonce borrowing French
-piz	Peace	nonce borrowing English
-t.b.c.	to be continued / confirmed	nonce borrowing English
-tgif	thank god it's friday	nonce borrowing English
-tk(...)		tausend Küsse (dein/e XY) / tuusig Küss
-tqm	Te quiero mucho	nonce borrowing Spanish
-tvtb	ti voglio tanto bene	nonce borrowing Italian
-wdnv		will dich nicht verlieren / will di nöd verlüre
-we	Wochenende / week-end	No tagging, since it can be either language.
 Information used:
-http://linguistesblogueurs.blogspot.com/2005/11/les-abrviations-sms.html
+  * http://linguistesblogueurs.blogspot.com/2005/11/les-abrviations-sms.html
-http://spanish.about.com/od/writtenspanish/a/sms.htm
+  * http://spanish.about.com/od/writtenspanish/a/sms.htm
-http://www.urbandictionary.com/
+  * http://www.urbandictionary.com/
-Phonetic writing
+====Phonetic writing====
-Phonetic writing, whether based on letters or on digits, occurs often in SMS. For the language tagging, we take their phonetic value and then tag them as normal words. E.g. 4 in an English SMS is treated as 'for'. If phonetic writing is ambivalent (e.g. n8 can mean 'night' or 'nuit' or 'Nacht'), it is considered to be in the main language of the SMS. If, on the other hand 4 appears in a French SMS and cannot stand for 'quattre' but only for 'four' and thus phonetically for for, it is considered to be an English borrowing.
+Phonetic writing, whether based on letters or on digits, occurs often in SMS. For language tagging, we take the phonetic value of such spellings and then tag them as normal words. E.g. //4// in an English SMS is treated as 'for'. If phonetic writing is ambiguous (e.g. //n8// can mean 'night' or 'nuit' or 'Nacht'), it is considered to be in the main language of the SMS. If, on the other hand, //4// appears in a French SMS and cannot stand for 'quatre' but only for 'four', and thus phonetically for //for//, it is considered to be an English borrowing.
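The phonetic-writing rule can be sketched roughly as follows. This is only an illustration: the `PHONETIC` table, the language codes and the function name `tag_phonetic` are assumptions, and the sketch ignores the contextual judgement (e.g. whether //4// could stand for 'quatre') that the annotators made by hand.

```python
# Hypothetical lookup table: phonetic spelling -> {language code: resolved word}.
PHONETIC = {
    "4": {"en": "for"},
    "n8": {"en": "night", "fr": "nuit", "de": "Nacht"},
}


def tag_phonetic(token, main_language):
    """Return (resolved_word, language, is_borrowing), or None if unknown."""
    readings = PHONETIC.get(token.lower())
    if readings is None:
        return None
    if main_language in readings:
        # A reading exists in the main language of the SMS: use it.
        return readings[main_language], main_language, False
    # No reading in the main language: treat the token as a borrowing.
    # With several candidate languages this sketch simply takes the first.
    language, word = next(iter(readings.items()))
    return word, language, True
```

With this table, `tag_phonetic("n8", "fr")` resolves to the French reading, while `tag_phonetic("4", "fr")` is flagged as an English borrowing.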
-Unorthodox spelling
+====Unorthodox spelling====
 Spelling that deviates from prescribed spelling is not used to decide which language/variety a token comes from. This is especially important for the distinction between Standard German and dialect. Here, a /ß/ is not taken as an indication of influence from Germany but rather as a typing variant on the mobile phone's keyboard.
-Proper names
+====Proper names====
 All kinds of proper names were ignored when counting the tokens to decide on the main language. This includes brand names, people's names, names of places etc.
-Homophony vs. homography
+====Homophony vs. homography====
-If a word's spelling implies a difference in pronunciation, the spelling is considered to define the language. E.g. mam in a German SMS is considered to be German, while mom is considered to be English because of the different pronunciation. If, however, the difference is only on the graphical level, it is neglected. E.g. hey, hei, ey are the same, in this case an English borrowing.
+If a word's spelling implies a difference in pronunciation, the spelling is considered to define the language. E.g. //mam// in a German SMS is considered to be German, while //mom// is considered to be English because of the different pronunciation. If, however, the difference is only on the graphical level, it is neglected. E.g. //hey, hei, ey// are the same, in this case an English borrowing.
-Compounds
+====Compounds====
-To define the language of a compound consisting of elements from two different languages, the compound was taken apart and each part was treated individually. E.g. Delegiertenmeeting in an English SMS would have been tagged as a German nonce borrowing, because of Delegierten-. If the same compound had appeared in a French SMS, it would have been marked as containing each an English and a German nonce borrowing.
+To define the language of a compound consisting of elements from two different languages, the compound was taken apart and each part was treated individually. E.g. //Delegiertenmeeting// in an English SMS would have been tagged as a German nonce borrowing, because of //Delegierten//-. If the same compound had appeared in a French SMS, it would have been marked as containing both an English and a German nonce borrowing.
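A minimal sketch of this per-part treatment, assuming the compound has already been split by hand; the `COMPOUND_PARTS` table, the language codes and the function name are hypothetical and not part of the original annotation workflow.

```python
# Hypothetical, hand-split compounds: each part is paired with its language.
COMPOUND_PARTS = {
    "delegiertenmeeting": [("Delegierten", "de"), ("meeting", "en")],
}


def nonce_borrowings(compound, main_language):
    """Return the parts of a compound that are foreign to the main language."""
    parts = COMPOUND_PARTS[compound.lower()]
    return [(part, lang) for part, lang in parts if lang != main_language]
```

In an English SMS only the German part is reported; in a French SMS both parts are.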
-Main language
+====Main language====
 Defining the main language as the language with the most words sounds simple enough. However, it is not always that simple. The following problems occurred:
-Abbreviations
+===Abbreviations===
-SMS is a text type with lots of abbreviations, from cu ('see you') to ihdmfg ('i ha di mega fescht gärn'). When defining the main language, we counted words. Whenever possible, abbreviations where resolved and every word in the abbreviation (i.e. cu is two words) was counted individually. Thus, the SMS Tgif : D btw, wrist roller drbi? looked like German Dialect on the first look (roller can be German Dialect, drbi is definitely German Dialect). However, resolving those abbreviations, the SMS turned into Thank god it [is] Friday : D by the way, wrist roller drbi and is thus English.
+SMS is a text type with lots of abbreviations, from //cu// ('see you') to //ihdmfg// ('i ha di mega fescht gärn'). When defining the main language, we counted words. Whenever possible, abbreviations were resolved and every word in the abbreviation (e.g. //cu// is two words) was counted individually. Thus, the SMS "Tgif : D btw, wrist roller drbi?" looked like German Dialect at first glance (//roller// can be German Dialect, //drbi// is definitely German Dialect). However, once those abbreviations were resolved, the SMS turned into "Thank god it [is] Friday : D by the way, wrist roller drbi" and is thus English.
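The counting step for this example could look roughly like the sketch below. The abbreviation table and the toy lexicon are assumptions made for illustration (including the code `gsw` for Swiss German dialect and the decision to count "it [is]" as two words); the ambiguous //roller// is deliberately left out of the lexicon so that it does not count for either language.

```python
import re
from collections import Counter

# Hypothetical abbreviation table: each abbreviation expands to several words.
ABBREVIATIONS = {
    "tgif": ["thank", "god", "it", "is", "friday"],
    "btw": ["by", "the", "way"],
}

# Toy lexicon (assumption): word -> language code. Ambiguous words are omitted.
LEXICON = {
    "thank": "en", "god": "en", "it": "en", "is": "en", "friday": "en",
    "by": "en", "the": "en", "way": "en", "wrist": "en",
    "drbi": "gsw",
}


def count_languages(sms):
    """Count words per language after resolving abbreviations."""
    counts = Counter()
    # Keep runs of letters only, so punctuation and digits are dropped.
    for token in re.findall(r"[^\W\d_]+", sms.lower()):
        for word in ABBREVIATIONS.get(token, [token]):
            language = LEXICON.get(word)
            if language is not None:
                counts[language] += 1
    return counts
```

For "Tgif : D btw, wrist roller drbi?" the English count clearly outweighs the single dialect word once the two abbreviations are expanded.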
-Equal number of words
+===Equal number of words===
 In case of a tie (e.g. two words English, two words French), we tagged both languages as main language, because we want those SMS to be found by researchers of either language.
 In a case where e.g. one word was clearly dialect, one word clearly Standard and all other words homographs, we decided the SMS to be Standard.
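The tie rules can be sketched as a small function over per-language word counts. The function name and the codes `de` (Standard German) and `gsw` (dialect) are illustrative assumptions, and the Standard/dialect rule is simplified to "a tie between exactly these two varieties".

```python
def main_languages(counts, standard="de", dialect="gsw"):
    """Return the main language(s) for a mapping of language -> word count."""
    if not counts:
        # No recognizable words at all.
        return ["unknown"]
    top = max(counts.values())
    tied = sorted(lang for lang, n in counts.items() if n == top)
    if set(tied) == {standard, dialect}:
        # Standard and dialect tie: decide in favour of Standard.
        return [standard]
    # Otherwise every language sharing the top count is a main language.
    return tied
```

Returning a list rather than a single value keeps tied SMS findable under either language.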
-Compounds
+===Compounds===
-If a compound deriving from two languages is needed to define the main language of an SMS, the compound is considered to be one word and the head is used to define the language of the whole word. E.g. in Delegiertenmeeting there is a German and an English component. However, since meeting is the head, the whole word is considered to be English.
+If a compound deriving from two languages is needed to define the main language of an SMS, the compound is considered to be one word and the head is used to define the language of the whole word. E.g. in //Delegiertenmeeting// there is a German and an English component. However, since //meeting// is the head, the whole word is considered to be English.
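For main-language counting, the head rule might be sketched like this. It assumes the compound is already split into parts and that the head is the rightmost part, which holds for //Delegiertenmeeting// but is an assumption of this sketch rather than a rule stated for all compounds; the function name is hypothetical.

```python
def compound_language(parts):
    """Return the language of a compound, taken from its head.

    `parts` is a list of (part, language) pairs in surface order; the head
    is assumed to be the last part.
    """
    _head, language = parts[-1]
    return language
```

So `compound_language([("Delegierten", "de"), ("meeting", "en")])` counts the whole compound as one English word.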
-No recognizable words
+===No recognizable words===
 Some SMS come with only punctuation or emoticons. Their main language was tagged as 'unknown'.