Language tagging

Types of taggings

Each SMS was tagged for the languages contained within. There are three possible tags:

Restrictions

Because whole SMS were tagged rather than individual words, a few restrictions apply:

Identifying languages

To determine whether a word is an established part of the main language, the following codices were used for the specific languages:

Normally, each SMS has exactly one main language. It can, however, contain borrowings as well as nonce borrowings, both of them from different languages. In rare cases, an SMS can have two main languages. Consider the following example:

  1. Olla fratello!!! Come stai? Wie geht's dir so? Immer noch so lange am arbeiten wie früher? Ich hab endlich mein eigenes Restaurant und mucho travajo😉 aber macht mir extrem spass😉 allora amore, buona giornata und luegsch uf di, gäll😉 peace

This original SMS was tagged as follows:

As can be seen from this example, Standard German and Swiss German dialect are tagged independently, because a lot of research in this field will most likely be performed on this corpus. The same goes for all Swiss national languages (i.e. German, French, Italian, Romansh), where we differentiated between varieties. However, varieties of other languages (e.g. British and American English, or Spanish from Spain and Spanish from South American countries) were not taken into consideration.

Special tagging problems

Swiss German dialect vs. Standard German

Since there is neither a fixed spelling system nor a codex for Swiss German dialect, many words are homographs in Swiss German and Standard German. The following rules have thus been fixed to assign a word to one variety or the other:

Internationalisms

Some words occur in several languages (e.g. OK or restaurant). Many of them derive from Greek or Latin; these two languages were thus ignored as donors. They were ignored even if the word was only coined in the last two hundred years or so. Photograph, for example, was coined by the inventors of the camera and could thus be considered German or French, depending on the source. However, since we care only about the word and not about its coining, that type of word creation was ignored altogether: the word is considered to be Greek and is thus ignored. In all other cases, i.e. if the word is not a borrowing from Greek or Latin, the following rules were applied to define a word as a borrowing:

Pseudo borrowings

Some words in a language clearly sound as if they were borrowings but are not known in the apparent donor language. Examples are handy, the German word for 'cell phone', and alles paletti (~ 'OK'), which imitates a non-existent Italian word. Linguistically, these words are not borrowings. However, in the interest of possible research questions, we still considered them to be nonce borrowings.

Abbreviations

Abbreviations like tgif for 'thank god it's friday' were resolved whenever possible and thus taken into account both when defining the main language (see below) and when identifying borrowings or nonce borrowings. In a German SMS, tgif would therefore be considered an English nonce borrowing. Abbreviations that could not be resolved, such as iLSi, were ignored. The following abbreviations were considered:

| Token | Interpretation | Language tagging based on |
| --- | --- | --- |
| amt | Amo te | nonce borrowing: Portuguese |
| bb | bébé | only considered in French SMS |
| bjs | beijinhos | nonce borrowing: Portuguese |
| btw | by the way | nonce borrowing: English |
| bmmw | bist mir mega wichtig | |
| bx | bisous | nonce borrowing: French |
| cu | see you | nonce borrowing: English |
| fb | Facebook | ignored, since it's a proper name |
| Ga li gr | Ganz liebe Grüsse / Ganz liebi Grüessli | |
| glg, GlG | Ganz liebe Grüsse / Ganz liebi Grüessli | |
| Grz | Greetings (> greetz) | nonce borrowing: English |
| hdl | Hab' dich lieb / Ha di lieb | |
| IldvgHunvvvm buöIugsnz vdbddadddkukdggf, KK | I lieb di vo ganzem Herze un no vill vill vill meh. [???] …knuddel und küss di ganz ganz fescht, Kuss Kuss. | |
| ka | keine Ahnung / Kei Ahnig | |
| Kikoo | coucou | nonce borrowing: French |
| ld | liebe dich / lieb di | |
| ldsmf4iue | liebe dich so mega fest 4 immer und ewig | nonce borrowing: English |
| lymtwcs | love you more than words can say | nonce borrowing: English |
| lysm | Love you so much | nonce borrowing: English |
| mdr | mort de rire | nonce borrowing: French |
| piz | Peace | nonce borrowing: English |
| t.b.c. | to be continued / confirmed | nonce borrowing: English |
| tgif | thank god it's friday | nonce borrowing: English |
| tk(…) | tausend Küsse (dein/e XY) / tuusig Küss | |
| tqm | Te quiero mucho | nonce borrowing: Spanish |
| tvtb | ti voglio tanto bene | nonce borrowing: Italian |
| wdnv | will dich nicht verlieren / will di nöd verlüre | |
| we | Wochenende / week-end | |
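
For anyone processing the corpus automatically, the abbreviation table can be read as a lookup from token to interpretation and tagging decision. The following is only a minimal sketch: the names `Abbreviation`, `ABBREVIATIONS` and `lookup_abbreviation` are hypothetical and not part of any corpus tooling, and only a handful of rows are included.

```python
from typing import NamedTuple, Optional


class Abbreviation(NamedTuple):
    interpretation: str
    tagging: Optional[str]  # None where the table gives no tagging note


# Hypothetical excerpt of the abbreviation table; keys are lowercased tokens.
ABBREVIATIONS = {
    "btw": Abbreviation("by the way", "nonce borrowing: English"),
    "cu": Abbreviation("see you", "nonce borrowing: English"),
    "tgif": Abbreviation("thank god it's friday", "nonce borrowing: English"),
    "tqm": Abbreviation("Te quiero mucho", "nonce borrowing: Spanish"),
    "hdl": Abbreviation("Hab' dich lieb / Ha di lieb", None),
}


def lookup_abbreviation(token: str) -> Optional[Abbreviation]:
    """Return the table entry for a token, or None if it cannot be resolved."""
    return ABBREVIATIONS.get(token.lower())


print(lookup_abbreviation("Tgif"))
print(lookup_abbreviation("iLSi"))  # unresolved abbreviations are ignored
```

Lowercasing the key is an assumption of this sketch; the table itself lists some tokens in several capitalisations (e.g. glg, GlG).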

Information used:

Phonetic writing

Phonetic writing, whether based on letters or on digits, occurs often in SMS. For the language tagging, we take the phonetic value of such tokens and then tag them as normal words. E.g. 4 in an English SMS is treated as 'for'. If phonetic writing is ambiguous (e.g. n8 can mean 'night', 'nuit' or 'Nacht'), it is considered to be in the main language of the SMS. If, on the other hand, 4 appears in a French SMS and cannot stand for 'quatre' but only for 'four', and thus phonetically for 'for', it is considered to be an English borrowing.
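
The decision for phonetic writing can be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used: `PHONETIC_READINGS` and `tag_phonetic` are hypothetical names, the readings are limited to the two examples from the text, and the sketch presupposes that an annotator has already ruled out the plain numeral reading (e.g. 'quatre').

```python
from typing import Dict, Optional, Tuple

# Hypothetical readings: token -> {language: word the token sounds like}.
PHONETIC_READINGS: Dict[str, Dict[str, str]] = {
    "4": {"English": "for"},
    "n8": {"English": "night", "French": "nuit", "German": "Nacht"},
}


def tag_phonetic(token: str, main_language: str) -> Optional[Tuple[str, str, bool]]:
    """Return (word, language, is_borrowing), or None if undecidable here."""
    readings = PHONETIC_READINGS.get(token.lower())
    if not readings:
        return None
    # A reading in the main language wins, which also settles ambiguous tokens.
    if main_language in readings:
        return readings[main_language], main_language, False
    # Exactly one reading in another language: treat it as a borrowing.
    if len(readings) == 1:
        language, word = next(iter(readings.items()))
        return word, language, True
    return None  # several foreign readings: left to the annotator


print(tag_phonetic("n8", "French"))
print(tag_phonetic("4", "French"))
```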

Unorthodox spelling

Spelling that deviates from the prescribed spelling is not used to decide which language or variety a token comes from. This is especially important for the distinction between Standard German and dialect. Here, an ß is not taken as an indication of influence from Germany but rather as a typing variant on the mobile phone's keyboard.

Proper names

All kinds of proper names were ignored when counting the tokens to decide on the main language. This includes brand names, people's names, place names, etc.

Homophony vs. homography

If a word's spelling implies a difference in pronunciation, the spelling is considered to define the language. E.g. mam in a German SMS is considered to be German, while mom is considered to be English because of the different pronunciation. If, however, the difference is only on the graphical level, it is disregarded. E.g. hey, hei and ey are treated as the same word, in this case an English borrowing.

Compounds

To define the language of a compound consisting of elements from two different languages, the compound was taken apart and each part was treated individually. E.g. Delegiertenmeeting in an English SMS would have been tagged as a German nonce borrowing because of Delegierten-. If the same compound had appeared in a French SMS, it would have been marked as containing both an English and a German nonce borrowing.
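
The part-by-part treatment of compounds for borrowing tags could be sketched like this. The function name `compound_borrowings` is hypothetical, and the sketch assumes the compound has already been split by hand and each part assigned a language.

```python
from typing import List, Tuple


def compound_borrowings(parts: List[Tuple[str, str]], main_language: str) -> List[str]:
    """Return the donor languages of all parts not in the main language.

    parts holds (part, language) pairs; each language is listed once,
    in order of first appearance.
    """
    donors: List[str] = []
    for _part, language in parts:
        if language != main_language and language not in donors:
            donors.append(language)
    return donors


# Manual split of the example from the text (an assumption of this sketch).
parts = [("Delegierten", "German"), ("meeting", "English")]

print(compound_borrowings(parts, "English"))  # only the German part counts
print(compound_borrowings(parts, "French"))   # both parts count
```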

Main language

Defining the main language as the language with the most words sounds simple enough. However, it is not always that simple. The following problems occurred:

Abbreviations

SMS is a text type with many abbreviations, from cu ('see you') to ihdmfg ('i ha di mega fescht gärn'). When defining the main language, we counted words. Whenever possible, abbreviations were resolved and every word in the abbreviation was counted individually (i.e. cu is two words). Thus, the SMS "Tgif : D btw, wrist roller drbi?" looked like German dialect at first glance (roller can be German dialect, drbi is definitely German dialect). However, once the abbreviations were resolved, the SMS turned into "Thank god it [is] Friday : D by the way, wrist roller drbi" and is thus English.
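
Counting with resolved abbreviations might look roughly like the sketch below. It is an illustration only: `EXPANSIONS` and `count_words_per_language` are hypothetical names, tokens are assumed to be tagged for language by hand, and counting the bracketed "is" as a word of its own is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

DIALECT = "Swiss German dialect"

# Hypothetical expansions: abbreviation -> (resolved words, language).
EXPANSIONS = {
    "tgif": (["thank", "god", "it", "is", "friday"], "English"),
    "btw": (["by", "the", "way"], "English"),
    "cu": (["see", "you"], "English"),
}


def count_words_per_language(tagged_tokens: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Count words per language, counting each word of an abbreviation."""
    counts: Counter = Counter()
    for token, language in tagged_tokens:
        expansion = EXPANSIONS.get(token.lower())
        if expansion is not None:
            words, expansion_language = expansion
            counts[expansion_language] += len(words)
        elif language is not None:
            counts[language] += 1
        # Tokens without a language (emoticons, undecided words) are skipped.
    return counts


sms = [
    ("Tgif", None),
    ("btw", None),
    ("wrist", None),  # language not stated in the text; left uncounted here
    ("roller", DIALECT),
    ("drbi", DIALECT),
]
print(count_words_per_language(sms))
```

With these assumptions the English count clearly exceeds the dialect count, in line with the tagging described above.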

Equal number of words

In case of a tie (e.g. two words English, two words French), we tagged both languages as main languages, because we want those SMS to be found by researchers of either language. In a case where, e.g., one word was clearly dialect, one word clearly Standard and all other words homographs, we decided the SMS to be Standard.
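
The tie rules can be summarised in a short sketch. The function `main_languages` and the label strings are hypothetical, and applying the Standard-over-dialect preference to any tie between the two varieties generalises the single case described in the text.

```python
from typing import Dict, List

STANDARD = "Standard German"
DIALECT = "Swiss German dialect"


def main_languages(counts: Dict[str, int]) -> List[str]:
    """Return the main language(s) for per-language word counts."""
    counted = {language: n for language, n in counts.items() if n > 0}
    if not counted:
        return ["unknown"]  # no recognizable words at all
    top = max(counted.values())
    leaders = sorted(language for language, n in counted.items() if n == top)
    # Assumption: a tie between Standard and dialect is resolved as Standard.
    if set(leaders) == {STANDARD, DIALECT}:
        return [STANDARD]
    return leaders  # one language, or all tied languages


print(main_languages({"English": 2, "French": 2}))
print(main_languages({STANDARD: 1, DIALECT: 1}))
```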

Compounds

If a compound deriving from two languages is needed to define the main language of an SMS, the compound is considered to be one word and the head is used to define the language of the whole word. E.g. in Delegiertenmeeting there is a German and an English component. However, since meeting is the head, the whole word is considered to be English.
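
For the main-language count, the head rule amounts to the following sketch. The name `compound_language` is hypothetical, and the sketch assumes the head is the last part of the compound, as in the Delegiertenmeeting example; compounds with a different head position would need a manual decision.

```python
from typing import List, Tuple


def compound_language(parts: List[Tuple[str, str]]) -> str:
    """Return the language of a compound, taken from its head.

    parts holds (part, language) pairs with the head last (an assumption).
    The compound then counts as one word of that language.
    """
    if not parts:
        raise ValueError("a compound needs at least one part")
    _head, language = parts[-1]
    return language


print(compound_language([("Delegierten", "German"), ("meeting", "English")]))
```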

No recognizable words

Some SMS come with only punctuation or emoticons. Their main language was tagged as 'unknown'.