The corpus all-tagged contains all SMS in all languages. Data for all languages except Romansh are tagged with TreeTagger.
In addition, the following sub-corpora per language are available:
The list of sub-corpora is also a good starting point for finding out which fields are available for your query, and for getting examples and statistics.
Please keep in mind that you also see corpora with upper-case letters in the browser (e.g. deu-rftagged, ita-tagged, roh etc.). These corpora contain data from our WhatsApp project.
Next to the name of each sub-corpus, you see the number of SMS (marked as "Texts") and tokens. You can use these figures for statistics.
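As a minimal sketch of how these two figures can be used, the following Python helpers compute the average SMS length and a normalised frequency for a query result. The function names and all numbers are illustrative assumptions, not values taken from the corpus; replace them with the "Texts" and token counts shown in the browser.

```python
def mean_tokens_per_text(tokens: int, texts: int) -> float:
    """Average number of tokens per SMS in a (sub-)corpus."""
    if texts <= 0:
        raise ValueError("texts must be a positive number")
    return tokens / texts


def per_thousand_tokens(hits: int, tokens: int) -> float:
    """Normalise a raw hit count to occurrences per 1,000 tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be a positive number")
    return 1000 * hits / tokens


if __name__ == "__main__":
    # Placeholder figures, invented for illustration only.
    texts, tokens, hits = 500, 10_000, 50
    print(mean_tokens_per_text(tokens, texts))
    print(per_thousand_tokens(hits, tokens))
```

Normalising by the token count makes hit counts comparable across sub-corpora of different sizes.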
When you click on the small i (for information) to the right of each (sub-)corpus name, you find more information about the corpus. More specifically:
On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
node & meta::mt_fr="false" is entered into the query field. More precisely:
node will also fetch all tokens that are in such SMS; if you want to distinguish between messages and tokens, you should explicitly query for one or the other: tok & … or msg & ….
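The query shapes above can be assembled programmatically, for example when you want to run the same metadata restriction once for messages and once for tokens. The sketch below only builds the query string in the form shown above; the function name build_query is hypothetical, and the code does not contact the corpus browser.

```python
# Element names taken from the text above: node, tok and msg.
ALLOWED_ELEMENTS = {"node", "tok", "msg"}


def build_query(element: str, meta_field: str, value: str) -> str:
    """Return a query string of the form: element & meta::field="value"."""
    if element not in ALLOWED_ELEMENTS:
        raise ValueError(f"unknown element: {element!r}")
    return f'{element} & meta::{meta_field}="{value}"'


if __name__ == "__main__":
    # One query for messages only, one for tokens only.
    for element in ("msg", "tok"):
        print(build_query(element, "mt_fr", "false"))
```

Paste the resulting string into the query field; using msg or tok instead of node keeps message counts and token counts apart.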
By clicking on the little piece of paper next to the information i in the list of sub-corpora, you get a list of all SMS in the respective sub-corpus.
From here, you can click on "full text" to view the whole SMS (without any annotations).