The corpus all-tagged contains all SMS in all languages. Data for all languages except Romansh are tagged with TreeTagger.
In addition, the following sub-corpora per language are available:
The list of sub-corpora is also a good starting point for finding out which fields are available for your queries, and for getting examples and statistics.
Please keep in mind that you also see corpora with upper-case letters in the browser (e.g. deu-rftagged, ita-tagged, roh etc.). These corpora contain data from our WhatsApp project.
Next to the name of each sub-corpus, you see the number of SMS (marked as "Texts") and tokens. You can use these figures for statistics.
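Because the sub-corpora differ in size, raw hit counts are easier to compare once they are normalised by the token figure. The following is a minimal sketch of such a normalisation. The function name `per_million` and all numbers are hypothetical and only serve as an illustration; the real token counts have to be read from the list of sub-corpora.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Return the number of hits normalised to one million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens


# Hypothetical figures for illustration only: (hits, tokens) for two
# imaginary sub-corpora. Replace them with the counts shown in the browser.
hits_a, tokens_a = 120, 400_000
hits_b, tokens_b = 45, 90_000

print(per_million(hits_a, tokens_a))  # 300.0
print(per_million(hits_b, tokens_b))  # 500.0
```

In this made-up example the second sub-corpus has fewer hits in absolute terms but a higher relative frequency, which is why the token figure matters for any comparison across sub-corpora.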
When you click on the small information icon to the right of each (sub-)corpus name, you find more information about the corpus. More specifically:
On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
By clicking on the little document icon next to the information icon in the list of sub-corpora, you get a list of all SMS in the respective sub-corpus.
From there, you can open each SMS to view it in full (without any annotations).