The corpus all-tagged contains all SMS in all languages. Data for all languages except Romansh are tagged with TreeTagger.
In addition, the following sub-corpora per language are available:
The list of sub-corpora is also a good starting point for finding out which fields are available for your query, and for getting examples and statistics.
Please keep in mind that the browser also shows corpora with upper-case letters in their names (e.g. deu-rftagged, ita-tagged, roh). These corpora contain data from our WhatsApp project.
Next to the name of each sub-corpus, you see the number of SMS (marked as "Texts") and tokens. You can use these figures for statistics.
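As a rough illustration of how these figures can be used, the sketch below normalises a hit count to occurrences per million tokens and computes the average SMS length of a sub-corpus. The numbers and function names are hypothetical placeholders, not values or tools from the corpus; replace them with the "Texts" and token figures shown in the browser and with the number of hits from your own query.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Normalise a raw hit count to occurrences per one million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens


def mean_tokens_per_text(corpus_tokens: int, corpus_texts: int) -> float:
    """Average number of tokens per SMS in a (sub-)corpus."""
    if corpus_texts <= 0:
        raise ValueError("corpus_texts must be positive")
    return corpus_tokens / corpus_texts


# Hypothetical figures: copy the real ones from the sub-corpus list.
texts = 10_000     # number of SMS, shown as "Texts" in the browser
tokens = 250_000   # number of tokens shown next to the sub-corpus name
hits = 125         # number of matches returned by your own query

print(per_million(hits, tokens))
print(mean_tokens_per_text(tokens, texts))
```

Normalising by the token count makes frequencies comparable across sub-corpora of different sizes, which raw hit counts are not.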
When you click on the small i for information to the right of each (sub-)corpus name, you will find more information about the corpus. More specifically:
On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
By clicking on the small sheet-of-paper icon next to the information i in the list of sub-corpora, you get a list of all SMS in the respective sub-corpus.
From here, you can click on full text to view the whole SMS (without any annotations).