The corpus all-tagged contains all SMS in all languages. Data for all languages except Romansh are tagged with TreeTagger.
In addition, the following sub-corpora per language are available:
The list of sub-corpora is also a good starting point for finding out which fields are available for your query, and for getting examples and statistics.
Please keep in mind that you also see corpora with upper-case names in the browser (e.g. DEU-RFTAGGED, ITA-TAGGED, ROH etc.). These corpora contain data from our WhatsApp project.
Next to the name of each sub-corpus, you see the number of SMS (marked as "Texts") and tokens. You can use these figures for statistics.
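As a rough illustration of how these two figures can be used, the sketch below computes the average SMS length and a hit frequency normalised per 1,000 tokens. The function names and all numbers are invented placeholders, not values from any real sub-corpus; replace them with the figures shown in the browser.

```python
def mean_tokens_per_text(tokens: int, texts: int) -> float:
    """Average number of tokens per SMS in a sub-corpus."""
    if texts <= 0:
        raise ValueError("texts must be positive")
    return tokens / texts


def hits_per_thousand_tokens(hits: int, tokens: int) -> float:
    """Normalise a raw hit count so sub-corpora of different size are comparable."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return 1000 * hits / tokens


# Placeholder figures (hypothetical, NOT taken from the corpus).
texts = 2000    # value shown as "Texts" next to the sub-corpus name
tokens = 40000  # value shown as tokens next to the sub-corpus name
hits = 120      # number of matches returned by some query

print(mean_tokens_per_text(tokens, texts))
print(hits_per_thousand_tokens(hits, tokens))
```

Normalising by the token count matters because the sub-corpora differ in size, so raw hit counts cannot be compared directly.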
When you click on the small i (for information) to the right of each (sub-)corpus name, you find more information about the corpus. More specifically:
On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
node & meta::mt_fr="false" is entered into the query field. More precisely: node will also fetch all tokens that are in such SMS; if you want to distinguish between messages and tokens, you should explicitly query for one or the other: tok & … or msg & … .
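The difference between the three query forms can be sketched with a small helper that puts a search element in front of a metadata condition. The helper build_query is hypothetical and not part of the corpus browser. The sketch also assumes that the part elided after tok & and msg & is the same metadata condition as in the node example, which the text above does not state explicitly.

```python
# Sketch only: this helper is hypothetical and simply assembles query strings.
ELEMENTS = ("node", "tok", "msg")


def build_query(element: str, condition: str) -> str:
    """Combine a search element (node, tok or msg) with a metadata condition."""
    if element not in ELEMENTS:
        raise ValueError(f"unknown search element: {element!r}")
    return f"{element} & {condition}"


# Metadata condition taken from the example above.
condition = 'meta::mt_fr="false"'

# Assumption: tok and msg accept the same condition as node.
queries = {element: build_query(element, condition) for element in ELEMENTS}

for query in queries.values():
    print(query)
```

Under this assumption, the node form matches both messages and tokens, while the tok and msg forms restrict the result to one of the two.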
By clicking on the little piece of paper next to the information i in the list of sub-corpora, you get a list of all SMS in the respective sub-corpus.
From here, you can click on full text to view the whole SMS (without any annotations).