Sub-corpora

The corpus all-tagged contains all SMS in all languages. Data for all languages except Romansh are tagged with TreeTagger.

In addition, the following sub-corpora are available for each language:

The list of sub-corpora is also a good starting point for finding out which fields are available for your query, and for finding examples and statistics.

Please keep in mind that you also see corpora with upper-case letters in the browser (e.g. deu-rftagged, ita-tagged, roh etc.). These corpora contain data from our WhatsApp project.

Tokens and messages per sub-corpus

Next to the name of each sub-corpus, you see the number of SMS (marked as "Texts") and tokens. You can use these figures for statistics.
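
As a minimal sketch of how these figures could be used, the following Python snippet computes the average number of tokens per SMS and each sub-corpus's share of all SMS. The sub-corpus names and counts are hypothetical placeholders, not real figures from the corpus; replace them with the "Texts" and token numbers shown in the browser.

```python
# Hypothetical figures, to be replaced with the "Texts" and token counts
# shown next to each sub-corpus name. These are NOT real corpus numbers.
counts = {
    "sub-corpus-a": {"texts": 1000, "tokens": 12500},
    "sub-corpus-b": {"texts": 400, "tokens": 6000},
}


def tokens_per_text(texts: int, tokens: int) -> float:
    """Return the average number of tokens per SMS."""
    if texts <= 0:
        raise ValueError("texts must be positive")
    return tokens / texts


def text_shares(figures: dict) -> dict:
    """Return each sub-corpus's share of the total number of SMS."""
    total = sum(entry["texts"] for entry in figures.values())
    return {name: entry["texts"] / total for name, entry in figures.items()}


averages = {
    name: tokens_per_text(entry["texts"], entry["tokens"])
    for name, entry in counts.items()
}
shares = text_shares(counts)

for name in counts:
    print(f"{name}: {averages[name]:.1f} tokens per SMS, "
          f"{shares[name]:.1%} of all SMS")
```

Relative figures like these are usually easier to compare across sub-corpora of different sizes than raw counts.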

Information about the (sub-)corpora

Clicking the small i (for information) to the right of a (sub-)corpus name opens a window with more information about that corpus. More specifically:

On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.

List of SMS in the sub-corpus

Clicking the small document icon next to the information i in the list of sub-corpora opens a list of all SMS in the respective sub-corpus.

From here, you can click on full text to view the whole SMS (without any annotations).