05_facts_and_figures:01_corpus

Differences

This shows you the differences between two versions of the page.

05_facts_and_figures:01_corpus [2022/01/04 15:33] – [Tokens per language] Simone Ueberwasser
05_facts_and_figures:01_corpus [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1:
-====== Facts and Figures: the Corpus ======
+====== The Corpus ======
 The corpus consists of roughly 500'000 tokens. However, counting tokens reliably is nearly impossible in a corpus that contains so many emoticons and other special characters and whose spelling deviates greatly from the norm. For example, one participant does not use any spaces in his SMS, so each of his SMS is counted as a single token. The figure should therefore be read as an approximation.
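
The effect described above can be illustrated with a minimal sketch. It assumes that tokens are counted by splitting on whitespace, which this page does not state explicitly; the function name `count_tokens` and the two sample messages are invented for illustration and are not taken from the corpus.

```python
def count_tokens(message: str) -> int:
    """Count tokens in one message by splitting on whitespace.

    Assumption: whitespace splitting stands in for whatever
    tokenizer was actually used to count the corpus.
    """
    return len(message.split())


# Hypothetical sample messages (not corpus data).
messages = [
    "see you at 8 :-)",   # ordinary spacing
    "seeyouat8:-)",       # same content written without spaces
]

counts = [count_tokens(m) for m in messages]
print(counts)  # [5, 1]
```

Under this assumption the second message contributes one token instead of five, which is why the overall total can only be an estimate.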
 ===== Number of characters =====
05_facts_and_figures/01_corpus.1641306824.txt.gz · Last modified: 2022/06/27 09:21 (external edit)