05_facts_and_figures:01_corpus
Revisions compared: 2022/01/04 15:32 (created, Simone Ueberwasser) and 2022/06/27 09:21 (current, external edit 127.0.0.1)
The corpus consists of roughly 500'
===== Number of characters =====
The problem becomes even bigger when trying to count the total number of tokens per language/
Olla fratello!!! Come stai? Wie geht's dir so? Immer noch so lange am arbeiten wie früher? Ich hab endlich mein eigenes Restaurant und mucho travajo...;
This SMS is counted as Standard German because that language contributes the most words. However, when we quote the number of Standard German words in the corpus, all the Italian, Spanish, Swiss German and English words in this SMS are included in that total as well.
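
The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-level language tags, the language codes and the function names are hypothetical and are not taken from the corpus annotation. The sketch assigns a message to the language that contributes the most words and then compares totals counted per message with totals counted per word.

```python
from collections import Counter

# Hypothetical word-level tags for a short mixed-language message.
# The tags are illustrative guesses, not the corpus annotation.
tagged_message = [
    ("fratello", "ita"),
    ("Come", "ita"),
    ("stai", "ita"),
    ("Wie", "deu"),
    ("geht's", "deu"),
    ("dir", "deu"),
    ("so", "deu"),
    ("mucho", "spa"),
]


def main_language(tokens):
    """Return the language that contributes the most tokens to a message."""
    counts = Counter(lang for _, lang in tokens)
    return counts.most_common(1)[0][0]


def per_message_totals(messages):
    """Credit every token of a message to that message's main language."""
    totals = Counter()
    for tokens in messages:
        totals[main_language(tokens)] += len(tokens)
    return totals


def per_token_totals(messages):
    """Credit each token to its own language tag."""
    totals = Counter()
    for tokens in messages:
        totals.update(lang for _, lang in tokens)
    return totals


messages = [tagged_message]
by_message = per_message_totals(messages)
by_token = per_token_totals(messages)
print(by_message)
print(by_token)
```

In this toy message, counting per message credits all eight words to the main language, while counting per word credits it with only four. The gap between the two totals is the overcount described above.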
05_facts_and_figures/01_corpus.1641306769.txt.gz · Last modified: 2022/06/27 09:21 (external edit)