03_processing:04_tokenizing
Differences
This shows you the differences between two versions of the page.
03_processing:04_tokenizing [2022/01/04 12:44] – Simone Ueberwasser | 03_processing:04_tokenizing [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 2: | Line 2: | ||
In very general terms, a token can be seen as a word. Every sentence consists of different words; from a technical point of view, we call them tokens. However, that is not all there is to a sentence: there is punctuation, | In very general terms, a token can be seen as a word. Every sentence consists of different words; from a technical point of view, we call them tokens. However, that is not all there is to a sentence: there is punctuation, | ||
- | An out-of-the-box tokenizer, as they exist in computational linguistics, | + | An out-of-the-box tokenizer, as they exist in computational linguistics, |
When tokenizing the SMS corpus, an ordinary tokenizer, as it is used in computational linguistics, | When tokenizing the SMS corpus, an ordinary tokenizer, as it is used in computational linguistics, |
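
To make the idea of a token concrete, the following is a minimal sketch of a naive tokenizer that treats each run of word characters and each punctuation character as its own token. It is not the tokenizer used for the SMS corpus; the function name `simple_tokenize` and the regular expression are assumptions chosen purely for illustration.

```python
import re

# Hypothetical pattern (an assumption, not the corpus tokenizer):
# either a run of word characters, or one character that is
# neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_tokenize("Hello, world!"))
    print(simple_tokenize("c u l8r :-)"))
```

With this sketch, an emoticon such as `:-)` comes out as three separate punctuation tokens, which is one way a generic rule can behave unexpectedly on SMS-style text.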
03_processing/04_tokenizing.1641296669.txt.gz · Last modified: 2022/06/27 09:21 (external edit)