03_processing:04_tokenizing
Created 2022/01/04 12:43 by Simone Ueberwasser; last modified 2022/06/27 09:21 (external edit).
In very general terms, a token can be seen as a word. Every sentence consists of different words; from a technical point of view, we call them tokens. However, that is not all there is to a sentence: there is also punctuation,

An out-of-the-box tokenizer, such as those used in computational linguistics,

When tokenizing the SMS corpus, an ordinary tokenizer, as used in computational linguistics,
^Language^automatic tokenization^corrected to^Translation^
|French|[jsuis]|[je][suis]|I am|
|French|[ajourd'
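The kind of correction shown in the table can be sketched as a post-processing step on top of an ordinary tokenizer. This is a minimal illustration only, not the project's actual tool: the regex, the function names, and the `CORRECTIONS` mapping (beyond the "jsuis" example taken from the table) are assumptions.

```python
import re

# Hypothetical correction table for SMS-specific contractions.
# Only the "jsuis" entry comes from the table above; the rest of the
# setup is an assumption for illustration.
CORRECTIONS = {
    "jsuis": ["je", "suis"],  # "I am"
}

def tokenize(text):
    """Naive out-of-the-box tokenization: word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def tokenize_sms(text):
    """Tokenize, then expand known SMS contractions into standard tokens."""
    tokens = []
    for tok in tokenize(text):
        # Replace a contracted token by its expanded form, if one is known.
        tokens.extend(CORRECTIONS.get(tok.lower(), [tok]))
    return tokens

print(tokenize("jsuis content!"))      # the automatic tokenization
print(tokenize_sms("jsuis content!"))  # the corrected tokenization
```

The point of the sketch is that the tokenizer itself stays generic; the corpus-specific knowledge lives entirely in the correction table, which can be extended entry by entry as new non-standard forms are found.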