====== Tokenizing ======
03_processing:04_tokenizing — created 2022/01/04 11:43 by simone.ueberwasser.ds.uzh.ch; last modified 2022/06/27 07:21 (external edit).
In very general terms, a token can be seen as a word. Every sentence consists of different words; from a technical point of view, we call them tokens. However, that is not all there is to a sentence: there is also punctuation.
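As a small illustration (a sketch for this page only, not the corpus's actual tooling), a sentence can be split into word tokens and punctuation tokens with a simple regular expression:

```python
import re

# Split a sentence into word tokens and punctuation tokens.
# \w+ matches runs of word characters, [^\w\s] matches single
# punctuation marks that are neither word characters nor whitespace.
sentence = "However, that is not all!"
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ['However', ',', 'that', 'is', 'not', 'all', '!']
```

Note that the comma and the exclamation mark each become tokens of their own, separate from the words they are attached to in writing.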
An out-of-the-box tokenizer, as it exists in computational linguistics, is designed for standard written language. When tokenizing the SMS corpus with such an ordinary tokenizer, the automatic output therefore has to be corrected for nonstandard spellings, as in the following examples:
^ Language ^ automatic tokenization ^ corrected to ^ Translation ^
| French | [jsuis] | [je][suis] | I am |
| French | [ajourd' | | |
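The kind of correction shown in the table can be sketched as a lookup from automatic tokens to corrected token sequences. This is a minimal illustration under assumed names (`naive_tokenize`, `CORRECTIONS`, `corrected_tokenize` are hypothetical, not part of the corpus pipeline):

```python
import re

# Naive tokenizer: splits on whitespace and punctuation, keeping
# apostrophe-internal forms together. Nonstandard SMS spellings such
# as "jsuis" ("je suis") come out as a single token.
def naive_tokenize(text):
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Hypothetical correction table, in the spirit of the examples above:
# each automatic token maps to its corrected token sequence.
CORRECTIONS = {
    "jsuis": ["je", "suis"],  # "I am"
}

def corrected_tokenize(text):
    tokens = []
    for tok in naive_tokenize(text):
        tokens.extend(CORRECTIONS.get(tok.lower(), [tok]))
    return tokens

print(naive_tokenize("jsuis fatigué"))      # ['jsuis', 'fatigué']
print(corrected_tokenize("jsuis fatigué"))  # ['je', 'suis', 'fatigué']
```

In practice such a table would be much larger and context-sensitive; the point here is only the two-step structure: automatic tokenization first, correction of nonstandard forms second.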
