
SMS Navigator: Start Page

The start page is the entry point for every query. It consists of four parts:


Types of queries

You can use the following methods to search for patterns in the corpus:
  • Simple Query: With simple query selected, any combination of letters you type in the entry field is searched for, regardless of the letters and signs surrounding it. A search for man will find every occurrence of the word man, but also all occurrences of Manchester, mandatory etc.
  • Word Query: Unlike the simple query, the word query looks for the combination of letters you typed only where it is surrounded by something other than letters. A search for man will not return Manchester or mandatory, but it will return man or man. or any other combination of man with punctuation or spaces.
  • Regex Query: This should really be the preferred choice for experienced researchers. “In computing, regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.” (Wikipedia).
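
The difference between the three query types can be illustrated with a minimal Python sketch. It only mirrors the matching rules described above; it is not the Navigator's actual implementation, and the function names and sample messages are made up for illustration.

```python
import re

# Made-up sample messages, not taken from the corpus.
SAMPLE = [
    "I met a man today",
    "Going to Manchester",
    "What a man.",
    "mandatory meeting",
]

def simple_query(term, messages):
    """Substring match: the term may sit inside a longer word."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [m for m in messages if pattern.search(m)]

def word_query(term, messages):
    """Match the term only where no letter directly precedes or follows it."""
    # [^\W\d_] matches exactly one letter (word character minus digits and "_").
    pattern = re.compile(
        rf"(?<![^\W\d_]){re.escape(term)}(?![^\W\d_])", re.IGNORECASE
    )
    return [m for m in messages if pattern.search(m)]

def regex_query(pattern, messages):
    """Use the input as a regular expression, case sensitive by default."""
    compiled = re.compile(pattern)
    return [m for m in messages if compiled.search(m)]

print(simple_query("man", SAMPLE))        # all four messages
print(word_query("man", SAMPLE))          # ['I met a man today', 'What a man.']
print(regex_query(r"\bman\w+", SAMPLE))   # ['mandatory meeting']
```

The regular expression syntax shown here is Python's; the dialect accepted by the Navigator may differ in details.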

Other options:
  • Authors: Links back to the legal information in this documentation
  • Documentation: Links back to this documentation

Entry fields

Type the token you are looking for into the entry field. It is treated differently depending on the query type selected.

The search does not work the way a Google search does. A Google search for you too will find you too but also you said that too. The SMS Navigator will only find occurrences of you too and ignore you said that too, because the Navigator does not recognize the two words you and too as individual expressions. Instead, the letter combination y-o-u-SPACE-t-o-o is treated as a single unit.

If you want to see all SMS in the corpus, just leave the field blank and press _Start Query…_.

If you need more sophisticated search options like you in the same sentence as too, please consider a regex search.
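
As a hedged illustration, the following Python sketch contrasts the literal treatment of you too with a regular expression that finds you followed by too within the same sentence. The pattern assumes that a sentence ends at ".", "!" or "?" and uses Python's regex syntax; the dialect accepted by the Navigator may differ, and the variable names are made up.

```python
import re

# Literal search: the input is one unit, y-o-u-SPACE-t-o-o.
PHRASE = "you too"

# Regex search: "you", then any characters that do not end a sentence, then "too".
# Only covers the order "you ... too"; case sensitive, like the regex query default.
SAME_SENTENCE = re.compile(r"\byou\b[^.!?]*\btoo\b")

for text in ["you too", "you said that too", "I saw you. Me too"]:
    literal_hit = PHRASE in text
    regex_hit = SAME_SENTENCE.search(text) is not None
    print(f"{text!r}: literal={literal_hit}, regex={regex_hit}")
```

Only the first message contains the literal phrase, while the regular expression matches the first two and rejects the third because a full stop separates the two words.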


Select Corpus & Annotations

The first two selections, Select Corpus and Annotations, are only available for reasons of compatibility. No data can be selected here.


Two options are possible here. You can either select all SMS or only those for which demographic data is available.

Case Sensitive

If this option is set to no, as is the default for simple query and word query, the search is performed regardless of the case of the individual letters. Thus, you, You and YOU are all considered the same and will return the same results.

The default for regex query is for case sensitivity to be set to yes, meaning you, You and YOU are three distinct expressions. A search for you will thus not find any occurrences of YOU etc.

Of course, case sensitivity can be set manually for each individual search in each query type.
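
The defaults and the manual override can be sketched in Python as follows. This is only an illustration of the behaviour described above; the dictionary, the function name and its parameters are hypothetical and not part of the Navigator.

```python
import re
from typing import Optional

# Defaults as described above: simple and word queries ignore case,
# regex queries are case sensitive.
DEFAULT_CASE_SENSITIVE = {"simple": False, "word": False, "regex": True}

def compile_query(pattern: str, query_type: str,
                  case_sensitive: Optional[bool] = None) -> "re.Pattern":
    """Compile a pattern, using the query type's default unless overridden."""
    if case_sensitive is None:
        case_sensitive = DEFAULT_CASE_SENSITIVE[query_type]
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(pattern, flags)

# Default for simple query: "you" also finds "YOU".
print(compile_query("you", "simple").search("YOU") is not None)   # True
# Default for regex query: "you" does not find "YOU".
print(compile_query("you", "regex").search("YOU") is not None)    # False
# Manual override: case-insensitive regex query.
print(compile_query("you", "regex", case_sensitive=False).search("YOU") is not None)  # True
```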

Page Size

The page size parameter defines how many SMS are displayed on each screen. A small value makes each page easier to take in at a glance, while a large value puts more results on one page, which makes searching within the result view more effective.
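
The effect of the page size can be pictured with a short Python sketch that splits a result list into pages. The function name is hypothetical, and the sketch makes no claim about how the Navigator pages its results internally.

```python
def paginate(results, page_size):
    """Split a list of results into consecutive pages of at most page_size items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [results[i:i + page_size] for i in range(0, len(results), page_size)]

# Seven results with a page size of 3 give three pages; the last one is shorter.
pages = paginate(list(range(1, 8)), 3)
print(pages)       # [[1, 2, 3], [4, 5, 6], [7]]
print(len(pages))  # 3
```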


Every SMS has been tagged for language in three different ways. All selections offered here are combined as AND searches, i.e. if you select Swiss German as the main language, English as a borrowing language and Romansch as a nonce borrowing language, you will only find SMS that actually fulfill all three conditions. In the example given here, you will most likely get no results. These selections are applied in addition to what you typed in the entry field, so the system looks for SMS that fulfill your language selection and also contain the search string you entered.

For the main language, you can also select multilingual SMS, i.e. SMS with more than one main language. Very often, these are rather short SMS, like "yes, gut.".

The two other language taggings, i.e. borrowings and nonce borrowings, offer the additional options to choose SMS with any language annotation or with none of them.
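
The AND combination of the language selections with the search string can be sketched in Python as follows. The data model, the field names and the sample messages are invented for illustration and do not reflect the corpus's actual storage format; the multilingual, "any" and "none" options are left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Sms:
    # Hypothetical record: one main language plus two sets of language tags.
    text: str
    main_language: str
    borrowings: set = field(default_factory=set)
    nonce_borrowings: set = field(default_factory=set)

def matches(sms, search="", main=None, borrowing=None, nonce=None):
    """True only if the search string AND every selected language condition hold."""
    return (
        search.lower() in sms.text.lower()   # an empty search string matches every SMS
        and (main is None or sms.main_language == main)
        and (borrowing is None or borrowing in sms.borrowings)
        and (nonce is None or nonce in sms.nonce_borrowings)
    )

# Invented sample data.
corpus = [
    Sms("sorry, chume spöter", "Swiss German", borrowings={"English"}),
    Sms("bis morn", "Swiss German"),
    Sms("see you later", "English"),
]

two_filters = [s.text for s in corpus
               if matches(s, main="Swiss German", borrowing="English")]
three_filters = [s.text for s in corpus
                 if matches(s, main="Swiss German", borrowing="English", nonce="Romansch")]
print(two_filters)    # ['sorry, chume spöter']
print(three_filters)  # []
```

Adding a third condition can only narrow the result, which is why the combination in the example above most likely returns nothing.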

Version History

This corpus is continuously being developed further. Thus, the data within the corpus can change from time to time, especially the tagging of the individual languages. When writing papers about the corpus, please always cite the corpus version date shown on the start page to make clear which version of the corpus your study is based on.

Please don't forget to quote the corpus in your work.
Topic revision: r1 - 29 Apr 2015, SimoneUeberwasser
This site is powered by Foswiki. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

The corpora and documentation are licensed under the Creative Commons license: Attribution + Noncommercial:
- Licensees may copy, distribute, display, and perform the work and make derivative works based on it only for noncommercial purposes.
- Licensees may copy, distribute, display and publish the work and make derivative works based on it only if they give the author or licensor the credits as follows:
Stark, Elisabeth; Ueberwasser, Simone; Ruef, Beni (2009-2014). Swiss SMS Corpus. University of Zurich.
