
A (very) short introduction to RegEx

“In computing, regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.” (Wikipedia).

As Wikipedia tells us, regex takes a pattern of characters you enter into the search field and looks for matches of this pattern in the database. In the simplest case, the pattern is a combination of letters like man. Regex then looks for the letter <m> followed by an <a> and then an <n>, regardless of what precedes or follows the pattern. As a result, you might get man or manual, but you will not get Manchester, because the regex search is case sensitive (see below).

However, regex also allows you to search for such things as alternatives (man or men), for word boundaries or for the beginning or end of a line (here: beginning or end of an SMS). In fact, regex offers more than you can dream of. It is a syntax widely used in programming languages, but of course we cannot cover all the details here. Instead, we try to offer an easy overview of the functions you might use most often in this corpus. For more information, we refer you to regular-expressions.info or to the introduction to regex at the computer linguistics department of the UZH.

Case sensitivity

If you use the simple query or word query, your search is not case sensitive, i.e. the system does not differentiate between upper and lower case. So a search for MAN and one for man or mAn will come up with the same results.

In the regex search, this is different. Here, the system differentiates between upper and lower case by default. (You can change this setting on the start page, but we do not recommend it. Instead, get used to thinking of each character as an individual without an equivalent in the other case, and formulate your query accordingly.) Thus, you will not get man as a result when searching for MAN etc.
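The difference between the two behaviours can be tried out in any regex-capable language. The following is a minimal sketch in Python, not the corpus search itself: the sample messages are made up, and Python's re module is only assumed to behave like the corpus search in this respect.

```python
import re

# Hypothetical sample messages, not taken from the corpus.
messages = ["a man called", "see the manual", "Manchester won", "MAN overboard"]

# Case-sensitive search (the regex default): only lowercase "man" matches.
sensitive = [m for m in messages if re.search(r"man", m)]

# Case-insensitive search: upper and lower case are treated alike.
insensitive = [m for m in messages if re.search(r"man", m, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```

The case-sensitive list contains only the first two messages, while the case-insensitive list contains all four.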

Characters and letters

In regex, a character is not the same as a letter. Basically, everything you enter with one key on your keyboard is one character. (Sometimes you use two keys to type in one character, e.g. in the case of a circumflex such as <ê>. This is still considered to be one character.) So a digit, a full stop (.), a TAB or a carriage return (ENTER, RETURN etc.) is also a character you can search for.

Letters

Simple

You can search for every letter or combination thereof in the corpus by just typing it into the search field.
Example:
<man> will search for a lowercase <m> followed by a lowercase <a> and a lowercase <n>.

Alternatives

If you want to have alternative letters in one specific spot you can put the alternatives into square brackets.
Example:
m[aei]n 
will look for occurrences of either 
man
men
min
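
To check such a pattern outside the corpus, a short Python sketch may help. The word list is made up for illustration, and fullmatch is used so that each candidate has to match the pattern as a whole.

```python
import re

# Square brackets list the letters allowed in this one position.
pattern = re.compile(r"m[aei]n")

# Hypothetical candidate words.
words = ["man", "men", "min", "mon", "mun"]

matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```

Only man, men and min are matched, because <o> and <u> are not listed inside the brackets.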

Variable letters

If you are looking for any letter, you can use <\w> (Remember as: word character.), i.e. a backslash followed by a <w>.

Example:
m\wn
will look for (among others):
mAn
mBn
mCn
…
man
mbn
mcn
…

Something similar can be achieved with [a-z] and [A-Z] respectively. Here you look for occurrences of any letter as well, but this time case sensitive, e.g. <m[A-Z]n>. The range can also be narrowed, e.g. to <[m-q]>, which matches only the letters from m to q.

N.B.: <\w> covers all letters from A to z, i.e. uppercase and lowercase. In our corpus, it also includes the special letters äöüàéèß. However, it does not include special characters such as punctuation, spaces, <&> etc. Keep in mind that in most regex implementations <\w> additionally matches digits and the underscore.
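
The following Python sketch compares <\w> with a letter range. It is an illustration under assumptions: the candidate strings are made up, and Python's definition of <\w> (letters including äöü etc., plus digits and the underscore) may differ in detail from the one used by the corpus search.

```python
import re

# Hypothetical candidates with different characters in the middle position.
candidates = ["man", "mAn", "mün", "m&n", "m n", "m1n", "m_n"]

# \w in Python 3: any Unicode letter, but also digits and the underscore.
word_char = [c for c in candidates if re.fullmatch(r"m\wn", c)]

# [A-Z]: only the uppercase letters A to Z, case sensitive.
upper_only = [c for c in candidates if re.fullmatch(r"m[A-Z]n", c)]

print(word_char)
print(upper_only)
```

In Python, the first list contains man, mAn, mün, m1n and m_n, while the second contains only mAn.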

Any character

If you want to search for any character, use a full stop (.) instead.

Example:
m.n
will look for (among others):
mAn
mBn
man
mbn
m&n
m n
m_n
m?n
…
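
As a rough illustration in Python (again only a sketch with made-up candidates, not the corpus search), the full stop accepts exactly one character of any kind in its position:

```python
import re

# Hypothetical candidates; the last two have zero or two characters in the middle.
candidates = ["man", "mBn", "m&n", "m n", "m_n", "m?n", "mn", "maan"]

# The full stop stands for exactly one arbitrary character.
any_char = [c for c in candidates if re.fullmatch(r"m.n", c)]

print(any_char)
```

The strings mn and maan are not matched, because the full stop stands for exactly one character.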

Diacritics

This corpus is set up so as to recognize umlauts and letters with accents as individual characters. (Keep in mind that this is not the case in many other uses of regex. Especially in programs that were developed in the US, a <ü> is not considered a letter but rather a boundary.) Searching for mange will therefore not find any occurrences of mangé.
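
Both points can be sketched in Python. This is an illustration under assumptions, not a description of the corpus software: the sample strings are made up, and Python's re.ASCII flag is used here only to imitate an implementation that does not treat <ü> as a letter.

```python
import re

text = "j'ai mangé"  # hypothetical sample, with é as a single character

# <e> and <é> are different characters, so the plain pattern finds nothing.
found_plain = re.search(r"mange", text) is not None
found_accent = re.search(r"mangé", text) is not None

# Unicode-aware matching treats ü as a letter ...
unicode_words = re.findall(r"\w+", "für")
# ... whereas ASCII-only matching splits the word at the ü.
ascii_words = re.findall(r"\w+", "für", re.ASCII)

print(found_plain, found_accent)
print(unicode_words, ascii_words)
```

With the ASCII-only setting, für falls apart into f and r, which is the behaviour the warning above refers to.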

Digits

Just like <\w> above, you can use <\d> (Remember as: digit) to stand in for any digit.
Example:
n\d
will look for (among others):
n0
n1
…
n9
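
A small Python sketch of the same search, using a made-up message:

```python
import re

# Hypothetical message containing some number abbreviations.
text = "cu l8r, n8 n1 and n9"

# \d stands for one digit, so n\d finds an <n> directly followed by a digit.
hits = re.findall(r"n\d", text)

print(hits)
```

The <n> in and is not found, because it is followed by a letter rather than a digit.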


Separators

Individual separating characters

Many different characters can occur in between your letters and digits: commas, full stops, spaces etc. Most of these can just be used as they are, i.e.:
  • space
  • comma (,)
  • dash (-)
  • semicolon (;)
  • curly brackets ({})
  • colon (:)
  • ampersand (&)
  • percent (%)
  • exclamation mark (!)

NB: most of these characters do have a special function as well if they appear in a specific position. As you will see below, { } is one of the possible ways to search for repeating characters. Thus, the character { can be recognized as a character in its own right or as a syntactic function depending on its position. The same goes for most of these characters.

Other separators are reserved by the regex syntax. To use them by their ordinary value, you have to escape them, i.e. you have to place a backslash in front of them. Thus, you type in <m\*n> to look for m*n. These characters are:

  • asterisk (*)
  • full stop (.)
  • all other brackets ([()])
  • slash, pipe and backslash (/|\)
  • question mark (?)
  • plus (+)
  • dollar ($)
  • caret (^)
Did we forget anything? Quite possibly. Just type in the character you are wondering about on its own. If you get an error, you have to escape it.
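
The effect of escaping can be sketched in Python. The sample text is made up, and re.escape is a helper of Python's re module, not a feature of the corpus search; it is shown only because it adds the backslashes for you.

```python
import re

text = "m*n mn"  # hypothetical sample text

# Escaped asterisk: looks for the literal string m*n.
escaped = re.findall(r"m\*n", text)

# Unescaped asterisk: acts as a quantifier (zero or more <m>, then <n>).
unescaped = re.findall(r"m*n", text)

# Python can insert the backslashes for reserved characters automatically.
auto_escaped = re.escape("m*n")

print(escaped)
print(unescaped)
print(auto_escaped)
```

Without the backslash, the literal m*n is never found as a whole; the pattern instead picks up n and mn.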

Word boundaries

If you are looking for the word man but don't want any occurrences of manual, superman or manchester, you have to tell regex that man has to be surrounded by word boundaries. These can be simple spaces, but also commas, full stops etc. A word boundary is expressed by <\b> (Remember as: boundary.), i.e. a backslash followed by a <b>.

Example:
\bman\b
will look for (among others):
man
!man,
?man)
…
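
A minimal Python sketch with made-up messages shows which of them contain man as a whole word:

```python
import re

# Hypothetical messages.
messages = ["the man left", "read the manual", "superman!", "(man)", "manchester"]

# \b marks the position between a word character and a non-word character
# (or the start or end of the text).
whole_word = [m for m in messages if re.search(r"\bman\b", m)]

print(whole_word)
```

Only the first and the fourth message are kept; in manual, superman and manchester, man is directly attached to other letters.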

Beginning and end of SMS

You can search for a string that stands at the beginning of the SMS by typing in a <^> in front of the string. The end of the SMS is marked by a <$> sign.
Example:
^man
will look for SMS beginning with (among others):
manual work …
manuscripts …

man$
will look for SMS ending on (among others):
… I saw the man

^man$
will look for SMS with only the word man
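
The three searches can be sketched in Python as follows. We assume here that each SMS is stored as one string without line breaks; the messages themselves are made up.

```python
import re

# Hypothetical messages, one string per SMS.
messages = ["manual work today", "I saw the man", "man", "a man"]

starts_with = [m for m in messages if re.search(r"^man", m)]   # beginning of SMS
ends_with = [m for m in messages if re.search(r"man$", m)]     # end of SMS
only_man = [m for m in messages if re.search(r"^man$", m)]     # whole SMS is "man"

print(starts_with)
print(ends_with)
print(only_man)
```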


Quantifiers

Sometimes you might be looking for an expression which can be written with or without repeating letters. E.g. you might want to look for hallo, haaallo, halooooo etc. in the corpus. Since you do not know how often the individual vowels were repeated, you have to use quantifiers. Your options are as follows:

  • ?? ➝ 1 or 0 times
  • *? ➝ 0 or more times
  • +? ➝ 1 or more times
  • {n,m} ➝ at least n but not more than m times

Example:
h+?a+?l+?o+?
will find all variants of hallo

Quantifiers can do much more than this. The forms with a trailing question mark given here are called non-greedy (or lazy): they match as few repetitions as possible. There are also greedy quantifiers (<?>, <*>, <+>), which match as many repetitions as possible, and possessive ones. Please refer to a more detailed manual for these functions.

Hint: if you find these options too complicated, consider using <{n,m}> only. With this function, you can fulfill nearly all requirements and it is the easiest function to remember.
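
The following Python sketch tries the quantifiers on a few made-up spellings. It is only an illustration; the upper limits chosen in the <{n,m}> pattern are arbitrary assumptions.

```python
import re

# Hypothetical spellings; the last two are not variants of hallo.
variants = ["hallo", "haaallo", "halooooo", "hllo", "hello"]

# Non-greedy pattern from the example above.
lazy = re.compile(r"h+?a+?l+?o+?")
found = [v for v in variants if lazy.search(v)]

# The same idea written with {n,m} only (limits chosen arbitrarily).
counted = re.compile(r"ha{1,3}l{1,2}o{1,5}")
found_counted = [v for v in variants if counted.fullmatch(v)]

# Non-greedy stops as early as possible, greedy takes as much as possible.
lazy_match = lazy.search("halooooo").group()
greedy_match = re.search(r"h+a+l+o+", "halooooo").group()

print(found, found_counted)
print(lazy_match, greedy_match)
```

Both patterns select the same three spellings. The last two lines show the difference in what is actually matched: the non-greedy pattern stops at halo, the greedy one takes the whole of halooooo.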


Alternatives

Above, you have seen that you can search for different letters in one spot, e.g. you can search for man and men with the expression <m[ae]n>. But what if you want to look for either n8 or night or nacht or nuit? Here you have to treat <8> as an alternative to the letter sequences <ight>, <acht> and <uit>. To achieve this, you put the alternatives in parentheses and separate the individual variants by a pipe (|).

Example:
n(8|acht|ight|uit) 
will look for:
n8
nacht
night
nuit
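
A short Python sketch of this search, with a made-up word list:

```python
import re

# The pipe separates whole alternatives inside the parentheses.
pattern = re.compile(r"n(8|acht|ight|uit)")

# Hypothetical candidates; the last two are not among the alternatives.
words = ["n8", "nacht", "night", "nuit", "noche", "n9"]

matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```

Unlike square brackets, which stand for a single character, each alternative here can consist of several characters.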


A final word

What you have read here is only a selection of the possibilities regex offers. To keep things more or less simple for you, we tried to document all the features you are likely to use while omitting everything you probably will not care about. Also, there are different implementations of regex in different programs, and they do not all support the same features. Thus, if you want to use regex elsewhere, please read the relevant manual. If you need more functions, check regular-expressions.info.

Please don't forget to quote the corpus in your work.
Topic revision: r1 - 28 Apr 2015, SimoneUeberwasser
This site is powered by Foswiki. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.


The corpora and documentation are licensed under the Creative Commons license: Attribution + Noncommercial:
- Licensees may copy, distribute, display, and perform the work and make derivative works based on it only for noncommercial purposes.
- Licensees may copy, distribute, display and publish the work and make derivative works based on it only if they give the author or licensor the credits as follows:
Stark, Elisabeth; Ueberwasser, Simone; Ruef, Beni (2009-2014). Swiss SMS Corpus. University of Zurich. https://sms.linguistik.uzh.ch
