ANNIS

ANNIS, which stands for ANNotation of Information Structure, is an open-source search and visualization tool for multilayer linguistic corpora. This makes it well suited to querying the part of our corpus that was normalized and PoS annotated, because it lets you search the same corpus across multiple layers.

The query syntax used by ANNIS has a steep learning curve, as is the case with most tools that offer an abundance of options. However, ANNIS also offers an excellent tutorial that is built into the browsing tool, so we refrain from writing an additional manual here. Instead, we only show you how to get started and how to find the built-in manual. If you would rather get started right away without reading the manual, we recommend that you look at a few examples that we put together based on our own corpus. They should give you a good idea of how to create your own queries.


Getting started

When you start ANNIS, you get a screen similar to this one:

ANNIS_start.png

Because you are not logged in (1), you cannot select any corpora (2) and see the tutorial (3) instead. This might be a good moment to have a look at the tutorial. Otherwise, make sure that you are a registered user and log in (1) with the credentials that you find on the starting page. Your screen then changes slightly:

ANNIS_screen.png

In the bottom left, you can now select the subcorpus with which you want to work (1). In the top left (2), you can type in your query or create one with the tools provided. In the center, you see some examples until you type in your own query in (2). You can come back to these examples, as well as to the tutorial, at any time by clicking Tutorial (4).


Corpora

Non-dialectal German

  • label: deu-tagged
  • Contains all the SMS with non-dialectal German (deu) as a main language
  • 7'228 SMS with a total of 175'500 tokens
  • The SMS were normalized to Standard German spelling and PoS tagged

French

  • label: fra-tagged
  • Contains all the SMS with French (fra) as a main language
  • 4'626 SMS with a total of 122'500 tokens
  • The SMS were normalized to Standard French

Swiss German dialect

  • label: gsw-tagged
  • Contains all the SMS with Swiss German (gsw) as a main language
  • 10'674 SMS with a total of 288'400 tokens
  • The SMS were normalized to Standard German and PoS tagged

Italian

  • label: ita-tagged
  • Contains all the SMS with Italian (ita) as a main language
  • 1'475 SMS with a total of 38'700 tokens
  • The SMS were normalized to Standard Italian

Romansh

  • label: roh
  • Contains all the SMS with a variety of Romansh as a main language
  • 1'118 SMS with a total of 29'500 tokens
  • The idiom used is specified with the lang_main attribute (see also example query below):
    • roh-sr: romontsch sursilvan
    • roh-st: rumàntsch sutsilvan
    • roh-sm: rumantsch surmiran
    • roh-pt: rumauntsch puter
    • roh-vl: rumantsch vallader
    • roh-gr: rumantsch grischun
  • Tokens were normalized to Rumantsch Grischun.
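
To compare the subcorpora at a glance, the figures above can be put side by side. The following is only a small sketch: the counts are copied from the list above (the token counts appear to be rounded), while the names SUBCORPORA and tokens_per_sms are made up for this illustration and are not part of ANNIS or the corpus.

```python
# (number of SMS, number of tokens) per subcorpus, copied from the list above.
# The dictionary and function names are hypothetical, chosen for this sketch.
SUBCORPORA = {
    "deu-tagged": (7228, 175500),
    "fra-tagged": (4626, 122500),
    "gsw-tagged": (10674, 288400),
    "ita-tagged": (1475, 38700),
    "roh": (1118, 29500),
}


def tokens_per_sms(label):
    """Average number of tokens per SMS in the given subcorpus."""
    sms, tokens = SUBCORPORA[label]
    return tokens / sms


# Totals over the five subcorpora listed on this page only.
total_sms = sum(sms for sms, _ in SUBCORPORA.values())
total_tokens = sum(tokens for _, tokens in SUBCORPORA.values())

for label in SUBCORPORA:
    print(f"{label}: {tokens_per_sms(label):.1f} tokens per SMS")
print(f"listed subcorpora: {total_sms} SMS, {total_tokens} tokens")
```

The totals cover only the five subcorpora listed here, so they should not be read as the size of the whole SMS collection.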


Layers, labels and values

Our corpus, as it is available in ANNIS, is built up of different layers, which can be queried and (except for meta) displayed. Here is an example:

Query: abbrev ="" & lang = "eng" & #1.#2

As you can see, the expressions searched for are highlighted.

Annis_example.png

You can see the following layers:
  • tok: Original Token
  • abbrev: Was this token marked as an abbreviation by our team? If yes, this layer is marked as true. This layer is only visible in the results if you actually queried for abbreviations.
  • gloss: Normalized layer
  • lang: Nonce borrowings as they were annotated by the team
  • lemma: Lemma as it was annotated by the TreeTagger
  • pos: Part of Speech as it was annotated by the TreeTagger

The additional layer meta cannot be displayed, but it can be queried.

When working with the pos layer, you can use the following tag sets:


Sample queries

Beni Ruef was kind enough to put together these queries from all our subcorpora. If you have a close look at the syntax, you will learn how to build your own queries by combining the individual elements.

Male teenagers writing about school:
meta::sex="M" & meta::age=/1[3-9]/ & lemma="Schule"

All French nonce borrowings (in German, Romansh or French subcorpus):
lang="fra"

The three tokens es hat noch in exactly this order:
/[Ee]s/ & "hat" & "noch" & #1.#2 & #2.#3

All spelling variants of gesagt:
gloss="gesagt"

Typical Swiss German possessive construction (e.g. em Hans sis Hus ≈ dem Hans sein Haus):
pos="ART" & pos=/(NE|NN)/ & pos="PPOSAT" & pos="NN" & #1.#2 & #2.#3 & #3.#4

Emphasis using the personal pronoun (1sg) before the verb in Italian:
lemma="io" & pos=/VER:.+/ & #1.#2

All Swiss German nonce borrowings from Puter speakers:
lang="gsw" & meta::lang_main="roh-pt"
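
The sequence queries above all follow the same pattern: the search terms are joined with & and each pair of neighbouring terms is linked with the direct precedence operator (#1.#2, #2.#3, and so on). The following Python sketch merely assembles such query strings; it does not talk to ANNIS, and the function names are invented for this illustration. The last part checks which ages the pattern 1[3-9] from the first query covers, assuming that ANNIS matches a regular expression against the whole value rather than a part of it.

```python
import re


def precedence_chain(n):
    """Return the links '#1.#2', '#2.#3', ... that connect n terms in order."""
    return [f"#{i}.#{i + 1}" for i in range(1, n)]


def sequence_query(*terms):
    """Join the terms with ' & ' and require them to follow each other directly."""
    if not terms:
        raise ValueError("at least one search term is required")
    return " & ".join(list(terms) + precedence_chain(len(terms)))


# Rebuilds the 'es hat noch' query from the samples above.
es_hat_noch = sequence_query("/[Ee]s/", '"hat"', '"noch"')
print(es_hat_noch)

# Rebuilds the Swiss German possessive construction query.
possessive = sequence_query('pos="ART"', "pos=/(NE|NN)/", 'pos="PPOSAT"', 'pos="NN"')
print(possessive)

# Ages covered by the pattern 1[3-9], assuming the pattern must match
# the whole value (hence fullmatch rather than search).
AGE_PATTERN = re.compile(r"1[3-9]")
teen_ages = [age for age in range(10, 25) if AGE_PATTERN.fullmatch(str(age))]
print(teen_ages)
```

The generated strings can be pasted into the query field in the top left of the ANNIS screen. For terms that do not have to be adjacent, such as the meta conditions in the first sample query, joining them with & alone is enough.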

Please don't forget to cite the corpus in your work.

Topic revision: r7 - 02 Aug 2016, SimoneUeberwasser
This site is powered by Foswiki. Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.


The corpora and documentation are licensed under the Creative Commons license: Attribution + Noncommercial:
- Licensees may copy, distribute, display, and perform the work and make derivative works based on it only for noncommercial purposes.
- Licensees may copy, distribute, display and publish the work and make derivative works based on it only if they give the author or licensor the credits as follows:
Stark, Elisabeth; Ueberwasser, Simone; Ruef, Beni (2009-2014). Swiss SMS Corpus. University of Zurich. https://sms.linguistik.uzh.ch