Powered by <TEI:TOK>
Maarten Janssen, 2014-

TEITOK Help Pages

Standoff Annotations

The biggest drawback of using XML for linguistic annotation is that XML is strictly hierarchical: tags cannot overlap. That is why pages in TEI are not marked as a block; only the beginning of a page is marked, by an empty tag, and its end is implicit. Otherwise, a paragraph that spans a page break would break the XML. It is also why, in TEITOK, XML elements are split up when they cut across a token (or sentence) boundary. However, such solutions only go so far. Whenever the corpus needs to encode information that spans multiple words and that can overlap with other types of information, there is no way to encode it inside the XML file itself. The only way to encode such information is by what is called stand-off annotation: the information is stored in a separate file that is linked to the TEI document.

Stand-Off in TEITOK

In TEITOK, stand-off files are themselves XML files, stored in the Annotations folder. Each type of stand-off annotation is kept in a separate file, so for error annotation there can be a file Annotations/error.xml. In principle, all error annotations for all XML files are kept in a single file (although they can be physically kept in separate files using XML inclusion). The XML file consists of three parts: the first part contains a description of the annotation type, the second part defines which tags are used in the annotation, and the third part contains the actual annotations as a list of annotations per XML file.

Say we have a file FILE001.xml in our xmlfiles folder, and we want to mark names in that (and other) files. Our names can overlap, so we cannot just add them to the XML: in National Bank of Scotland we want to be able to mark both National Bank and Bank of Scotland as a name, which would lead to invalid XML if we marked both inline. Instead, we keep a file called names_FILE001.xml in the Annotations folder. If National is the 3rd word in the text (w-3), then in that file we can define that [w-3,w-4] is a name and that [w-4,w-5,w-6] is a name; since both are stored as entities referring to the tokens, there is no problem in them overlapping. And since they are kept in a separate file, our original file is not affected at all, and we can have multiple, independent stand-off annotations, which can even contradict each other.
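
To make the idea concrete, here is a minimal Python sketch of the two names stored as lists of token ids. The dictionary and function names are hypothetical and are not part of TEITOK; the sketch only shows that annotations which point at tokens can share a token without any conflict.

```python
# Token ids and word forms for the example fragment (w-1 .. w-6).
tokens = {
    "w-1": "In", "w-2": "the", "w-3": "National",
    "w-4": "Bank", "w-5": "of", "w-6": "Scotland",
}

# Hypothetical in-memory form of two stand-off name annotations:
# each annotation is just the list of token ids it refers to.
names = [
    ["w-3", "w-4"],          # National Bank
    ["w-4", "w-5", "w-6"],   # Bank of Scotland
]

def text_of(annotation):
    """Join the word forms of the tokens an annotation points to."""
    return " ".join(tokens[tid] for tid in annotation)

def shared_tokens(a, b):
    """Return the token ids that two annotations have in common."""
    return [tid for tid in a if tid in b]

for name in names:
    print(text_of(name))
print(shared_tokens(names[0], names[1]))
```

Both names include w-4, which would require overlapping tags inline but is unproblematic when each name merely refers to token ids.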

But the fact that stand-off files are kept separately means that the normal way of editing does not work: our FILE001.xml does not even know there is a names annotation for it. Instead, there is a dedicated module for stand-off annotation, which you call by referring both to the file you want to annotate and to the type of annotation you want to edit or view: index.php?action=annotation&cid=FILE001.xml&annotation=names. This associates xmlfiles/FILE001.xml with Annotations/names_FILE001.xml, where our stand-off annotations for names are stored. In this view, we can see the annotations, and edit them if we are logged in as an editing user.
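
The association between a corpus file, an annotation type, and the stand-off file can be sketched in Python as follows. The two helper functions are hypothetical; the URL parameters and the file naming pattern are taken from the description above, and the action name is assumed to be annotation.

```python
from urllib.parse import urlencode

def annotation_url(cid, annotation):
    """Build the URL of the stand-off module for one file and one annotation type."""
    # Assumption: the module is called with action=annotation.
    query = urlencode({"action": "annotation", "cid": cid, "annotation": annotation})
    return "index.php?" + query

def standoff_path(cid, annotation):
    """Name of the stand-off file that holds this annotation type for this file."""
    return f"Annotations/{annotation}_{cid}"

print(annotation_url("FILE001.xml", "names"))
print(standoff_path("FILE001.xml", "names"))
```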

In order to work with our stand-off annotation, we need to define what we want to annotate. We do this in a file Annotations/names_def.xml, which contains the definitions for our names annotation. In the example below, our annotation will be called "Named Entities", and allow us to assign two things to each name: a UID, and a type of name, where the latter lets us choose between either a person or a company.

    <interpGrp id="names" name="Named Entities">
        <interp key="uid" display="Identifier" long="Unique Identifier"/>
        <interp key="type" display="Type" long="Type of Name" colored="1" title="1">
            <option value="NP" display="Person"/>
            <option value="CP" display="Company"/>
        </interp>
    </interpGrp>

With this, the interface will allow us to click on a selection of tokens, and a pop-up will appear that lets us type in a UID and select a Type. We furthermore specify that our types are the relevant subclassification, so the interface will display a button for each type; clicking it will highlight all the person names, or all the company names, in the text. If we mark National Bank as a name, our names_FILE001.xml will contain the following:

    	<span corresp="#w-3 #w-4" uid="CN001" type="Company" id="an-1">National Bank</span>
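
As an illustration of how such a span links back to the text, the following Python sketch reads the span with the standard library and splits its corresp attribute into token ids. The helper name token_ids is hypothetical; only the span itself comes from the example above.

```python
import xml.etree.ElementTree as ET

# The span from the example, as it would appear in names_FILE001.xml.
span_xml = (
    '<span corresp="#w-3 #w-4" uid="CN001" type="Company" id="an-1">'
    "National Bank</span>"
)

def token_ids(span):
    """Split corresp into token ids, dropping the leading '#' of each pointer."""
    return [ref.lstrip("#") for ref in span.get("corresp", "").split()]

span = ET.fromstring(span_xml)
print(token_ids(span))
print(span.get("type"), span.get("uid"), span.text)
```

The text content of the span is only a readable copy; the link to the corpus file is carried entirely by the token ids in corresp.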

In order to make it easier to work with annotations, we need to furthermore tell the system about it, which we do in the settings. So for our names, we can add the following section to our settings.xml:

        <item key="names" type="standoff" display="Named Entities" admin="1"/>

This will create a link at the bottom of each XML file to jump to our names annotation, either by creating a new file, or by reading the existing annotation.

Stand-Off in CQP

Stand-off annotations can be exported to CQP by defining which annotations to export in the cqp definitions. Due to the nature of CQP, not all annotations can be fully exported. In order to export our entire annotation, we would add the settings below, which will export our names as name sattributes, with a drop-down for the type, just as for other sattributes in TEITOK/CQP:

        <item key="names" filename="names" display="Named Entities">
            <item key="uid" display="Unique Identifier"/>
            <item key="type" display="Type of Name" type="select" translate="1"/>
        </item>

With that we can search for [word="Bank"] :: match.name_type="Company"; to look for all occurrences of Bank inside a company name. The resulting corpus for our example fragment would hence have a VRT representation that looks as follows:

    w-1	In
    w-2	the
    <name type="Company" uid="CN001">
    w-3	National
    w-4	Bank
    </name>
    w-5	of
    w-6	Scotland

But stand-off annotations form one of the main reasons why TEITOK does not use a VRT format: there would be no way to export our overlapping names using VRT, since it would be impossible to say which name we are closing with a </name>. Instead, tt-cwb-encode writes CQP files directly, and generates overlapping sattributes if our annotations contain them. It is important to note that any sattribute overlapping with an existing one is completely ignored by CQP. And tt-cqp, which does allow overlapping sattributes, still does not fully support them: match.name_uid does not refer to a unique value if we have overlapping sattributes, and CQL is not well equipped to handle that.
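
The effect of ignoring overlapping sattributes can be imitated with a small Python sketch. This is not how CQP is implemented; it only assumes that regions are inclusive (start, end) pairs of token numbers and that "existing" means a region that was kept earlier in the list.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) regions share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def keep_non_overlapping(regions):
    """Keep each region only if it does not overlap a region kept before it."""
    kept = []
    for region in regions:
        if not any(overlaps(region, earlier) for earlier in kept):
            kept.append(region)
    return kept

# Token numbers from the example: National Bank = 3-4, Bank of Scotland = 4-6.
names = [(3, 4), (4, 6)]
print(overlaps(names[0], names[1]))
print(keep_non_overlapping(names))
```

Under these assumptions only National Bank survives, so a query restricted to names would not find of or Scotland inside a name.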

Another thing is that in TEITOK, annotations can be discontinuous: corresp="#w-34 #w-41" defines an annotation over two words with several intervening words, which is useful, for instance, for annotating split verbs in German. But sattributes in CQP cannot be discontinuous: they only have a beginning and an end. That is why, in the CQP corpus, our split annotation will fill up the entire segment 34-41.
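
The flattening of a discontinuous annotation into a single region can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes, for simplicity, that token ids have the form w-N with N reflecting the position of the token in the text.

```python
def token_number(ref):
    """Turn a pointer such as '#w-34' into the integer 34."""
    return int(ref.rsplit("-", 1)[1])

def cqp_region(corresp):
    """Return the (first, last) token numbers covered by an annotation."""
    numbers = [token_number(ref) for ref in corresp.split()]
    return min(numbers), max(numbers)

def is_discontinuous(corresp):
    """True if there is a gap between consecutive tokens of the annotation."""
    numbers = sorted(token_number(ref) for ref in corresp.split())
    return any(b - a > 1 for a, b in zip(numbers, numbers[1:]))

split_verb = "#w-34 #w-41"
print(is_discontinuous(split_verb))
print(cqp_region(split_verb))
```

The region includes the intervening tokens 35 to 40 as well, even though the annotation itself only refers to the first and the last token.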
