Powered by <TEI:TOK>
Maarten Janssen, 2014-

TEITOK Help Pages

TEITOK vs TEI/XML

TEITOK is intended as a corpus environment based on TEI/XML files. Nevertheless, TEITOK files deviate from standard TEI/XML P5 files in several ways. These deviations have a variety of sources, and this page lists the main ones. The differences are small and localized, however, and the export module contains a script to convert the TEITOK version into proper TEI P5.

Because TEITOK can deviate from TEI/XML, it does not use the TEI/XML DTD. In fact, since there is no need to do so, it does not do any DTD checking at all, leaving it up to the user to make sure the files are "proper" TEI. The only real requirement is that the files are well-formed XML, with the strong recommendation to stay as close to the TEI/XML standard as possible. But TEITOK does not, strictly speaking, enforce any standard.

Since TEITOK is a corpus environment, it focuses on only two parts of the TEI files: the <teiHeader> for the metadata, and the <text> for the content. Other elements, such as <sourceDoc> and <facsimile>, are ignored in most of the system.

Tokenization and Corpus

The initial idea behind TEITOK is that it turns TEI/XML documents into a searchable corpus. For this, it adds inline tokenisation to the TEI/XML document. TEI has a tokenisation element, <w>, but TEITOK uses the element <tok> instead. There are two main motivations for this. The first is that tokens are not words: punctuation marks are tokens as well, whereas they are <c> in TEI. Yet spaces are also <c>, but they are not tokens. So rather than abusing <w>, TEITOK uses a different label. The second reason is that <tok> can be used for any kind of linguistic annotation, so the attributes on <tok> are where most of the deviation from TEI takes place.
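To illustrate the distinction between tokens and words, here is a minimal Python sketch of inline tokenisation in which words and punctuation marks both become <tok> elements while whitespace stays outside them. The regular expression and the function name `tokenize` are assumptions made for this example; TEITOK's own tokeniser is more elaborate and also has to work around existing XML markup.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: a run of word characters, or one non-space symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> str:
    """Wrap words and punctuation marks in <tok>; leave whitespace untouched."""
    out = []
    pos = 0
    for match in TOKEN.finditer(text):
        out.append(escape(text[pos:match.start()]))  # whitespace between tokens
        out.append(f"<tok>{escape(match.group())}</tok>")
        pos = match.end()
    out.append(escape(text[pos:]))  # trailing whitespace, if any
    return "".join(out)

print(tokenize("An errror, again."))
# <tok>An</tok> <tok>errror</tok><tok>,</tok> <tok>again</tok><tok>.</tok>
```

Note that the comma and the full stop each get their own <tok>, while the spaces do not.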

TEI/XML documents are seen as documents in their own right, which can combine various versions of a text in a single document. Since TEITOK documents are used for a corpus, this is not the case: there should be a single version of the text that is the raw document, and everything else is annotation on top of that raw document. The raw document is not (typically) a TEI document in its origin, but either a text on paper transcribed in TEI, or a document in any format converted to TEI. So elements like <choice> or <app> should not be used in TEITOK documents, since they would lead to inconsistent corpora. The raw document should be the inner text of the <text> element, and no annotation should be a text node. A normalisation in standard TEI/XML would look as follows:

<choice>
	<orig><w>errror</w></orig>
	<reg><w>error</w></reg>
</choice>

This goes against the idea behind TEITOK in several ways. Firstly, TEITOK files should not use <w>. Secondly, the two versions in this choice element do not have the same status: the first is part of the original document, the second is annotation, and should hence not be a text node. And thirdly, by having two versions, the corpus would inadvertently end up with the sequence errror error, since there are two different tokens in the choice element. TEITOK instead uses a single token with an attribute that holds the normalised form as annotation:

<tok reg="error">errror</tok>
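The mapping between the two encodings can be sketched in Python as follows. This is only an illustration using the standard library's ElementTree, not part of TEITOK itself: the function name `choice_to_tok` is hypothetical, and the sketch assumes a namespace-free document in which each <choice> holds exactly one original and one regularised form.

```python
import xml.etree.ElementTree as ET

def choice_to_tok(choice: ET.Element) -> ET.Element:
    """Build a <tok> from a <choice> that holds an <orig> and a <reg>."""
    # The original form stays as text; itertext() also reads through <w>.
    original = "".join(choice.find("orig").itertext()).strip()
    # The regularised form is annotation, so it moves into an attribute.
    regularised = "".join(choice.find("reg").itertext()).strip()
    tok = ET.Element("tok", {"reg": regularised})
    tok.text = original
    return tok

tei = "<choice><orig><w>errror</w></orig><reg><w>error</w></reg></choice>"
tok = choice_to_tok(ET.fromstring(tei))
print(ET.tostring(tok, encoding="unicode"))
# <tok reg="error">errror</tok>
```

After the conversion the text content of the document contains errror only once, which is what the corpus needs.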

Tokenisation has various secondary effects. For instance, <lb n="false"/> is not needed in TEITOK: whether a line break splits a word is determined by whether the <lb/> is inside a <tok> or not, and does not need to be stated explicitly.
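As a rough illustration of how the position of a line break carries this information, the following Python sketch counts the <lb/> elements that fall inside a <tok> and those that do not. The function name and the sample fragment are invented for the example, and the sketch assumes that <tok> elements are never nested inside each other.

```python
import xml.etree.ElementTree as ET

def classify_line_breaks(root: ET.Element) -> dict:
    """Count <lb/> elements inside a <tok> versus outside any <tok>."""
    # A line break inside a token splits a word across two lines.
    inside = sum(1 for tok in root.iter("tok") for _ in tok.iter("lb"))
    total = sum(1 for _ in root.iter("lb"))
    return {"word_internal": inside, "between_words": total - inside}

sample = "<p><tok>exam<lb/>ple</tok> <lb/><tok>text</tok></p>"
print(classify_line_breaks(ET.fromstring(sample)))
# {'word_internal': 1, 'between_words': 1}
```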

Additional Attributes

TEITOK uses some proprietary elements and attributes. Most of the uncommon attributes are on the <tok> elements, but there are some attributes that can be used on any node in the text. Since TEITOK does not use namespaces, these attributes are written as-is, without a prefix, but they all belong to the tt namespace.

  • For audio transcriptions, @start and @end mark the start and end time, in seconds, of a textual element in the sound file associated with the text, as indicated in the <media> element in the <recording>.
  • For facsimile transcriptions, @bbox marks the region of the facsimile image that an element is associated with, in the "x1 y1 x2 y2" format used by hOCR. The facsimile image itself is supposed to be indicated in the @facs attribute of the preceding <pb/>. Any element with @bbox coordinates can easily be converted to its TEI equivalent, which would be a <surface> node.
  • For files that are part of a TEITOK apparatus system, any node can have an @appid to link it to the same element in different witnesses.
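The conversion of @bbox coordinates to a TEI node could look roughly like the Python sketch below. It is an assumption of this example, not a description of the TEITOK export script, that the four numbers map onto the TEI coordinate attributes @ulx, @uly, @lrx and @lry in that order; the function name `bbox_to_surface` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

def bbox_to_surface(elem: ET.Element) -> ET.Element:
    """Turn a @bbox of the form "x1 y1 x2 y2" into a <surface> element."""
    # Unpacking raises ValueError unless there are exactly four values.
    x1, y1, x2, y2 = elem.attrib["bbox"].split()
    # Assumed mapping: upper-left corner first, lower-right corner second.
    return ET.Element("surface", {"ulx": x1, "uly": y1, "lrx": x2, "lry": y2})

tok = ET.fromstring('<tok bbox="10 20 110 45">word</tok>')
surface = bbox_to_surface(tok)
print(surface.attrib)
```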

Unsupported Features

TEITOK parses each XML file individually and does not load anything apart from the XML file itself. This means that XML mechanisms for using external files, such as XInclude or XLink, typically have no effect. The same holds for any ID-based references to external files such as <placeName ref="places.xml#LOND1">London</placeName>, although there is partial support for that format that allows you to follow such links in the CQP corpus: in the CQP export, you can use the link above to get nodes under the node with id LOND1 in the file places.xml, provided that places.xml is located in the Resources folder.
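The general idea of following such a reference can be sketched in Python as below. This is not how the CQP export is implemented; it only shows the lookup. The content of places.xml, the `resources` dictionary standing in for the Resources folder, and the function name `resolve_ref` are all invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resolve_ref(ref: str, resources: dict) -> Optional[ET.Element]:
    """Follow a "file.xml#ID" reference into a set of already parsed files."""
    filename, _, node_id = ref.partition("#")
    root = resources.get(filename)
    if root is None:
        return None  # the referenced file is not among the resources
    for node in root.iter():
        # Accept both a plain @id and an @xml:id.
        if node_id in (node.get("id"), node.get(XML_ID)):
            return node
    return None

# Hypothetical content for places.xml.
places = ET.fromstring(
    '<listPlace><place id="LOND1"><placeName>London</placeName>'
    "<country>United Kingdom</country></place></listPlace>"
)
target = resolve_ref("places.xml#LOND1", {"places.xml": places})
print(target.findtext("country"))
# United Kingdom
```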

Metadata

In the corpus, the TEI header is used for the metadata. TEITOK uses XPath to define which parts of the header to use, and exports the result of each XPath query to the searchable corpus. XPath is not only used to export data to the corpus, but also to make header content easily editable. So all relevant data must be uniquely addressable by an XPath query.
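The principle of addressing each metadata field with one XPath query can be sketched in Python as follows. The field names, the paths and the sample header are assumptions made for the example and do not reflect any actual TEITOK configuration; ElementTree also supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical field definitions: one XPath query per searchable field.
FIELDS = {
    "title": "./teiHeader/fileDesc/titleStmt/title",
    "author": "./teiHeader/fileDesc/titleStmt/author",
    "date": "./teiHeader/profileDesc/creation/date",
}

def extract_metadata(root: ET.Element, fields: dict = FIELDS) -> dict:
    """Return one string per field; empty if the node is missing."""
    return {
        name: (root.findtext(xpath) or "").strip()
        for name, xpath in fields.items()
    }

sample = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Letter to a friend</title>
        <author>Unknown</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><tok>Dear</tok> <tok>friend</tok></text>
</TEI>
"""
print(extract_metadata(ET.fromstring(sample)))
# {'title': 'Letter to a friend', 'author': 'Unknown', 'date': ''}
```

Because each query returns a single node, the same path can be used both to read the value for export and to locate the node when the header is edited.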

Firstly, all content that is meant to become searchable should be available as a single string. This means that, on the one hand, string values cannot be distributed over XML nodes - if we want the full name of a person to be searchable, it cannot be encoded as <personName><firstName>John</firstName><lastName>Smith</lastName></personName>. It should either be simplified to <personName>John Smith</personName>, or the full name should be added as an attribute: <personName full="John Smith"><firstName>John</firstName><lastName>Smith</lastName></personName>.

And on the other hand, it means that searchable content should not be part of a larger text, for instance in a <p> - it is recommended to avoid <p> altogether in the teiHeader, even where TEI prescribes it.
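The option of adding the full name as an attribute, mentioned above, can be sketched in Python as follows. The function name `add_full_attribute` is hypothetical, and joining the parts with a single space is an assumption that will not suit every naming convention.

```python
import xml.etree.ElementTree as ET

def add_full_attribute(elem: ET.Element) -> ET.Element:
    """Store the text of all descendants as one string in @full."""
    parts = [text.strip() for text in elem.itertext() if text.strip()]
    elem.set("full", " ".join(parts))
    return elem

name = ET.fromstring(
    "<personName><firstName>John</firstName>"
    "<lastName>Smith</lastName></personName>"
)
add_full_attribute(name)
print(name.get("full"))
# John Smith
```

The child elements are left in place, so the structured encoding is kept while the searchable value becomes a single string.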

Secondly, lists should be avoided for searchable content. If there are various people (say speakers) involved in a document, they can be in various <person> fields in the same <listPerson>, but each <person> should have a uniquely identifying attribute, which should be consistent throughout the corpus.
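A short Python sketch shows why the identifying attribute matters: with it, the same query returns exactly one node in every file. The attribute @role, its values and the names in the sample are assumptions chosen for this illustration only.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def person_by_role(list_person: ET.Element, role: str) -> Optional[str]:
    """Return the text of the one <person> carrying the given @role."""
    # The attribute predicate selects a single node from the list.
    return list_person.findtext(f"./person[@role='{role}']")

sample = ET.fromstring(
    "<listPerson>"
    '<person role="interviewer">Ana Costa</person>'
    '<person role="informant">Rui Lopes</person>'
    "</listPerson>"
)
print(person_by_role(sample, "informant"))
# Rui Lopes
```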

And thirdly, the role of a searchable node should always be specified by its ancestors; this mostly affects the <respStmt>, where the name of the person and what that person was responsible for are sibling nodes. In TEITOK, this is not recommended. So rather than the standard

<respStmt>
	<resp>Description</resp>
	<persName>Maarten Janssen</persName>
</respStmt>

in TEITOK it is recommended to use

<respStmt>
	<resp n="Description">Maarten Janssen</resp>
</respStmt>
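Rewriting the standard form into the recommended one is mechanical, as the following Python sketch suggests. It is an illustration only: the function name `flatten_resp_stmt` is hypothetical, and the sketch assumes one <resp> and one <persName> per <respStmt>.

```python
import xml.etree.ElementTree as ET

def flatten_resp_stmt(resp_stmt: ET.Element) -> ET.Element:
    """Move the responsibility into @n and the person's name into <resp>."""
    resp = resp_stmt.find("resp")
    pers = resp_stmt.find("persName")
    if resp is None or pers is None:
        return resp_stmt  # nothing to rewrite
    resp.set("n", (resp.text or "").strip())
    resp.text = "".join(pers.itertext()).strip()
    resp_stmt.remove(pers)
    return resp_stmt

standard = ET.fromstring(
    "<respStmt><resp>Description</resp>"
    "<persName>Maarten Janssen</persName></respStmt>"
)
print(ET.tostring(flatten_resp_stmt(standard), encoding="unicode"))
# <respStmt><resp n="Description">Maarten Janssen</resp></respStmt>
```

After the rewrite, a query such as ./respStmt/resp[@n='Description'] addresses the name directly.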

On top of these differences, TEITOK is often used for corpora with needs not (yet) foreseen in TEI, for instance learner corpora. For those needs, TEITOK provides some non-standard solutions that are used in existing TEITOK corpora, such as a <taskDesc> for the description of the task the author was given for the text.

The recommended metadata structure resulting from this can be found in the metadata description.

