TEITOK - The Tokenized TEI Environment

TEITOK is a web-based system for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation. For visitors, the system provides a graphical user interface in which te annotated document can be visualized in a number of different ways, depending on what the user is interested in. And for administrators of the corpus, TEITOK uses the same interface to allow easy editing of the underlying XML document, meaning administrators can correct their corpus while they are consulting it.

TEITOK has a modular design, in which many different modules can be used on the same XML files. These modules provide useful interfaces for a wide range of different types of corpora, including:

Historical Corpora: For historical corpora, TEITOK provides the option to have an alignment between the transcription and the facsimile image, it provides the option to work with multiple orthographic realizations to combine several editions of a text into a single XML file, and it provides the option to create a searchable document map to see where in the world several phenomena are more frequent.

CQL Corpus Search

The rich XML format used in TEITOK is hard to search through. For easier access, all corpora are therefore indexed using the Corpus WorkBench (CWB), allowing texts to be search efficiently, and with the rich query language that CWB provides. Words are indexed in the CWB with various orthographic forms, providing many ways to search through the data.

The type of corpora that TEITOK is meant for are very labour-intensive: for ancient texts, hardly any of the data will be available in digital format, and have to be scanned. In many cases, OCR will not work and even for human readers the texts are often very hard to read. And the data will display a lot of orthographic variation in which a lot of the linguistic annotation, including normalization, will have to be done by hand. As a result, most corpora created with TEITOK will have a limited size, and searching for linguistic properties in them will not yield a lot of results. Therefore, TEITOK offers the option to index the corpus in a central database, which can be searched via this site. Each search result will only display the direct context of the word, and will link directly to the word in the original text on the site of the project it originated from. This way, it is possible to search through multiple corpora at the same time, and get access to the full original data in a way that prominently features the original project.

TEITOK is freely available for anybody who wishes to create richly annotated textual corpora, and runs on any LINUX based web server. See obtaining TEITOK.