TEITOK Help Pages
TEITOK has a modular set-up in which you can develop new functionality for a project by simply creating a PHP file in the Sources folder of the project that puts all its output into a variable called $maintext; existing functions can be overridden in the same fashion. As such, there is no definitive list of all functions of TEITOK. However, below is the current list of functions included in the common folder of the TEITOK distribution. More functions are constantly being added.
View XML files
For information that should not be distributed with the XML files themselves, or for annotations
that can cross the base TEI tags, TEITOK provides a module to manage stand-off annotation files.
The annotation section of the admin settings should define which
stand-off annotations there are, and each annotation type xxx should have a file xxx_def.xml in the Annotations
folder that defines the fields (and possible values) of the annotation. The annotation module then allows users to
view the annotations, and editors to create new annotations or correct existing ones.
For longer XML files, TEITOK should be set to display the XML file one page at a time (since otherwise the
browser is likely to crash or become very slow). The page index module shows a list of all the pages in an XML file,
allowing you to quickly jump to a specific page. It can also display other indexes alongside the page index when
these are marked in the XML file as milestone elements, for instance books, chapters, etc.
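As a rough illustration of what building such an index involves, the sketch below collects the pb elements and the milestone elements of a small TEI fragment in document order. The sample XML, the function names list_pages and list_milestones, and the use of the n attribute as a label are assumptions made for this example; this is not TEITOK's own implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real files are much longer.
SAMPLE = """<TEI><text><body>
<milestone unit="chapter" n="1"/>
<pb n="1r"/><p>First page.</p>
<pb n="1v"/><p>Second page.</p>
<milestone unit="chapter" n="2"/>
<pb n="2r"/><p>Third page.</p>
</body></text></TEI>"""


def list_pages(xml_text):
    """Return (position, label) pairs for every pb element, in document order."""
    root = ET.fromstring(xml_text)
    return [(i, pb.get("n", "")) for i, pb in enumerate(root.iter("pb"), start=1)]


def list_milestones(xml_text, unit):
    """Return the labels of all milestone elements of the given unit."""
    root = ET.fromstring(xml_text)
    return [m.get("n", "") for m in root.iter("milestone") if m.get("unit") == unit]


if __name__ == "__main__":
    print(list_pages(SAMPLE))
    print(list_milestones(SAMPLE, "chapter"))
```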
TEITOK comes with a built-in XML-based POS tagger called Neotag, initially built for the CorpusWiki project. The tagger itself is a Unix program that can be called from within the interface. Neotag is easy to train, and training can be done on the corpus itself, either for languages for which no taggers exist, or for historical or dialectal variants of a language. The Neotag module allows you to explore the settings, search the parameter files for the tagger, check whether the tagger matches the tagset, etc.
TEITOK can display spoken files that are time-aligned to an audio file, and allows you to listen to specific utterances in the file. In case there is an audio file but no alignment, the alignment function allows you to produce a coarse audio alignment, which is often good enough for the purpose of listening to bits of the text. The alignment function also allows you to cut off irrelevant parts at the beginning or the end of the audio file while keeping the audio alignment intact.
Interlinear Glossed Text
Tokens in TEITOK XML files can contain morph nodes, specifying the morphological decomposition of the word. Morphemes
are shown in the normal file view, but can also be shown as Interlinear Glossed Text.
Manuscript line view
TEITOK transcriptions can be linked to their facsimile images. This can be done on any level, including paragraphs and words,
but the basic level is the manuscript line; when lb elements in the XML files contain a @bbox attribute describing which
part of the facsimile image (defined in the pb above the lb) that lb corresponds to (the bounding box), the line view mode allows you to
display the page line by line, with the transcription of each line directly below the part of the image it belongs to. This
is not only visually attractive, but also helps in detecting potential transcription errors to quickly improve the text.
When the transcription is created by an OCR (or HTR) tool, TEITOK can convert the hOCR format to TEI/XML, maintaining
the @bbox attributes (which is also where their name comes from).
When @bbox attributes are also given for tokens, the token edit window will display the word in the upper right corner,
making TEITOK a de facto OCR post-editing tool.
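To make the bounding-box idea concrete, here is a minimal sketch that pairs each lb with the image of the preceding pb and turns the box into a crop rectangle. It assumes that @bbox holds four space-separated integers "x0 y0 x1 y1" as in hOCR and that the pb carries the image name in a facs attribute; the sample XML and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; attribute layout is an assumption (see lead-in).
SAMPLE = """<text>
<pb facs="page1.jpg"/>
<lb bbox="100 200 900 260"/>First line
<lb bbox="100 270 900 330"/>Second line
</text>"""


def parse_bbox(value):
    """Split an 'x0 y0 x1 y1' string into four integers."""
    x0, y0, x1, y1 = (int(part) for part in value.split())
    return x0, y0, x1, y1


def line_regions(xml_text):
    """Return one crop rectangle per lb that has a bbox, with its line text."""
    root = ET.fromstring(xml_text)
    current_image = None
    regions = []
    for elem in root.iter():  # document order
        if elem.tag == "pb":
            current_image = elem.get("facs")
        elif elem.tag == "lb" and elem.get("bbox"):
            x0, y0, x1, y1 = parse_bbox(elem.get("bbox"))
            regions.append({
                "image": current_image,
                "left": x0,
                "top": y0,
                "width": x1 - x0,
                "height": y1 - y0,
                # In this flat sample the line text follows the empty lb element.
                "text": (elem.tail or "").strip(),
            })
    return regions


if __name__ == "__main__":
    for region in line_regions(SAMPLE):
        print(region)
```

Each rectangle could then be used to cut the matching strip out of the page image and show it above its transcription.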
Create new XML file
This module provides various ways to create a new XML file for your corpus; it allows you to define the metadata from the start, and allows you either to paste an existing XML file or to use a WYSIWYG editor to create a TEI document. Given the nature of this editor, it can only be used for new XML files, since there is too much information in existing XML files that could be removed or modified by it. However, it even allows you to paste RTF or Word documents, keeping paragraphs and basic formatting such as italics and bold.
Edit XML files
TEITOK allows you to define a verticalized view for a specific text, allowing you to quickly correct all or a selection
of the tokens in the file.
PDF to TEI
This module allows you to take a PDF file and convert it to a TEITOK/XML document. This process will create an image for each page, and either generate a temporary file for page-by-page transcription, or run the images through the Tesseract OCR program (when installed) and convert the resulting hOCR file to TEI/XML, keeping the bounding boxes.
Page-by-Page Facsimile Transcription
TEITOK has a graphical interface for transcribing facsimile images page by page. Since pages are not elements in TEI (there are only page beginning tags), this transcription uses a temporary format to facilitate the transcription. Once the transcription is completed, the temporary file is converted to a TEI/XML file.
Whenever changes are made to XML files within the TEITOK interface, a backup is created before saving (one backup for
each day). When things go wrong, these backups can then be retrieved to recover a previous version of the file.
Corpus Search and Edit
One of the core functions of TEITOK is to make your corpus searchable by creating an indexed corpus from the XML files using the Corpus Workbench/Corpus Query Processor (CQP). This CQP corpus can then be searched in the TEITOK interface. TEITOK uses a custom-built program (tt-cwb-encode) to create the CQP corpus files from the XML files, which can also incorporate stand-off annotation files. In the creation of the CQP corpus, this program keeps track of the byte offset of each token in its corresponding XML file, and rather than displaying the CQP results directly, the result list shows a list of XML fragments.
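The byte-offset idea can be sketched as follows: record where each token starts in the raw bytes of the file, then use that offset later to pull the XML fragment straight out of the file. This is only an illustration in Python, not how tt-cwb-encode is written; the tok element name, the sample text, and the function names are assumptions.

```python
import re

# Match the start of a token element in the raw bytes of the file.
TOK_START = re.compile(rb"<tok\b")

# Hypothetical sample; note the non-ASCII character, which takes two bytes in UTF-8.
SAMPLE = '<s><tok id="w-1">Olá</tok> <tok id="w-2">mundo</tok></s>'.encode("utf-8")


def token_offsets(xml_bytes):
    """Return the byte offset at which each token element starts."""
    return [match.start() for match in TOK_START.finditer(xml_bytes)]


def fragment_at(xml_bytes, offset):
    """Return the token element starting at the given byte offset."""
    end = xml_bytes.index(b"</tok>", offset) + len(b"</tok>")
    return xml_bytes[offset:end].decode("utf-8")


if __name__ == "__main__":
    for offset in token_offsets(SAMPLE):
        print(offset, fragment_at(SAMPLE, offset))
```

Offsets are counted in bytes rather than characters, so they stay valid when the file is read in binary mode regardless of how many multi-byte characters precede the token.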
TEITOK makes it easy to edit XML files by allowing admin users to just click on a word and correct any wrong annotation it contains. However, for larger editing sessions, going through each word one by one is not very efficient. Therefore, TEITOK allows you to edit using CQP: CQP can be used to search for a specific set of data, say all occurrences of five, and the multi-token edit then allows you to change all results in one go (restricted to batches of 500 items at a time due to the limitations of HTML), say by changing the POS tag for all of them to number (MD in EAGLES) in case the tagger often got it wrong. It is also possible to produce the results as a list to quickly edit them by hand, where a regular expression can be used to pre-modify all results.
For corpora that have dependency relations, and have them correctly exported to the CQP corpus, this module allows you to search the corpus using a combination of the CQP language and Tiger search, allowing you to look for, say, any noun dominating a preposition.
Highlight CQP results
This module allows you to use CQP searches and highlight the matching results directly in an XML file. This is not meant
as a serious search method, but more for didactic purposes, highlighting, say, all the nouns in a text.
Raw Text Search
Searching in TEITOK in principle uses CQP. However, since not everything is exported to the CQP corpus, this makes it difficult to search for specific types of information. The raw search allows you to search in a simple string-based manner, and shows which of the XML files contain the search string.
General project setup
The settings of TEITOK are stored in an XML file settings.xml. However, given how much can be defined in TEITOK, there is a settings function that gives you an overview of the current definitions in your project, as well as a description of all the other things that can be defined in the settings.
HTML Page Editor
In order to adorn your project with static web pages, TEITOK comes with a WYSIWYG editor to add and edit HTML pages describing your project, the team, the history, etc.
TEITOK allows the user to choose which language(s) should be used for the interface; for the bulk of the interface elements, the translations should be given in a common PHP module. However, each project can introduce interface elements, say in the metadata header, or even display data from the corpus itself that are in need of localization. The translation for those elements can be given in a tab-separated text file in the Resources folder. Since it is hard to track down all the elements that need to be translated, the interface keeps track of all the elements it tried to translate, but for which it could not find the translation (at least when you are logged in). The internationalization module then helps you to quickly provide a translation for all the missing words or phrases. The admin interface elements are not translated (in principle) but always show in English.
TEI files should ideally be self-contained, that is to say, all the information relevant to the file should be encoded in the teiHeader, including many data that will be the same for each file in the project, such as the publisher of the files, the channel of the files (written/spoken), the language of the files, etc. Since these data are typically the same for all files, it is not useful to edit this information for each file by hand. The Template editor helps to set up an empty XML template that contains all those metadata, which can then be used when creating a new XML file. Projects that have several different types of XML files can have a different template for each of those types.
This module allows you to upload files to your project. Which files can be uploaded, and which folder they will be uploaded into is defined in the settings.
For projects that contain a position-based POS tag annotation, TEITOK can help to make the tags more readable; it does this by using a definition file tagset.xml that explains all the different positions in the tagset. The tagset module then displays this tagset as a table to the user. The tagset module can also be used to describe a specific tag, and to check whether all the tags used in the CQP corpus are valid according to the tagset. The tagset.xml file is furthermore used in various places to display an explanation of the tag rather than the tag itself - for instance when hovering over a word, TEITOK will not show, say, NCMS, but rather Common Noun, masculine, singular.
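The way a position-based tag is expanded can be sketched with a small lookup table in which the first character selects the main category and each following character is interpreted according to its position. The TAGSET dictionary below is a hypothetical in-memory stand-in for the information in tagset.xml (whose actual layout is not shown here), and describe and is_valid are names chosen for this example.

```python
# Hypothetical stand-in for the contents of tagset.xml.
TAGSET = {
    "N": {
        "label": "Noun",
        "positions": [
            {"C": "Common", "P": "Proper"},        # position 2: type
            {"M": "masculine", "F": "feminine"},   # position 3: gender
            {"S": "singular", "P": "plural"},      # position 4: number
        ],
    },
}


def describe(tag):
    """Expand a positional tag into a readable description, or None if invalid."""
    entry = TAGSET.get(tag[:1])
    if entry is None or len(tag) - 1 != len(entry["positions"]):
        return None
    values = []
    for char, options in zip(tag[1:], entry["positions"]):
        if char not in options:
            return None
        values.append(options[char])
    head = f"{values[0]} {entry['label']}" if values else entry["label"]
    return ", ".join([head] + values[1:])


def is_valid(tag):
    """Check whether a tag can be fully interpreted with the tagset."""
    return describe(tag) is not None


if __name__ == "__main__":
    print(describe("NCMS"))
    print(is_valid("NCXS"))
```

With this table, describe("NCMS") yields "Common Noun, masculine, singular", while a tag with an unknown value in any position is reported as invalid.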
Metadata Header Editor
TEITOK does not display the raw teiHeader metadata, but rather displays a selection of those data in HTML (table) format. It uses the same sort of technique to make a selection of the teiHeader data editable in a simple HTML form. The definitions of which data to display or edit, and how to display those data, are given in a set of template files called teiHeader.tpl. The header editor helps to build and edit those template files.
CSV Metadata Edit
In principle, each XML file in TEITOK is edited independently. In order to nevertheless be able to edit a specific part of the metadata for all the files in the corpus, this module allows you to define an XPath definition of the part of the metadata you are interested in, and create a tab-separated text file with one line containing that field for each of the files. This text can then be edited in the interface, and the modified data can be loaded back into their respective XML files. This module can also be used to simply download the metadata CSV file to have a traditional-style spreadsheet containing all the metadata.
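The round trip described here (XPath out to a tab-separated table, edited value back into the XML) might look roughly like the sketch below. It works on an in-memory dictionary of file names and XML strings instead of real files, uses the limited XPath subset of Python's xml.etree, and all names and the sample data are assumptions for illustration only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical corpus: file name -> XML content.
FILES = {
    "a.xml": "<TEI><teiHeader><fileDesc><titleStmt><title>Letter A</title>"
             "</titleStmt></fileDesc></teiHeader></TEI>",
    "b.xml": "<TEI><teiHeader/></TEI>",
}
XPATH = ".//titleStmt/title"


def metadata_rows(files, xpath):
    """Return one (file name, value) row per file; missing fields give ''."""
    rows = []
    for name, xml_text in files.items():
        node = ET.fromstring(xml_text).find(xpath)
        value = (node.text or "").strip() if node is not None else ""
        rows.append((name, value))
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text, one line per file."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def apply_value(xml_text, xpath, value):
    """Write an edited value back into the node selected by the XPath."""
    root = ET.fromstring(xml_text)
    node = root.find(xpath)
    if node is not None:  # this sketch does not create missing nodes
        node.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_tsv(metadata_rows(FILES, XPATH)), end="")
    print(apply_value(FILES["a.xml"], XPATH, "Letter A (revised)"))
```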
Some corpora, especially paleographic corpora, will contain characters that are not easy to find on the keyboard. This module allows you to define simple sequences that will be automatically converted into those hard-to-type characters. It is also possible to define characters that will show in the raw manuscript view, but are automatically simplified in all the subsequent levels.
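A minimal sketch of both conversions follows: typed sequences are replaced by the intended characters, and special characters are simplified for later levels. The backslash-style sequences, the mapping tables, and the function names are invented for this example and do not reflect TEITOK's actual configuration format.

```python
import re

# Hypothetical input sequences and the characters they stand for.
SEQUENCES = {"\\ae": "æ", "\\th": "þ", "\\s": "ſ"}

# Hypothetical simplifications applied on levels after the raw manuscript view.
SIMPLIFY = {"ſ": "s"}


def build_converter(sequences):
    """Return a function that replaces every known sequence in a string."""
    # Try longer sequences first so that a short one never shadows a longer one.
    keys = sorted(sequences, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return lambda text: pattern.sub(lambda match: sequences[match.group(0)], text)


def simplify(text, table=SIMPLIFY):
    """Replace special characters by their simplified counterparts."""
    return text.translate(str.maketrans(table))


convert = build_converter(SEQUENCES)

if __name__ == "__main__":
    raw = convert(r"\thi\s")
    print(raw, simplify(raw))
```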
Admin users can use the user administration module to create new users for the project, and assign permissions to them.
When XML documents are provided with geocoordinate data, typically giving the place where the text was produced, TEITOK can map all documents onto Google Maps. For speed, this is not done directly from the XML files, but rather uses CQP to locate the geocoordinates. For this to work, an sattribute @geo has to be exported to CQP, and the geomap section of the admin settings should define how to display the mapping data.
PSD-X Syntactic Annotation
TEITOK includes a module to show, search, and even edit syntactic annotation. The format used for this is
PSDX, an XML version of the Penn Treebank file format. Trees can be displayed in a number of different formats
(SVG trees, graphs, tables, plain text), and annotation files can be queried using XPath queries.
TEITOK comes with a module that allows viewing and editing dependency trees. The dependency trees can be shown either as
graphs or as trees, and the edit module makes it easy to reattach nodes and edit attachment labels.
The dependency relations are modelled as features over the tokens directly in the XML files. The relations are defined
in a definition file called deptree.xml in the Resources folder.
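Storing dependencies as token features means a tree can be rebuilt by grouping tokens under their head, and reattaching a node amounts to changing two attributes. The sketch below assumes that each tok carries a head attribute (the id of its head token, absent for the root) and a deprel attribute for the label; these attribute names, the sample sentence, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence with dependency features on the tokens.
SAMPLE = """<s id="s-1">
<tok id="w-1" head="w-2" deprel="nsubj">She</tok>
<tok id="w-2" deprel="root">reads</tok>
<tok id="w-3" head="w-2" deprel="obj">books</tok>
</s>"""


def children_by_head(xml_text):
    """Group tokens as (id, label, form) under the id of their head."""
    tree = {}
    for tok in ET.fromstring(xml_text).iter("tok"):
        head = tok.get("head")  # None marks the root of the sentence
        tree.setdefault(head, []).append((tok.get("id"), tok.get("deprel"), tok.text))
    return tree


def reattach(xml_text, tok_id, new_head, new_label):
    """Return the XML with one token attached to a new head with a new label."""
    root = ET.fromstring(xml_text)
    for tok in root.iter("tok"):
        if tok.get("id") == tok_id:
            tok.set("head", new_head)
            tok.set("deprel", new_label)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(children_by_head(SAMPLE))
    print(children_by_head(reattach(SAMPLE, "w-3", "w-1", "nmod")))
```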
XDXF Dictionary Reader
One of the target uses of TEITOK is the construction of corpora for less-resourced languages. Such corpora are often accompanied by simple dictionaries for the target language. For this purpose, TEITOK contains a small dictionary module that can be used to browse, search, and display dictionary entries from a dictionary file. By default, TEITOK expects the dictionary to be in the standard format called XDXF. The XDXF section of the settings file defines the structure of the entries. The XDXF reader can also be used to edit entries and create new ones.
In many projects, there is a need to provide spreadsheet type data on the project website. In order to
make it easier to provide such data in a searchable way, TEITOK has a simple XML editor that can deal with
spreadsheet-like XML files. The XML reader section of the settings file
should define the columns of the XML file and how to display them. The reader also allows admin users to
add and edit records in the XML file.