Powered by <TEI:TOK>
Maarten Janssen, 2014-

TEITOK Help Pages

Setting up Corpus Search

To provide fast and powerful search options for the corpus, TEITOK creates an indexed corpus out of the XML files in TEI format. The corpus is (by default) created by a dedicated program called tt-cwb-encode that converts TEITOK style XML files into CWB style binary files. tt-cwb-encode is an application that has to be compiled on the local server, and can be found in the src section of the Git files. The executable is expected to be located in /usr/local/bin unless defined otherwise in the bin section of the settings.xml.


In the settings.xml file, you define which elements of your XML files should be used in the indexed corpus. This is done in the section <cqp> of the settings. These settings are used both in the interface to provide the user with information about how to query the corpus, and in the creation of the corpus itself. Below is an example of some typical CQP settings:
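As a rough illustration, a <cqp> section built only from the attributes described on this page might look like the sketch below. The corpus name, keys, display labels, and XPath expression are assumptions chosen for illustration, not TEITOK defaults.

```xml
<!-- Hypothetical sketch of a <cqp> section in settings.xml; all values are illustrative -->
<cqp corpus="TT-EXAMPLE" searchfolder="xmlfiles">
  <pattributes>
    <!-- attributes on <tok> exported as CQP positional attributes -->
    <item key="lemma" display="Lemma"/>
    <item key="pos" display="POS" long="Part-of-speech tag"/>
  </pattributes>
  <sattributes>
    <!-- the <text> level; the nested items become text_id and text_year -->
    <item key="text" display="Text">
      <item key="id" display="Text ID"/>
      <item key="year" display="Year" xpath="//sourceDesc//date/@n"/>
    </item>
  </sattributes>
</cqp>
```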

Global settings

The @corpus defines the name of the CQP corpus that is created. It is not visible to the user in any way, but each corpus needs a unique name in CQP.

The @searchfolder indicates the folder(s) containing the files that are to be used in the CQP corpus. Folders should be separated by spaces, and each folder will be read recursively. The default is xmlfiles, which means that all XML files located in the xmlfiles folder will be used for the CQP corpus.


The section <pattributes> defines which attributes on the <tok> elements (the tokens) are to be exported to the CQP corpus. Each attribute is defined by an <item> in the settings, where the @key is the name of the attribute on the <tok>. Each item can have a @display, which is the name of the attribute, and a @long, which is an (optional) long name in case the @display is an abbreviated name. When no @display is given, the display value from the <xmlfile> section is used. Two attributes are always exported, since they are required to make CQP work and to allow the results to be looked up in the XML files: @id, which is the id of the token, and @word, which is an obligatory field in CQP. Unless defined otherwise, @word exports the @form feature of the token.


The section <sattributes> defines which attributes outside the <tok> are to be exported to the CQP corpus. Each <item> describes the tag that is to be exported, in this example only <text>, which is the whole text of each XML file. Even when not defined, the <text> level will be exported, since it is required for looking up the results of the CQP query in the XML file.

Within each sattribute level, there is another <item>, where the @key is the name of the field to be used in CQP, so the @id inside the <text> will correspond to the CQP sattribute text_id.

The value of the field can be defined by a feature @xpath, which defines an XPath 1.0 query relative to the node. These are mostly useful for <text> level attributes, where they should always refer to the root, otherwise only information inside the <text> element will be returned, whereas most text-level information will be in the teiHeader. In the example, the value of text_year is given by the value of the attribute @n in the <date> of the <sourceDesc>, which is the numeric representation of the date of publication of the text.
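To make the root-relative lookup concrete, the sketch below pairs a possible teiHeader fragment with a matching settings item. The header content and the exact XPath are assumptions for illustration; only the @xpath mechanism and the text_year name come from the description above.

```xml
<!-- In the TEI file (hypothetical header content) -->
<sourceDesc>
  <bibl>
    <date n="1850">around 1850</date>
  </bibl>
</sourceDesc>

<!-- In settings.xml, nested inside the <text> item: the leading // starts the
     search from the document root, so the teiHeader is reachable even though
     the sattribute itself is the <text> element -->
<item key="year" display="Year" xpath="//sourceDesc//date/@n"/>
```

With these assumed values, text_year for this file would be 1850. An XPath without the leading //, such as sourceDesc//date/@n, would be evaluated relative to <text> and would find nothing in the header.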

When no xpath query is given, the value will be the value of the attribute on the XML element itself, so @rend on a <hi> would export the value of the attribute @rend, which in TEI is the rendition style (bold, italics, etc).
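A minimal sketch of such an attribute-based item, assuming <hi> is exported as its own sattribute level; the display labels are placeholders.

```xml
<sattributes>
  <!-- export <hi> regions; with no xpath given, the value of rend is read
       directly from the rend attribute on each <hi> element -->
  <item key="hi" display="Highlighting">
    <item key="rend" display="Rendition"/>
  </item>
</sattributes>
```

By analogy with text_id, the resulting CQP sattribute would presumably be named hi_rend.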

Each item can have a @display which is the name of the attribute, and a @long which is an (optional) long name.

The @type feature indicates how the attribute is displayed in the search form. Without a type, it is displayed as a text box. With the value range, it is displayed as two boxes for the lower and upper limits of the range to search in. The values kselect and select both display a pull-down list with all values taken directly from the CQP index. The difference between the two is that select displays the value of the field as is, whereas kselect treats the value as something to be internationalized: for a Spanish text with "es" as the value for lang, kselect will display a label such as Spanish or Español, as defined for that value in the i18n module.
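The four display options could be set side by side as in the sketch below. Only the type values (none, range, select, kselect) come from this page; the keys, labels, and XPath expressions are hypothetical.

```xml
<item key="text" display="Text">
  <!-- no type: plain text box -->
  <item key="title" display="Title" xpath="//titleStmt/title"/>
  <!-- range: two boxes, for the lower and upper limit -->
  <item key="year" display="Year" xpath="//sourceDesc//date/@n" type="range"/>
  <!-- select: pull-down showing the raw values from the CQP index -->
  <item key="genre" display="Genre" xpath="//textClass/keywords/term" type="select"/>
  <!-- kselect: pull-down whose values are shown via the i18n module -->
  <item key="lang" display="Language" xpath="//langUsage/language/@ident" type="kselect"/>
</item>
```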

It is possible to use XML references in XPath lookups by using an attribute @external. Say we have a file authors.xml which contains all our authors (expected to be located in Resources), and our individual XML files refer to it by, say, <author key="authors.xml#John"/>. Then we can define @external="//author", which will look up the author reference in the current XML file. It will then open the file authors.xml and look for an item with @id="John". The XPath will in that case be executed relative to the element with @id="John" in authors.xml.
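The lookup chain can be sketched as follows. The structure of authors.xml, the element names inside it, and the key and XPath of the settings item are assumptions; only @external="//author" and the id-based lookup follow the description above.

```xml
<!-- Resources/authors.xml (hypothetical structure) -->
<authors>
  <person id="John">
    <name>John Smith</name>
    <birth when="1820"/>
  </person>
</authors>

<!-- In settings.xml, nested inside the <text> item: @external locates the
     reference in the current file, and @xpath is then evaluated relative to
     the element with id="John" in authors.xml -->
<item key="birth" display="Author birth year" external="//author" xpath="birth/@when"/>
```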

Stand-off Annotations

CQP SAttributes can also be incorporated from stand-off annotation files. For stand-off annotations, the @key is the name of the attribute you want to use in CQP for the stand-off annotation, and @filename is the name of the file in Annotations where the annotations are stored.

For each stand-off annotation file, you can indicate which attributes on the annotation segment to export to CQP. This is done much like pattributes, with a list of <item> elements that name the attribute to be exported as the @key.
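A possible shape for such a definition is sketched below. The wrapper element name <annotations> is an assumption, since this page does not name the section; the key, filename, and nested attribute names are placeholders as well.

```xml
<!-- Assumption: the wrapper element name is not given on this page -->
<annotations>
  <!-- key: name of the attribute in CQP; filename: file in the Annotations folder -->
  <item key="errors" filename="errors.xml" display="Error annotation">
    <!-- attributes on each annotation segment to export -->
    <item key="type" display="Error type"/>
    <item key="correction" display="Correction"/>
  </item>
</annotations>
```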
