Powered by <TEI:TOK>
Maarten Janssen, 2014-

TEITOK Help Pages

Batch Edit

TEITOK is meant not only to distribute corpora, but also to maintain and edit them. Editing in TEITOK is made easy: if you spot an error in any type of annotation, you can just click on the word and correct the error. You can do that not only in the document view, but also from the KWIC list, meaning you can search for very specific contexts where you know there are errors. You can also use the verticalized view to edit fields in a more structured way. But the most powerful editing mode in TEITOK is the multi-edit: you use a CQL query to edit multiple tokens in one go.

The simplest use of this works as follows: say we spot that all occurrences of the word betwixt have been marked as a noun, whereas they should all have been prepositions. You go to the search page (index.php?action=cqp) and use the query [form="betwixt"]. This will give you the list of all the occurrences of betwixt in the corpus. If you are logged in, there will be a link Use this query for multi-token edit at the bottom of the search list. Clicking that link brings you to a page where you can type in corrections for all those occurrences. So if we want to change the part-of-speech to PREP, you just type PREP in the corresponding field (typically pos or upos), and all those occurrences will be changed to PREP, regardless of the value they had before.
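To make the effect of such an edit concrete, here is a minimal Python sketch of what a multi-edit conceptually does. It is not TEITOK's own code: it assumes that tokens are stored as tok elements with the annotation in attributes such as pos, and the sample document and the function name set_attribute_for_form are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real corpus files are full TEI/XML.
SAMPLE = """<text>
  <tok id="w-1" pos="NOUN">betwixt</tok>
  <tok id="w-2" pos="DET">the</tok>
  <tok id="w-3" pos="NOUN">Betwixt</tok>
  <tok id="w-4" pos="NOUN">betwixt</tok>
</text>"""


def set_attribute_for_form(root, form, attribute, value):
    """Set `attribute` to `value` on every <tok> whose text equals `form`.

    Returns the ids of the tokens that were changed.
    """
    changed = []
    for tok in root.iter("tok"):
        if (tok.text or "") == form:
            # The old value is overwritten, whatever it was.
            tok.set(attribute, value)
            changed.append(tok.get("id"))
    return changed


root = ET.fromstring(SAMPLE)
changed_ids = set_attribute_for_form(root, "betwixt", "pos", "PREP")
print(changed_ids)  # ['w-1', 'w-4']
```

The token w-3 is left alone because the comparison is case-sensitive, just like the query [form="betwixt"].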

The multi-edit, however, does not let you just change a lot of occurrences - it forces you to verify them. Without that, it would be too easy to unintentionally make incorrect changes to your corpus, after which it is hard to change them back. So it shows a list of the first occurrences, and you have to confirm which of those should be modified - there is a "select all" button at the bottom. This is mostly to make you check that your search was not too broad - the word betwixt can also be an adverb in constructions like in betwixt. So in this case, you can select all occurrences except those that are adverbs.

The list only shows the first occurrences - typically the first 500, although you can change that. That is not so much because it is hard to verify a large list, but for a technical reason: the correction works via an HTML POST request, and there is a hard limit of 1000 items per POST request. To do the next batch, you either have to jump to the next selection, or change the first batch first, reindex the corpus, and then search again. The latter only works if the change removes the edited items from the search results, so for that we would need to refine the query to [form="betwixt" & pos != "PREP"]. You need to reindex because the search is done in the indexed CQP corpus, not in the XML files, so until you reindex, the changes will not be reflected in the search - which is also why there is a large warning text at the top of the page.
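The cycle of editing, reindexing and searching again can be simulated in a few lines of Python. This is only a sketch under simplified assumptions: the corpus is a plain list of dictionaries, the "index" is a snapshot copy of it, and the names search, xml_tokens and BATCH_SIZE are invented for the example.

```python
BATCH_SIZE = 500  # the typical number of results shown per page


def search(index, form, exclude_pos):
    """Mimic [form="..." & pos != "..."]: return positions of matching tokens."""
    return [
        position
        for position, tok in enumerate(index)
        if tok["form"] == form and tok["pos"] != exclude_pos
    ]


# Hypothetical corpus with 1234 wrongly tagged occurrences.
xml_tokens = [{"form": "betwixt", "pos": "NOUN"} for _ in range(1234)]
index = [dict(tok) for tok in xml_tokens]  # the searchable snapshot

rounds = []
while True:
    hits = search(index, "betwixt", "PREP")
    if not hits:
        break
    batch = hits[:BATCH_SIZE]
    for position in batch:
        xml_tokens[position]["pos"] = "PREP"  # the edit goes to the XML side
    index = [dict(tok) for tok in xml_tokens]  # reindex, so the search sees it
    rounds.append(len(batch))

print(rounds)  # [500, 500, 234]
```

Without the pos != "PREP" condition the loop would keep finding the same first 500 tokens, and without the reindex step the search would never see that they had been changed.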

Contextual Search

A more complex use of the multi-edit module is the contextual search: we only want to change certain tokens, say all the occurrences of betwixt after a preposition. The search for that is simple enough: [pos="PREP"] [form="betwixt" & pos != "PREP"]. But running that query will put a warning at the bottom: (Query cannot be used for multi-token edit since all results span more than one word). That is because we are now looking for two words, and the system does not know which of the two you want to edit. The warning says "all" because there are queries like [pos="ADJ"]+ that will sometimes find one token, but on other occasions more than one. In those cases, the system will only let you edit the results that consist of a single token.

To use multi-edit with multi-token queries, we need to explicitly tell the system which word we are trying to edit. You do this with the CQP target function, where you indicate the target in a search by adding an @ in front of it. So the correct search for multi-edit would in this case be: [pos="PREP"] @[form="betwixt" & pos != "PREP"]. Because the target token is now indicated explicitly, you can select multi-edit on this query, and it will only affect the second token.
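The rule for which token gets edited can be summarized in a small Python function. This is a toy model of the behaviour described here, not the actual implementation: it assumes each result is given as inclusive corpus positions for the start and end of the match plus an optional target position, and the name editable_token is hypothetical.

```python
from typing import Optional


def editable_token(start: int, end: int, target: Optional[int] = None) -> Optional[int]:
    """Return the corpus position to edit for one result, or None if unclear.

    `start` and `end` are the inclusive positions of the match; `target` is
    the position of the token marked with @ in the query, if there is one.
    """
    if target is not None:
        return target  # an explicit target always decides
    if start == end:
        return start   # a one-token result is unambiguous
    return None        # several tokens and no target: not editable


# [pos="ADJ"]+ matching one token, then three tokens:
print(editable_token(10, 10))      # 10
print(editable_token(20, 22))      # None
# [pos="PREP"] @[form="betwixt" & pos != "PREP"] matching positions 30-31:
print(editable_token(30, 31, 31))  # 31
```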

Individual Changes

Instead of refining our search, we can also choose to go through the list by hand. If we know that betwixt is often tagged incorrectly, but there is no easy way to determine what the correct tag would be (which is for instance the case for the Spanish que), then we search for [form="betwixt"], click on multi-edit, and then, instead of changing things directly, we select the Click here to enter individual values for each result link. This will first ask you what you want to change - in this case the pos. Once you select that, the system will show you all the occurrences, each followed by an edit box where you can type in the correct tag - either PREP or ADV in this case.

The individual change also lets you apply a systematic change first. Say we want to correct the lemma for all occurrences of [form="e[ea]pt" & pos="VERB"] in the corpus, because the tagger somehow kept the t in the lemma (so lemmatizing leapt to leapt instead of leap). We can have the system make the edit easier by already removing the t at the end, if there is one, using a regular expression: s/t$//g.
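For readers less used to the s/.../.../ notation, the same substitution can be tried out in Python with re.sub. The function name suggest_lemma and the list of lemmas are invented for this sketch; it only shows what the regular expression does to a value.

```python
import re


def suggest_lemma(current: str) -> str:
    """Python equivalent of s/t$//: drop a "t" at the end, if there is one."""
    return re.sub(r"t$", "", current)


# Hypothetical lemma values as the tagger might have left them.
for lemma in ["leapt", "leap", "taught"]:
    print(lemma, "->", suggest_lemma(lemma))
# leapt -> leap
# leap -> leap
# taught -> taugh
```

The last line shows why the result is best treated as a suggestion to review for each occurrence: the pattern removes any final t, including ones that belong to the lemma, and only the t at the very end of the value is affected.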
