<TEI:TOK>

Powered by <TEI:TOK>
Maarten Janssen, 2014-

TEITOK Help Pages

Working with spoken data

TEITOK can work with spoken corpora of various types: not only textual corpora consisting of transcribed oral data, but also proper oral transcriptions, and even time-aligned transcriptions. It can do this either for an entirely oral corpus or for a mixed corpus in which some files are spoken while others are written. TEITOK can display the audio file on top of the text so that you can listen to the text; in principle this can even be the audio track of a video file. To make TEITOK work properly with spoken data, several parts need to be configured, and this page gives an overview of the important aspects. All of these are general settings; there is also a dedicated interface for time-aligned spoken data: the wavesurfer interface.

Recommended XML codes

XML files in TEITOK follow the TEI/XML guidelines. The section on spoken data in the TEI guidelines is very sparse compared to the rest of the framework, and the description is not always clear. Below is a short list of the codes established as best practice among the spoken projects in TEITOK. Codes written with a trailing slash, such as <pause/>, are self-closing tags without any content inside, while the other tags say something about whatever is between the opening and closing tag, so <unclear>Betty</unclear> represents a segment where the speaker probably says "Betty". All these codes are typically used inside utterances: <u>.

XML          Explanation
<pause/>     A pause. To differentiate between short and long pauses you can use type="long" and type="short".
<gap/>       A gap in the transcription. You can put the motivation behind the gap in the reason attribute, for instance reason="unintelligible".
<unclear>    A segment that has been transcribed, but for which the transcription is uncertain.
<del>        A segment that has been "deleted", that is, corrected by the speaker. There are three typical types:
             <del type="truncation"> - a truncated word / false start
             <del type="repetition"> - a repeated segment where the speaker "stutters"
             <del type="reformulation"> - where the speaker reformulates what they started
<vocal>      An extralinguistic element. The description goes inside: <vocal><desc>Uhm</desc></vocal> stands for an "uhm" sound.
<kinesic>    A kinesic element. The description goes inside: <kinesic><desc>Clap</desc></kinesic> stands for a clapping sound.
<anon>       An anonymized segment, with an optional type, say type="person". Can be used later to blank out the corresponding segments in the audio file semi-automatically.
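
As an illustration, several of the codes above might be combined in a single utterance roughly as follows. This is only a sketch: the speaker reference in @who, the wording, and the attribute values are invented for the example, and the text is shown before tokenization.

```xml
<!-- Hypothetical utterance: the speaker id and all words are invented -->
<u who="#spk1">
  well <pause type="short"/>
  <del type="repetition">I I</del> I saw
  <unclear>Betty</unclear>
  <vocal><desc>Uhm</desc></vocal>
  <del type="reformulation">on the</del> at the market with
  <anon type="person">NAME</anon>
  <gap reason="unintelligible"/>
</u>
```

The self-closing codes (<pause/>, <gap/>) mark a point in the utterance, while the paired codes wrap the stretch of speech they describe.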

Audio search results

For time-aligned spoken data, TEITOK can render search results that allow you to immediately listen to the corresponding audio. In order for this to work, a number of items need to be in place:

  • The transcription should be divided into utterances
  • Each utterance should have a @begin and an @end to indicate the time interval
  • The utterance with the time interval should be exported to the CQP corpus
  • The header should have a recording element that points to the audio file
  • The name of the audio file should be exported to the CQP corpus
  • The default result view should be set to context=utterance

With all of these properly set up, search results are presented as utterances, with the matching tokens highlighted. In front of each utterance is a play button that asks the browser to load the audio file and play the segment corresponding to that utterance.
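
The XML side of these requirements could look roughly like the sketch below. The file name, speaker ids, and time values are invented. The recordingStmt/recording/media layout follows general TEI usage and is an assumption about how a given project stores the audio path, as is the use of seconds for @begin and @end. The CQP export and the context=utterance setting are configured separately and are not shown.

```xml
<!-- Sketch only: audio path, speaker ids, and times are invented -->
<TEI>
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <recordingStmt>
          <!-- the recording element pointing to the audio file -->
          <recording type="audio">
            <media mimeType="audio/wav" url="Audio/interview01.wav"/>
          </recording>
        </recordingStmt>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- each utterance carries its own time interval (assumed to be in seconds) -->
      <u who="#spk1" begin="0.00" end="2.85">good morning everyone</u>
      <u who="#spk2" begin="2.85" end="5.10">good morning</u>
    </body>
  </text>
</TEI>
```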

Symbol-based rendering

In spoken data, much linguistic mark-up was traditionally expressed with special symbols. For instance, truncated words were often followed by an & sign to indicate that the word was truncated: trunc&. In TEI, truncated speech is marked up as <del type="truncation">. Since we can use colors in a browser, there is no need to restrict mark-up rendering to symbols, but that is not to say we cannot use symbol-based rendering: the teitok.css file in fact uses symbol-based rendering for truncations by default, in the following way:

#mtxt del[type=truncation]::after { content: '&'; }

This will put a pseudo-element after any <del type="truncation"> node, filled with the content '&'. So this will render <del type="truncation">trunc</del> as above: trunc&. Further CSS styles then make sure that any <del>, including the &, is displayed in grey.

This provides a convenient way to display mark-up in a form that is familiar to the target audience, while the symbols are generated by the computer and hence do not interfere with the text. However, there is a small complication: when the content of the deleted element is suppressed, which is what happens by default in the @form view, the pseudo-element is not suppressed, meaning the & stays behind as a ghost element. This cannot be solved in CSS alone, and hence TEITOK provides a JavaScript solution that CSS can build on: when changing views, any token that no longer has any content is given an attribute empty="1", which CSS can then use to suppress the associated pseudo-elements:

#mtxt *[empty="1"] { display: none; }

