Utterance, prosody and multimedia: access to acoustic information and representation of phonetics and prosody in a multimedia corpus

To give better access to the linguistic information of spoken language, the speech software WinPitch Corpus integrates the multimedia resource, ensuring text-sound alignment and simultaneous acoustic analysis:

  • functions for sound/text alignment: insertion of text tags based on tags placed on the sound wave;
  • slowing down of the acoustic signal for easy and precise tag insertion;
  • real-time analysis of the sound signal with respect to the main vocal parameters (F0, duration, intensity, spectrum), for signals of unlimited length.
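
As an illustration of what measuring F0 and intensity on a signal involves, the following is a minimal sketch of two textbook measures computed on a single frame: the RMS level in dB and an F0 estimate taken from the autocorrelation peak. It is not WinPitch Corpus code, and the source does not say which algorithms the software uses; the function names, the 75-500 Hz search range and the synthetic test tone are all assumptions made for the example.

```python
import math


def rms_intensity_db(frame):
    """RMS level of a frame in dB relative to an amplitude of 1.0."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak.

    The search range f0_min..f0_max is an assumed default, not a value
    taken from the source. Returns None if no positive peak is found.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Synthetic 200 Hz tone, 50 ms long, sampled at 8 kHz (illustrative data only).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]

f0 = estimate_f0(tone, SAMPLE_RATE)
level = rms_intensity_db(tone)
```

A real analyser works frame by frame over the whole recording and must also decide whether each frame is voiced at all; the sketch leaves both steps out.
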
The design of the C-ORAL-ROM multimedia storage of spoken language resources is based on the selection of a natural alignment unit that is also identified as a basic tagging level in textual corpora, i.e. the utterance:
  • word-based alignment is meaningless for prosodic reasons: words are co-articulated within prosodic units, and the acoustic effect of a word-based alignment is perceptually unnatural;
  • syllable-based alignment is extremely expensive, and the aligned units are not meaningful linguistic entities (syllables do not have a meaning);
  • utterance-based alignment is both meaningful from a linguistic point of view and natural from a perceptual point of view.
In C-ORAL-ROM all the textual information is tagged simultaneously with respect to prosodic parsing and utterance limits: each prosodic unit corresponding to an utterance is aligned to its textual counterpart.
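
The idea of utterance-based alignment can be sketched as a small data structure in which each utterance carries its text together with the start and end time of the matching stretch of sound. This is only an illustration under stated assumptions: the class name, the field names, the helper functions and the sample utterances are hypothetical and do not reproduce the actual C-ORAL-ROM or WinPitch Corpus file format.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedUtterance:
    """One utterance with its time span in the recording (hypothetical schema)."""
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str
    text: str


def utterance_at(utterances, time_s):
    """Return the utterance whose span contains time_s, or None.

    Assumes the list is sorted by start time and spans do not overlap.
    """
    starts = [u.start for u in utterances]
    i = bisect_right(starts, time_s) - 1
    if i >= 0 and time_s < utterances[i].end:
        return utterances[i]
    return None


def find_by_word(utterances, word):
    """Return every utterance whose text contains the given word."""
    word = word.lower()
    return [u for u in utterances if word in u.text.lower().split()]


# Invented sample data, for illustration only.
utterances = [
    AlignedUtterance(0.00, 1.40, "A", "where did you put it"),
    AlignedUtterance(1.65, 2.30, "B", "on the table"),
    AlignedUtterance(2.30, 4.10, "A", "I did not see it there"),
]

hit = utterance_at(utterances, 1.9)
matches = find_by_word(utterances, "it")
```

With such a structure, a search on the text side returns time spans that can be played back or analysed directly, and a click on the sound wave leads back to the corresponding transcript line.
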

A careful study of prosody, carried out to achieve utterance-based alignment, is one of the main features of the C-ORAL-ROM Project. The result is extremely significant for the exploitation of the resulting resource: C-ORAL-ROM can be seen as a database of natural utterances.

Such a database can be exploited to study the syntactic, prosodic, action-value and lexical properties of natural utterances at both the acoustic and the textual level.

Utterance-based alignment defined on highly prominent prosodic cues is a proposed standard for spoken multimedia archives.

The selection of the textual units corresponding to an utterance is based on the highly identifiable prosodic properties that the linguistic entities corresponding to an utterance have at the perceptual level. The notion of utterance in spoken language is defined on theoretical grounds.