Representation of spontaneous speech variation and sampling criteria

C-ORAL-ROM develops a highly innovative multilingual spoken resource that simultaneously ensures:

  • Representativeness of spontaneous speech,
  • Comparability of the four Romance language resources,
  • Simultaneous access to acoustic and textual information,
  • Reusability of textual and acoustic data.

The comparable Romance spoken corpus is defined by common sampling criteria and the same proportions of variation:

  • Samples are about 300,000 words for each language
  • Same sociolinguistic variation
  • Large proportion of spontaneous informal speech
  • Same proportions: informal speech 50%, formal speech 35%, media speech 15%
  • Same degree of acoustic variation
Kick-off decision on the C-ORAL-ROM sampling of the informal corpus
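As a rough illustration of what the proportions above imply for each language, the following minimal sketch turns the stated shares into approximate word targets. The 300,000-word total and the three percentages come from the list; the names `WORDS_PER_LANGUAGE`, `SHARES`, and `word_targets` are hypothetical and not part of any C-ORAL-ROM tool.

```python
# Target corpus size per language, from the sampling criteria ("about 300,000 words").
WORDS_PER_LANGUAGE = 300_000

# Shares from the list above; the dictionary keys are illustrative labels only.
SHARES = {"informal": 50, "formal": 35, "media": 15}


def word_targets(total_words, shares_percent):
    """Return the approximate word target for each speech type."""
    if sum(shares_percent.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {name: total_words * pct // 100 for name, pct in shares_percent.items()}


targets = word_targets(WORDS_PER_LANGUAGE, SHARES)
for name, words in targets.items():
    print(f"{name}: about {words:,} words")
```

Under these assumptions the targets are about 150,000 words of informal speech, 105,000 of formal speech, and 45,000 of media speech per language.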

Standard textual format:

  • EU standards
  • Advisors’ recommendations
  • Prosodic segmentation into utterances and tone units
  • XML and bare TXT macro conversion
  • Software for textual information retrieval

Standard acoustic format:

  • Uncompressed .wav audio files, 22,050 Hz, 16 bit
  • WINPITCH CORPUS sound/text alignment
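
The acoustic format above can be checked mechanically. The sketch below uses Python's standard `wave` module to test whether a file is uncompressed PCM at 22050 Hz and 16 bit. It is an illustration only, not a tool distributed with the corpus: the function names are hypothetical, the silent test files are generated purely for the demonstration, and the number of channels is not checked because the format description does not state it.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def matches_acoustic_format(path):
    """Return True if the file is uncompressed PCM WAV at 22050 Hz, 16 bit."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getcomptype() == "NONE"
                and wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            )
    except wave.Error:
        # Raised for files the wave module cannot read as PCM WAV.
        return False


def write_silence(path, rate_hz, seconds=1):
    """Write a mono 16-bit WAV file of silence, used here only as test data."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 22050)  # matches the stated format
    write_silence(bad, 44100)   # wrong sampling rate
    good_ok = matches_acoustic_format(good)
    bad_ok = matches_acoustic_format(bad)

print(good_ok, bad_ok)
```

A check of this kind could be run over a directory of recordings before alignment, so that files with a different sampling rate or sample width are flagged early.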