C-ORAL-ROM develops a highly innovative multilingual spoken resource, simultaneously ensuring:
- Representativeness of spontaneous speech,
- Comparability of the four Romance language resources,
- Simultaneous access to acoustic and textual information,
- Reusability of textual and acoustic data.
The comparable Romance Spoken Corpus is defined by means of common sampling criteria and the same proportions of variation:
- Samples are about 300,000 words for each language
- Same sociolinguistic variation
- Large proportion of spontaneous informal speech
- Same proportions: informal speech 50%, formal speech 35%, media speech 15%
- Same degree of acoustic variation
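As a worked illustration of the proportions above, the following minimal sketch splits the approximate 300,000-word sample of one language across the three speech types. The function name `word_budget` and the dictionary keys are hypothetical, chosen only for this example; the figures are the ones stated in the list, and the resulting word counts are approximate targets rather than actual corpus sizes.

```python
# Approximate per-language sample size, as stated in the text.
SAMPLE_WORDS = 300_000

# Proportions as stated in the text, in percent.
# The key names are hypothetical labels for this illustration.
PROPORTIONS = {"informal": 50, "formal": 35, "media": 15}


def word_budget(total_words, proportions):
    """Return the approximate number of words per speech type."""
    if sum(proportions.values()) != 100:
        raise ValueError("proportions must sum to 100")
    return {name: total_words * pct // 100 for name, pct in proportions.items()}


budget = word_budget(SAMPLE_WORDS, PROPORTIONS)
print(budget)
# {'informal': 150000, 'formal': 105000, 'media': 45000}
```

Applying the same split to each of the four languages is what makes the resources comparable in composition.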
Kick-off decision on the C-ORAL-ROM sampling of the informal corpus
Standard textual format:
- EU standards
- Advisors' recommendations
- Prosodic segmentation in utterances and tone units
- XML and bare TXT macro conversion
- Software for textual information retrieval
Standard acoustic format:
- Non-compressed .wav format audio file, 22050 Hz, 16 bit
- WINPITCH CORPUS sound/text alignment
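The acoustic format above can be checked mechanically. The sketch below uses Python's standard `wave` module to test whether a file is an uncompressed WAV at 22050 Hz and 16 bit. The function name `matches_acoustic_format` is hypothetical and not part of any C-ORAL-ROM tool. The number of channels is not stated in the text, so it is not checked here.

```python
import wave

# Values taken from the acoustic format stated in the text.
EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def matches_acoustic_format(path):
    """Return True if the file is uncompressed WAV at 22050 Hz, 16 bit."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getcomptype() == "NONE"
                and wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            )
    except wave.Error:
        # Raised for files the wave module cannot parse,
        # including compressed WAV variants.
        return False
```

A call such as `matches_acoustic_format("sample.wav")`, where the file name is a placeholder, returns `True` only when all three conditions hold.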