C-ORAL-ROM Project

IST-2000 26228

 

ANNUAL REPORT 2002

 

 

Summary

 

The multilingual C-ORAL-ROM corpus delivered in year 1 has undergone internal and external assessment to ensure comparability among the Italian, French, Portuguese and Spanish corpora. The assessment covered the corpus target and the metadata structure. The corpora have been delivered internally and are also published in a demo version.

All the textual information in C-ORAL-ROM (1,200,000 words) is tagged with respect to terminal and non-terminal prosodic breaks. The uniform application of the prosodic tagging criteria by C-ORAL-ROM transcribers has been tested and the prosodic tagging revised.

Measurements of spoken language variability with respect to the corpus structure have been generated on the basis of the prosodic tagging and are available online.

The validation of prosodic tagging by independent operators is foreseen in year 3.

C-ORAL-ROM crucially requires the synchronization of each transcribed utterance with the corresponding acoustic signal and the simultaneous generation of databases of all utterances in the resource. The main part of the work in year 2 has been devoted to accomplishing this alignment with the speech software Win Pitch Corpus, which provides full exploitation of both textual and acoustic information, and XML entry.

The DTD for the C-ORAL-ROM textual format and a Perl script for the XML conversion of txt files have been developed.
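As an illustration only, the kind of txt-to-XML conversion described above can be sketched as follows. This is not the project's Perl script and does not follow the C-ORAL-ROM DTD: it is a minimal Python sketch that assumes a hypothetical input format of `SPEAKER: text` lines, with `//` marking a terminal break and `/` marking a non-terminal break. The element names `text`, `utterance` and `unit` are likewise assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed markers (hypothetical, not taken from the C-ORAL-ROM specification):
TERMINAL = "//"      # terminal prosodic break, closes an utterance
NON_TERMINAL = "/"   # non-terminal prosodic break, closes a tone unit


def transcript_to_xml(lines):
    """Convert 'SPEAKER: text' lines into an XML tree of utterances."""
    root = ET.Element("text")
    for line in lines:
        line = line.strip()
        if not line:
            continue
        speaker, _, speech = line.partition(":")
        # Split on terminal breaks first: one piece per utterance.
        for chunk in speech.split(TERMINAL):
            chunk = chunk.strip()
            if not chunk:
                continue
            utt = ET.SubElement(root, "utterance", speaker=speaker.strip())
            # Then split on non-terminal breaks: one piece per tone unit.
            for unit in chunk.split(NON_TERMINAL):
                unit = unit.strip()
                if unit:
                    ET.SubElement(utt, "unit").text = unit
    return root


# Invented sample line, used only to exercise the sketch.
sample = ["ANA: good morning / I need a ticket // to Florence //"]
print(ET.tostring(transcript_to_xml(sample), encoding="unicode"))
```

In this sketch each terminal break closes one utterance element and each non-terminal break closes one unit inside it, so the prosodic tagging of the text file is carried over into the XML structure.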

The development of a multilingual automatic train information service is at an advanced stage. The collection of human-machine dialogues for Italian is complete. The collections for French and Spanish are under way.

The project was presented at LREC 2002.

 

Corpus assessment and metadata

Prosodic segmentation and standard measurement of spoken language variability

Text to speech synchronization

Standard textual and acoustic entries

Dissemination http://lablita.dit.unifi.it/coralrom/papers/index.html