C-ORAL-ROM Project

(IST2000 26228)

Annual Public Report 2003

 

SUMMARY

 

 

Evaluation of C-ORAL-ROM prosodic tagging

 

The C-ORAL-ROM corpora (roughly 150 hours) are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks were annotated through perceptual judgments on the textual strings; terminal breaks are assumed to be the crucial index that roughly marks the utterance limit.

The reliability of the C-ORAL-ROM prosodic labeling and annotation scheme has been evaluated by an external institution (LOQUENDO, Telecom Italia), as requested of the consortium in the mid-term review.

 

Automatic lemmatization and morpho-syntactic tagging of the Italian, French, Spanish and Portuguese corpora

The four C-ORAL-ROM resources provide a textual entry in which each token is tagged with a lemma and a part of speech (PoS). PoS tagging and lemmatization have been carried out automatically with existing tools and tag sets, in accordance with the different traditions of each corpus provider. To ensure comparability across the whole corpus, a compulsory minimal threshold of information has been established in the tag codes and a common format has been defined. Specific problems concerning PoS tagging of spontaneous speech have been highlighted.

 

 

 

Linguistic studies on spoken corpora

 

The four corpora have been the object of a parallel set of automatic analyses with respect to a restricted set of linguistic indexes that are assumed to be significant for modeling spoken language complexity.

The data are based on the exploitation of the C-ORAL-ROM corpus design and linguistic annotations (utterance limits, prosodic tagging, PoS tagging) and provide a quantitative basis for the induction of comparative values for spoken language in the Romance and

 

 

 

 

 

 

Measurements of spoken language variability

 

The C-ORAL-ROM multilingual corpus offers a representation of the main contexts of use of the spoken domain for French, Italian, Portuguese and Spanish and bears a set of relevant linguistic information suitable for modeling spoken language variability.

The average and the variation coefficient of the following standard variation parameters have been calculated across the corpus design structure:

· Fragmentation
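
As an illustration of the two statistics named above, the following is a minimal Python sketch. It assumes that the variation coefficient is the standard deviation divided by the mean and that the population standard deviation is used; the report does not state which estimator was applied. The function name and the sample values are hypothetical and are not C-ORAL-ROM data.

```python
from statistics import mean, pstdev


def variation_coefficient(values):
    """Return (mean, coefficient of variation) for a non-empty list of numbers.

    Assumption: coefficient of variation = population std. deviation / mean.
    """
    if not values:
        raise ValueError("values must be non-empty")
    avg = mean(values)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return avg, pstdev(values) / avg


# Hypothetical per-section values of one parameter (illustrative only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
avg, cv = variation_coefficient(sample)
print(f"mean={avg:.2f} cv={cv:.2f}")
```

Because the variation coefficient is normalized by the mean, it allows parameters measured on different scales to be compared across the sections of the corpus design.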

 

 

 

Human - Machine Interactions

 

A multilingual system for Automatic Train Information (Italian, French and Spanish) has been developed by ITC-irst. A sample of the phone calls used for training the system has been transcribed in C-ORAL-ROM format, with a check of speech recognition errors.

The transcripts have been inserted in the corpus (10,000 words for the Italian and Spanish sub-corpora).