with respect to prosodic breaks. C-ORAL-ROM terminal and non-terminal breaks are annotated through perceptual judgments on the textual strings and are the crucial index that limit.
The level of reliability of C-ORAL-ROM and its annotation scheme has been evaluated by an external institution (LOQUENDO, Telecom Italia), as requested of the consortium in the mid-term review.
The four C-ORAL-ROM resources provide a textual entry where each token is tagged with a lemma and a PoS. PoS tagging and lemmatization have been accomplished automatically with existing tools and tag sets, according to the different traditions of each corpus provider. To ensure comparability across the whole corpus, a compulsory minimal threshold of information has been established in the tag codes and a common format has been defined. Specific problems concerning PoS tagging of spontaneous speech have been highlighted.
The four corpora have been the object of a parallel set of automatic analyses with respect to a restricted set of linguistic indexes that are assumed to be significant for modeling spoken language complexity.
Data are based on the exploitation of the C-ORAL-ROM corpus design and linguistic annotations (utterance limits, prosodic tagging, PoS tagging) and provide a quantitative basis for the induction of comparative values for spoken language in the Romance and
The C-ORAL-ROM multilingual corpus offers a representation of the main contexts of use of the spoken domain for French, Italian, Portuguese and Spanish, and bears a set of relevant linguistic information suitable for modeling spoken language variability.
The average and the variation coefficient of the following standard variation parameters have been calculated through the corpus design structure:
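As an illustration of the two statistics named above, the following minimal sketch computes the average and the variation coefficient (standard deviation divided by the mean) for groups of measurements. The grouping names and the numeric values are hypothetical and are not taken from the corpus. The use of the population standard deviation is also an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / m


# Hypothetical per-text measurements (for example, words per utterance),
# grouped by a node of the corpus design. These are not real corpus values.
samples = {
    "formal": [8.0, 10.0, 12.0],
    "informal": [4.0, 6.0, 8.0],
}

# For each group, store the pair (average, variation coefficient).
summary = {
    name: (mean(values), coefficient_of_variation(values))
    for name, values in samples.items()
}

for name, (avg, cv) in summary.items():
    print(f"{name}: average={avg:.2f}, variation coefficient={cv:.3f}")
```

Because the variation coefficient is dimensionless, it allows the dispersion of parameters measured on different scales to be compared across groups and languages.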
A multilingual system for Automatic Train Information has been developed by ITC-irst (Italian, French and Spanish). A sample of the phone calls used for training the system has been transcribed in C-ORAL-ROM format, with checking of speech recognition errors.
The transcripts have been inserted in the corpus (10,000 words for the Italian and Spanish sub-corpora).