The C-ORAL-ROM corpora (roughly 150 hours) are completely tagged with respect
to prosodic breaks. Terminal and non-terminal breaks were annotated through
perceptual judgments on the textual strings, and terminal breaks are assumed
to be the crucial index that roughly marks the utterance limit. The level of
reliability of the C-ORAL-ROM prosodic labeling and annotation scheme has been
evaluated by an external institution (LOQUENDO, Telecom Italia), as requested
of the consortium at the mid-term review.
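As a rough illustration of how terminal breaks can delimit utterances, the sketch below splits a transcribed string into utterances and then into prosodic units. It assumes, purely for the sake of the example, that terminal breaks are written `//` and non-terminal breaks `/`; this notation, the function name `split_utterances` and the sample string are not taken from this text and should be adapted to the actual transcription format.

```python
# Assumed notation (hypothetical, not stated in this text):
#   "//" marks a terminal break, "/" marks a non-terminal break.
TERMINAL = "//"
NON_TERMINAL = "/"


def split_utterances(transcript):
    """Return a list of utterances, each a list of prosodic units."""
    utterances = []
    # Split on terminal breaks first, so that "//" is never read as two "/".
    for chunk in transcript.split(TERMINAL):
        units = [unit.strip() for unit in chunk.split(NON_TERMINAL)]
        units = [unit for unit in units if unit]  # drop empty pieces
        if units:
            utterances.append(units)
    return utterances


# Invented sample string, for illustration only.
sample = "well / see you tomorrow // fine // what time / more or less //"

for number, units in enumerate(split_utterances(sample), start=1):
    print(number, units)
```

Under these assumptions the sample yields three utterances, two of which contain two prosodic units each.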
The four C-ORAL-ROM resources provide a textual entry in which each token is
tagged with a lemma and a part of speech (PoS). PoS tagging and lemmatization
were carried out automatically with existing tools and tag sets, according to
the different traditions of each corpus provider. To ensure comparability
across the whole corpus, a compulsory minimal threshold of information has
been established in the tag codes and a common format defined. Specific
problems concerning the PoS tagging of spontaneous speech have been
highlighted.
Linguistic studies on spoken corpora
The four corpora have been the object of a parallel set of automatic analyses
with respect to a restricted set of linguistic indexes that are assumed to be
significant for modeling spoken language complexity.
The data are based on the exploitation of the C-ORAL-ROM corpus design and
linguistic annotations (utterance limit, prosodic tagging, PoS tagging) and
provide a quantitative basis for the induction of comparative values for
spoken language in the Romance and
Measurements of spoken language variability
The C-ORAL-ROM multilingual corpus offers a representation of the main
contexts of use of the spoken domain for French, Italian, Portuguese and
Spanish, and bears a set of relevant linguistic information suitable for
modeling spoken language variability.
The average and the variation coefficient of the following standard variation
parameters have been calculated across the corpus design structure:
· Fragmentation
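The two statistics named above can be sketched as follows. This is a minimal example, assuming that the variation coefficient is the standard deviation divided by the mean and that the population standard deviation is used; the text does not say which variant was applied. The node names and the numeric values are invented placeholders, not corpus results.

```python
from statistics import mean, pstdev


def variation_coefficient(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    average = mean(values)
    if average == 0:
        raise ValueError("variation coefficient is undefined for a zero mean")
    return pstdev(values) / average


def summarize(values_by_node):
    """Map each corpus-design node to (average, variation coefficient)."""
    return {
        node: (mean(values), variation_coefficient(values))
        for node, values in values_by_node.items()
    }


# Hypothetical per-text values of one parameter, grouped by an invented
# corpus-design node; real values would come from the annotated corpus.
parameter_values = {
    "node_a": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "node_b": [3.0, 3.0, 3.0],
}

for node, (average, coefficient) in summarize(parameter_values).items():
    print(f"{node}: average={average:.2f}, variation coefficient={coefficient:.2f}")
```

Because the variation coefficient is dimensionless, it allows parameters measured on different scales to be compared across nodes of the corpus design.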
A multilingual system for Automatic Train Information has been developed by
ITC-irst (Italian, French and Spanish). A sample of the phone calls used for
training the system has been transcribed in C-ORAL-ROM format, with speech
recognition errors checked. The transcripts have been added to the corpus
(10,000 words for the Italian and Spanish sub-corpora).