C-ORAL-ROM Project
IST-2000 26228
ANNUAL REPORT 2002
The multilingual C-ORAL-ROM corpus delivered in year 1 has undergone internal
and external assessment to ensure comparability among the Italian, French,
Portuguese and Spanish corpora. The assessment covered the corpus target and
the metadata structure. The corpora have been delivered internally and are
also published in a demo version.
All the textual information in C-ORAL-ROM (1,200,000 words) is tagged with
respect to terminal and non-terminal prosodic breaks. The uniform application
of the prosodic tagging criteria by C-ORAL-ROM transcribers has been tested
and the prosodic tagging revised.
Measurements of spoken language variability with respect to the corpus
structure have been generated on the basis of the prosodic tagging and are
available online. The validation of the prosodic tagging by independent
operators is planned for year 3.
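
As an illustration of how measurements can be derived from prosodic tagging,
the following is a minimal Python sketch, not the project's actual procedure.
It assumes, purely for the example, that a terminal break is written `//` and
a non-terminal break `/`; the function names, the sample text and the two
measures (words per utterance, tone units per utterance) are hypothetical.

```python
TERMINAL = "//"      # assumed marker for a terminal prosodic break
NON_TERMINAL = "/"   # assumed marker for a non-terminal prosodic break


def split_utterances(text):
    """Split a tagged transcript into utterances at terminal breaks."""
    return [u.strip() for u in text.split(TERMINAL) if u.strip()]


def split_tone_units(utterance):
    """Split one utterance into tone units at non-terminal breaks."""
    return [t.strip() for t in utterance.split(NON_TERMINAL) if t.strip()]


def variability_measures(text):
    """Return simple per-utterance averages for a tagged transcript."""
    utterances = split_utterances(text)
    # Count words after blanking out the non-terminal break markers.
    words = [len(u.replace(NON_TERMINAL, " ").split()) for u in utterances]
    units = [len(split_tone_units(u)) for u in utterances]
    n = len(utterances)
    return {
        "utterances": n,
        "words_per_utterance": sum(words) / n if n else 0.0,
        "tone_units_per_utterance": sum(units) / n if n else 0.0,
    }


# Hypothetical sample, not taken from the corpus.
sample = "allora / domani partiamo // alle otto / credo // va bene così //"
measures = variability_measures(sample)
```

Averages of this kind can then be compared across the sections of a corpus,
provided the break markers are applied uniformly by all transcribers.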
C-ORAL-ROM crucially provides for the synchronization of each transcribed
utterance with the corresponding acoustic signal and the simultaneous
generation of databases of all utterances in the resource. The main part of
the work in year 2 has been devoted to accomplishing this alignment with the
speech software Win Pitch Corpus, which allows full exploitation of both
textual and acoustic information, and to XML entry.
The DTD for the C-ORAL-ROM textual format and a Perl script for the XML
conversion of txt files have been developed.
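
The general idea of converting txt transcripts to XML can be sketched as
follows. This is an illustrative Python sketch, not the project's Perl script,
and it does not follow the C-ORAL-ROM DTD. The input layout (speaker turns of
the form `*ABC: text`) and the element and attribute names (`transcript`,
`turn`, `speaker`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def txt_to_xml(lines):
    """Convert speaker-turn lines into a small XML document (as a string)."""
    root = ET.Element("transcript")  # hypothetical root element name
    for line in lines:
        line = line.strip()
        # Assumed layout: "*SPK: text". Any other line is skipped.
        if not line.startswith("*") or ":" not in line:
            continue
        speaker, text = line[1:].split(":", 1)
        turn = ET.SubElement(root, "turn", speaker=speaker.strip())
        turn.text = text.strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample, not taken from the corpus.
sample = [
    "@Title: example header line, ignored by this sketch",
    "*ANA: allora / domani partiamo //",
    "*LUC: va bene //",
]
xml_output = txt_to_xml(sample)
```

A real converter would additionally validate its output against the DTD,
which the standard-library module used here does not do.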
The development of a multilingual automatic train information service is at
an advanced stage. The collection of human-machine dialogues for Italian has
been completed. The collections for French and Spanish are in progress.
The project was presented at LREC 2002.
Corpus assessment and metadata
Prosodic segmentation and standard measurement of spoken language variability
Text to speech synchronization
Standard textual and acoustic entries
Dissemination http://lablita.dit.unifi.it/coralrom/papers/index.html