Multimedia edition; tools of analysis;
standard linguistic measures for validation in HLT
The aim of the C-ORAL-ROM project is
to provide the linguistic community and the speech industry with a comparable
set of corpora of spontaneous spoken language for the main Romance languages,
namely French, Italian, Portuguese and Spanish.
The Spoken Romance
Corpus is a sample of spontaneous spoken language, recorded in
free situations, with roughly 300,000 words for each language.
The corpora are mainly
extracted from the language resources that each partner has already set
up, which are probably the main ones now available in each country.
Textual information and sound sources
are delivered aligned in a multimedia DVD edition and are integrated with
high-performance tools for both sound and text analysis.
The corpus edition
is associated with:
models of spoken
language and standard linguistic measures derived from corpus analysis;
a comparison of syntactic structure across languages;
a classification of texts by genre
based on their internal linguistic properties.
Validation of the C-ORAL-ROM corpora is performed by
lead users on actual speech recognition tools, and the corpora are distributed
in both the academic and industrial market sectors.