Multi-media edition; tools of analysis;
standard linguistic measures for validation in HLT
-
The aim of the C-ORAL-ROM project is
to provide the linguistic community and the speech industry with a comparable
set of corpora of spontaneous spoken language for the main Romance languages,
namely French, Italian, Portuguese and Spanish.
-
The Spoken Romance
Corpus is a sampling of spontaneous spoken language, recorded in
free situations, with roughly 300,000 words for each language.
-
The corpora are drawn mainly
from the language resources that each partner has already set
up, which are probably the main ones now available in each country.
-
Textual information and sound sources
are delivered aligned in a DVD multimedia edition and are integrated with
high-performance tools for both sound and text analysis.
-
The Corpus edition
is associated with:
-
models of Spoken
Language and Standard Linguistic Measures derived from Corpora Analysis;
-
Spoken Romance
Language Syntactic Structure Comparison;
-
Text Genre Classification
based on the texts' internal linguistic properties.
-
C-ORAL-ROM corpora are validated by
leading users on actual speech recognition tools and are distributed
in both the academic and industrial market sectors.