Integrated reference corpora for spoken Romance languages

Multimedia edition; tools of analysis; standard linguistic measures for validation in HLT

  • The aim of the C-ORAL-ROM project is to provide the linguistic community and the speech industry with a comparable set of corpora of spontaneous spoken language for the main Romance languages, namely French, Italian, Portuguese and Spanish.
  • The Spoken Romance Corpus is a sample of spontaneous spoken language, recorded in free situations, with roughly 300,000 words for each language.
  • The corpora are drawn mainly from the language resources that each partner has already set up, which are probably the main ones currently available in each country.
  • Textual information and sound source are delivered aligned in a multimedia DVD edition and are integrated with high-performance tools for both sound and text analysis.
  • The Corpus edition is associated with:
    • models of spoken language and standard linguistic measures derived from corpus analysis;
    • a comparison of syntactic structures across the spoken Romance languages;
    • classification of texts by genre, based on their internal linguistic properties.
  • Validation of the C-ORAL-ROM corpora is performed by lead users on actual speech recognition tools, and the corpora are distributed in both the academic and industrial market sectors.