Comparability issue in multilingual resources of spontaneous speech


Comparable corpora are essential to the development of a multilingual language resource, but they are usually built in one of two ways, both unsatisfactory for a correct representation of spontaneous spoken language:

1) Parallel corpora

Their acoustic/phonetic quality is excellent and their comparability is perfect, but parallel corpora cannot be produced without losing spontaneity.

2) Resources collected in controlled environments with a restricted semantic domain (telephone information, health information, etc.)

Their acoustic/phonetic quality is excellent but they do not represent spontaneous speech for two connected reasons:

  1. linguistic behaviour in controlled environments is highly predictable;
  2. they do not represent the structural variety of speech at the syntactic, semantic, and pragmatic levels and, crucially, with regard to intonation.

Spontaneous speech is characterised by:

  • variable sound quality;

  • face-to-face dialogue;

  • mental programming simultaneous with vocal execution;

  • contextually undetermined linguistic behaviour (unpredictable, i.e. free);

  • non-repeatability: insofar as spontaneous linguistic behaviour is unpredictable, it cannot be repeated.

Variability is the main property of spontaneous spoken texts, and any representation of spontaneous speech must deal with it.

As a consequence, in a multilingual resource, the more variability each language resource represents, the harder it is to claim that it is comparable with the others.

The solution presented in C-ORAL-ROM is based on the definition of a set of variation parameters, mainly concerning features of the speakers and the context of use, which a long tradition of sociolinguistic studies has shown to be responsible for the variability of spoken language.

Variation parameters of spoken language in C-ORAL-ROM