|
Comparable corpora are essential to the development of a multilingual
language resource, but they are usually set up in one of two ways, both
unsatisfactory for a correct representation of spontaneous spoken
language:
1) Parallel corpora
Their acoustic/phonetic quality is excellent and their comparability is
perfect, but it is of course impossible to build parallel corpora
without losing spontaneity.
2) Resources collected in controlled environments with a restricted
semantic domain (telephone information, health information, etc.)
Their acoustic/phonetic quality is excellent, but they do not represent
spontaneous speech for two connected reasons:
- linguistic behaviour in controlled environments is highly
predictable;
- they do not represent the structural variety of speech at the
syntactic, semantic and pragmatic levels and, crucially, with regard
to intonation.
Spontaneous speech is characterised by:
- variable sound quality;
- face-to-face dialogue;
- mental programming simultaneous with vocal execution;
- contextually undetermined linguistic behaviour (unpredictable, i.e. free).
Insofar as spontaneous linguistic behaviour is unpredictable, it
cannot be repeated.
Variability is the main property of spontaneous spoken texts, and the
representation of spontaneous speech must deal with it.
As a consequence, in a multilingual resource, the more variability is
represented in each language resource, the harder it is to claim that
the resource is comparable with the others.
The solution presented in C-ORAL-ROM is based on the definition of a set
of variation parameters, mainly regarding features of speakers
and context of use, that have been shown to be
responsible for the variability of spoken language in a long
tradition of sociolinguistic studies.
Variation parameters of spoken language in C-ORAL-ROM
|