Integrated reference corpora for spoken Romance languages.
Language resources description:
The C-ORAL-ROM resource is a multilingual corpus of spontaneous speech, of around 1,200,000 words, for the main Romance languages. The corpus collection is the result of the C-ORAL-ROM project, funded under the Fifth EU Framework Programme within the IST programme (IST2000-26228). The official project web page and full documentation are available at http://lablita.dit.unifi.it/app/coralrom.
The corpus consists of four comparable collections of recorded spontaneous speech sessions in Italian, French, Portuguese and Spanish, with the following figures:
| Language | WAV files | Size (GB) | Duration | Utterances | Words | Speakers | Male | Female |
|---|---|---|---|---|---|---|---|---|
| FRENCH | 206 | 3.77 | 26h 21' 43'' | 21010 | 295803 | 305 | 154 | 151 |
| ITALIAN | 204 | 5.19 | 36h 16' 10'' | 40402 | 310969 | 451 | 276 | 175 |
| PORTUGUESE | 152 | 4.43 | 29h 43' 42'' | 38855 | 317916 | 261 | 144 | 117 |
| SPANISH | 210 | 4.56 | 31h 06' 00'' | 35588 | 333482 | 410 | 247 | 163 |
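As a quick consistency check on the figures above, the following Python sketch sums the per-language values. It is not part of the resource, and the dictionary layout and variable names are illustrative assumptions. It shows that the word counts add up to roughly 1.2 million and that the male and female counts match the number of speakers in each collection.

```python
# Figures copied from the table above; the data layout is illustrative only.
corpus = {
    "FRENCH":     {"words": 295803, "speakers": 305, "male": 154, "female": 151, "duration": (26, 21, 43)},
    "ITALIAN":    {"words": 310969, "speakers": 451, "male": 276, "female": 175, "duration": (36, 16, 10)},
    "PORTUGUESE": {"words": 317916, "speakers": 261, "male": 144, "female": 117, "duration": (29, 43, 42)},
    "SPANISH":    {"words": 333482, "speakers": 410, "male": 247, "female": 163, "duration": (31, 6, 0)},
}

# Total number of words across the four collections.
total_words = sum(row["words"] for row in corpus.values())

# Total recording time, converted to seconds and back to h/min/s.
total_seconds = sum(
    h * 3600 + m * 60 + s
    for h, m, s in (row["duration"] for row in corpus.values())
)
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

# Male + female should equal the number of speakers in every collection.
for name, row in corpus.items():
    assert row["male"] + row["female"] == row["speakers"], name

print(total_words)                          # 1258170
print(f"{hours}h {minutes}' {seconds}''")   # 123h 27' 35''
```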
The C-ORAL-ROM databases are anonymous. The collections are delivered by the following copyright holders, respectively:
Italian Corpus: © Università di Firenze (Dipartimento di Italianistica, LABLITA) [1];
French Corpus: © Université de Provence (Description Linguistique Informatisée sur Corpus);
Portuguese Corpus: © Fundação da Universidade de Lisboa/Centro de Linguística da Universidade de Lisboa [2];
Spanish Corpus: © Universidad Autónoma de Madrid (Departamento de Lingüística, Laboratorio de Lingüística Informática) [3].
Each recorded session is stored in WAV files (Windows PCM, 22050 Hz, 16 bit). This multimedia corpus comes with the speech software WinPitch Corpus (© Pitch France, http://www.winpitch.com; minimal configuration: Pentium III, 1 GHz, 256 MB RAM, Sound Blaster or compatible sound card, running under Windows 2000 or XP only).
The C-ORAL-ROM corpus provides the acoustic source of each session together with the following main annotations:
- The orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks, in TXT files
- Session metadata, in CHAT and IMDI formats
- Synchronization of each transcribed utterance to the acoustic source, in XML files
Additional label files are also provided to allow multitask exploitation of the resource:
- the purely textual corpus in .TXT and .XML format;
- the PoS tagging of all texts and the corresponding frequency lists of lemmas and forms, in .TXT files;
- a set of linguistic measurements extracted from the main corpus annotations, in EXCEL files.
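The audio format stated above (Windows PCM, 22050 Hz, 16 bit) can be checked on any delivered speech file before processing. The following is a minimal sketch using Python's standard `wave` module. The function name is hypothetical, and the number of channels is not checked because the description does not state it.

```python
import wave

EXPECTED_RATE_HZ = 22050    # sampling rate stated in the description
EXPECTED_SAMPLE_BYTES = 2   # 16 bit = 2 bytes per sample


def matches_stated_format(path):
    """Return True if the WAV file at `path` is sampled at 22050 Hz with 16-bit samples."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES
        )
```

The function would be called once per session file with the path of the WAV file to check.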
The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from a quantitative and qualitative point of view. The resource has been designed for prosodic modelling, test bed procedures in HLT and corpus-based studies of spontaneous speech.
C-ORAL-ROM is oriented towards the collection of spontaneous speech corpora in natural environments, even though this necessarily lowers the acoustic quality of the resource. The recording conditions and the acoustic quality of the sessions collected in C-ORAL-ROM are variable. The quality scale extends from the highest level of clarity of the voice signal to low levels of acoustic quality.
The speech files of the acoustic database are defined on a quality scale (recording, volume, voice overlapping and noise).
The quality is gauged spectrographically and is always annotated in the metadata of each session together with the recording condition. Sessions in which F0 analysis is not significant are excluded from sampling.
- Digital recordings with DAT or minidisk apparatus and unidirectional microphones, or analogue recordings of very high quality
- Digital recordings with poorer microphone response, or analogue recordings with:
  - Good microphone response
  - Low background noise
  - Low percentage of overlapped utterances
  - F0 computing possible in most of the file
- Low quality analogue recordings with:
  - Poor microphone response
  - Background noise
  - Average percentage of overlapped utterances
  - F0 computing possible in many parts of the files
Spontaneous speech events are those communication events where the programming of speech is simultaneous with its execution by the speaker; i.e. the speech event is non-scripted or only partially scripted.
The corpus design of the C-ORAL-ROM resource aims to ensure the possibility of occurrence of a large variety of speech act typologies and natural prosodic contours, which are the most distinctive linguistic features of spontaneous speech. To this end, the following main variation parameters of the spoken domain are represented in a corpus design schema, covering a wide range of semantic and pragmatic domains of application.
- Language register:
  - Informal: un-scripted low variety of language, used for everyday interactive purposes
  - Formal: partially-scripted task-oriented high variety of language
- The means by which the signal transmission is achieved:
  - Face to face communication: speech event among participants in the same unity of space and time, with reciprocal direct multi-modal perception and interaction
  - Broadcasting: unidirectional speech emission to an undefined audience by devices that ensure, at least, the perception of voice
  - Telephone: bi-directional speech event by means of telephone
- Structure of the communication event: role and nature of the participants in the speech event.
  - Monologue: speech event with only one participant performing a main communication task
  - Dialogue: speech event with two participants
  - Conversation: speech event with more than two participants
  - Human-machine interaction: speech event between a human being and an electronic device
  - Non-natural format: other; i.e. the format of broadcasts (undefined in this resource)
- Social context: organization level of the society to which the speech event belongs.
  - Family/private: speech event within the family, or private social context
  - Public: speech event within a public social context
- Domains of application of the formal use of language:
  - Domains of the formal use of language in natural context: political speech; political debate; preaching; teaching; professional explanation; conference; business; law
  - Domains of application in broadcasts: news; sport; interviews; science; meteo (weather forecast); scientific press; reportage; talk show
The sampling strategy of the spontaneous speech domain adopted in C-ORAL-ROM is based on the representation of the different types of context of use. It is not balanced with regard to the speaker characteristics (geographical origin, age, sex, education, profession), which are however recorded in the metadata of each session.
The strategy varies significantly according to the language register of the samples. The definition of a finite list of typical domains of use is the main criterion applied in documenting the formal uses of the four Romance languages, while variation in dialogue structure and social context of use is the sampling criterion of the informal part. The choice of the specific semantic domain of use is left random in the informal sampling.
In fact, while it can be assumed that in western societies the formal use of language is applied in a closed series of typical domains, the same does not hold for the informal use of language. The list of possible domains of use for informal language is by definition open, and no domain can in principle be considered more typical than others. Under this assumption, the identification of the main domains of use of formal language maximizes the probability of representing the significant variations in this language variety, and is therefore the best strategy. By contrast, if significant variations of informal spontaneous speech are to be considered, the same strategy would reduce their probability of occurrence.
The strategy regarding text weight (text length) also differs between the formal and the informal use of language. The formal use of language generally features long textual structures, while in the informal use the length of syntactic constructions is limited. Therefore, in order to ensure the probability of occurrence of typical structures, the text length for the formal sampling must be significantly longer.
Corpus design matrix
The four language collections are comparable insofar as they fit the corpus design schema. More specifically, each language collection in the C-ORAL-ROM corpus is consistent with the following average structure:
- INFORMAL: 150000 words, from at least 64 texts of 1500 words each and 10 texts of 4500 words each
  - Family-Private context: 124500 words
    - Monologues: 42000 words
    - Dialogues-Conversations: 82500 words
  - Public context: 25500 words
    - Monologues: 6000 words
    - Dialogues-Conversations: 19500 words
- FORMAL: 150000 words
  - Formal in natural context: 2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 65000 words in total
    - political speech
    - political debate
    - professional explanation
  - Media context: 2 or 3 samples of 3000 words on average for each of the following typical domains of use, for 60000 words in total
    - news (small sample)
    - weather forecast (small sample)
    - scientific press
    - sport talk shows
    - political debate
    - thematic discussions
  - Telephone: 25000 words [4]
    - private conversations
    - human-machine interaction (10000 words) [5]
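The word budgets in the design matrix above can be cross-checked with a few lines of Python. This is an illustrative sketch only: the dictionary keys and variable names are assumptions, and the figures are copied from the matrix.

```python
# Word budgets copied from the corpus design matrix; names are illustrative.
informal = {
    "family_private": {"monologues": 42000, "dialogues_conversations": 82500},
    "public": {"monologues": 6000, "dialogues_conversations": 19500},
}
formal = {"natural_context": 65000, "media": 60000, "telephone": 25000}

# Sub-totals per social context, then totals per register.
informal_by_context = {ctx: sum(parts.values()) for ctx, parts in informal.items()}
informal_total = sum(informal_by_context.values())
formal_total = sum(formal.values())

# Minimum informal word count implied by the stated text counts and lengths
# (at least 64 texts of 1500 words and 10 texts of 4500 words).
informal_minimum_from_texts = 64 * 1500 + 10 * 4500

print(informal_by_context)           # {'family_private': 124500, 'public': 25500}
print(informal_total, formal_total)  # 150000 150000
print(informal_minimum_from_texts)   # 141000
```

The minimum implied by the text counts (141000 words) lies below the 150000-word informal target, which is consistent with the "at least" wording of the matrix.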
For each session a rich series of metadata is delivered in CHAT and IMDI formats, ensuring multitask exploitation of the resource for linguistics and human language technologies. The metadata contain essential information regarding the speakers, the recording situation, the topic, the acoustic quality and the source of the collected data.
The corpora are orthographically transcribed in a standard textual format (CHAT format; MacWhinney, 1994) with the annotation of speakers' turns. The textual string is divided into utterances. The main non-linguistic and paralinguistic acoustic events in the speech flow are reported in the transcripts.
The four Romance collections are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks are discriminated through perceptual judgments and reported in the transcripts. The level of inter-annotator agreement on the assignment of prosodic tags has been evaluated by an external institution (LOQUENDO, Turin; results available at http://lablita.dit.unifi.it/coralrom/loquendo).
The multimedia storage ensures a natural and meaningful text/sound correspondence for prosodic modelling, test bed procedures and corpus-based studies of spontaneous speech.
WinPitch Corpus is a software program for computer-aided alignment of large corpora. It provides a method for easy and precise selection of alignment units, ranging from syllables to whole sentences, in a hierarchical storing system of aligned data. Segments derived from alignment can be defined on 8 independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. Besides text-to-speech alignment, WinPitch Corpus, which is Unicode compliant, has numerous features for acoustical analysis of speech, such as real-time fundamental frequency tracking, spectrographic display and re-synthesis after editing of prosodic parameters.
The C-ORAL-ROM resource is stored in DVDs and is distributed in two forms:
- Publication by Benjamins Publishing Company, in the Studies in Corpus Linguistics series. In this form the resource is stored on one DVD only and is sold with a book describing the resource. On this DVD the speech files are compressed as MP3 files. Speech files and label files are also encrypted to prevent duplication and modification by the user. The files can be accessed only through the programs delivered on the DVD.
- Distribution through the European Language Resources Distribution Agency (ELDA, Paris). In this form the speech files are uncompressed and unencrypted WAV files. The resource is stored on 8 DVDs containing the Multimedia C-ORAL-ROM edition and an additional DVD9 containing the additional label files.
Prof. Emanuela Cresti
Co-ordinator of the C-ORAL-ROM project
University of Florence
Piazza Savonarola, 1
phone: +39 055 5032486
fax: +39 055 503247
1. University of Florence acknowledges that the source of the sound files in media recording was kindly provided by TECHE RAI, for the uses foreseen in the C-ORAL-ROM project only.
2. Fundação da Universidade de Lisboa acknowledges that the source of the sound files in media recordings was kindly provided by RTP2: RÁDIO E TELEVISÃO DE PORTUGAL; SGPS, S.A. ; RDP: RADIODIFUSÃO PORTUGUESA, AS. (Antena 1 and Antena 2); SIC SOCIEDADE INDEPENDENTE DE COMUNICAÇÃO, S.A.; RÁDIO NOTÍCIAS PRODUÇÕES E PUBLICIDADE, for the uses foreseen in the C-ORAL-ROM project only.
3. Universidad Autónoma de Madrid acknowledges that the source of the sound files in the media section of the Spanish sub-corpus was kindly provided by RTVE (Radio Televisión Española), Radio Televisión Madrid, COPE (Cadena de Ondas Populares Españolas/Radio Popular) and Onda Cero Radio, for the uses foreseen in the C-ORAL-ROM project only.
4. Text length not defined (by preference 1500 words upper limit, no lower limit).
5. Field not present in the Portuguese corpus. The texts in this field are not delivered aligned to the acoustic source.