Corpus assessment and metadata

 

The consistency of the Italian, French, Portuguese and Spanish samples delivered in the first year with the C-ORAL-ROM corpus design has been assessed both internally and externally (reciprocally by the C-ORAL-ROM partners, and by the Advisory and Assessment Board).

 

The goal of this work is to ensure comparability of the linguistic data in the multilingual set of corpora  and to allow their validation.

 

The following modifications to the C-ORAL-ROM corpus presented in the first annual report have been made:

 

§ The corpus target of the Italian, Spanish and Portuguese corpora delivered in year 1 has undergone minor revisions

§ The session metadata have been made consistent with current best practices

 

 

The corpora have been delivered internally, and a demo version is already public at http://lablita/~cromdemo/

 

 

 

 


 

Corpus target of  the C-ORAL-ROM catalogue

 

 

Informal

Sampling strategy: strict definition of the variation through social contexts of use and structure of the communication event. Open set of domains of application.

 

 

 

150.000 words - at least 74 texts (64 x 1.500 w. + 10 x 4.500 w.)

           

Family/Private  context 

124.500 words

Public context 

25.500 words

 - public

- scripted

+ public - scripted

 - public  + partially scripted

Family/Private context: Monologues 42.000 w.; Dialogues/Conversations 82.500 w.

Public context: Monologues 6.000 w.; Dialogues/Conversations 19.500 w.

                       

 

Formal

Sampling strategy: definition of a closed set of canonical domains of application

 

150.000 words – at least 42 texts of 3.000 w. + a sampling of phone calls

Formal in natural context   Formal in media context           Telephone
65.000 w.                   60.000 w.                         25.000 w.
+ public                    + public
+ partially scripted        + partially scripted
political speech            news                              private conversations
political debate            weather forecasts (meteo)         phone calls to call services (human-machine interaction)
preaching                   interviews
teaching                    reportage
professional explanation    scientific press
conference                  sport
business                    talk shows: political debate
law                         talk shows: thematic discussions
                            talk shows: culture
                            talk shows: science
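The word budgets of the corpus target can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the target above (the report writes 150.000 for 150000), while the dictionary layout and the names `INFORMAL`, `FORMAL` and `context_totals` are illustrative assumptions, not part of the C-ORAL-ROM deliverables.

```python
# Word budgets copied from the corpus target (150.000 in the report = 150000).
INFORMAL = {
    "family/private": {"monologues": 42_000, "dialogues/conversations": 82_500},
    "public": {"monologues": 6_000, "dialogues/conversations": 19_500},
}
FORMAL = {"natural context": 65_000, "media context": 60_000, "telephone": 25_000}


def context_totals(section):
    """Sum the monologue and dialogue budgets of each context."""
    return {context: sum(parts.values()) for context, parts in section.items()}


informal_by_context = context_totals(INFORMAL)
informal_total = sum(informal_by_context.values())
formal_total = sum(FORMAL.values())

# Words guaranteed by the minimum number of informal texts alone
# (64 texts of 1500 words plus 10 texts of 4500 words).
informal_floor = 64 * 1_500 + 10 * 4_500

print(informal_by_context)
print(informal_total, formal_total, informal_floor)
```

The per-context sums reproduce the 124.500 and 25.500 word figures, and both sections add up to 150.000 words. The minimum of 74 informal texts accounts for 141.000 words, which is consistent with the "at least" wording of the target.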

 

 

 

 

 

 

 

 

 

 


 

 

 

 

C-ORAL-ROM metadata.

 

The metadata of the C-ORAL-ROM resource are provided in both TXT and XML formats.

 

 

• The metadata in the header of each transcription can be mapped onto the IMDI Metadata for Session Description: http://www.mpi.nl/ISLE/documents/draft/ISLE_MetaData_2.5.pdf

 

 

 

All data regarding the C-ORAL-ROM resource as a whole, as detailed in the corpus structure, can be mapped onto the IMDI Metadata for Catalogue Description: http://www.mpi.nl/ISLE/documents/draft/IMDI_Catalogue_2.1.pdf

 

 

 

The C-ORAL-ROM metadata set is specified by the following definitions and rules.


 

Definitions and rules for  Metadata. 

 

 

Type label          Definition
@Title              One or two words that help to identify the text (in the object language)
@File:              Name of the file. The names of the audio file and the text file differ only in their extension.
@Participants:      Three capital letters identifying each speaker, followed by the speaker's first name and a sub-field with an ordered set of information about the speaker
@Date:              Day, month and year separated by slashes, e.g. 20/06/2001
@Place:             City of the recording
@Situation:         Ordered set of information separated by commas: genre, role of participants in the situation, place of recording, main action performed, recording conditions
@Topic              The main topic dealt with in the speech event (max 50 characters), e.g. problems with traffic
@Source:            Name of the collection leading to a copyright holder, e.g. CORPAIX; LABLITACORPUS; TECHE-RAI
@Class:             The set of fields describing the text class in accordance with the C-ORAL-ROM corpus target (separated by commas)
@Length:            Length of the transcribed audio file in minutes (’) and seconds (”), e.g. 12’ 15”
@Words:             Number of words in the text file
@Acoustic_quality   A, B or C in accordance with general criteria
@Transcriber:       Name of the person responsible for the text, who can provide further information
@Revisors           Names of the revisers
@Comments:          Transcriber's comments about the text
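A header that follows these definitions can be read mechanically. The sketch below assumes that each field sits on its own line as `@Label: value` (the colon is treated as optional, since some labels above are listed without one); the real header layout may differ. The sample header is hypothetical apart from the date, topic and length values, which are the examples given in the definitions, and the function names are illustrative.

```python
import re
from datetime import date, datetime

# Assumed layout: one "@Label: value" pair per line.
FIELD = re.compile(r"^\s*@(\w+):?\s*(.*)$")
# Minutes and seconds marked with typographic or straight quotes, e.g. 12’ 15”.
LENGTH = re.compile(r"(\d+)\s*[’']\s*(\d+)\s*[”\"]")


def parse_header(text):
    """Collect the @Label fields of a header into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def parse_date(value):
    """Read a @Date value written as day/month/year."""
    return datetime.strptime(value.strip(), "%d/%m/%Y").date()


def length_in_seconds(value):
    """Convert a @Length value such as 12’ 15” into seconds."""
    match = LENGTH.search(value)
    if match is None:
        raise ValueError(f"unrecognised length: {value!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


# Hypothetical header; only the Date, Topic and Length values come from the definitions.
SAMPLE = """\
@Title: traffic
@File: sample01
@Date: 20/06/2001
@Topic: problems with traffic
@Length: 12’ 15”
@Acoustic_quality: A
"""

header = parse_header(SAMPLE)
print(header["Topic"], parse_date(header["Date"]), length_in_seconds(header["Length"]))
```

Keeping the parsed values as strings and converting only the date and length on demand leaves the other fields free for the sub-field rules that follow.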

 

 

Rules for sub fields

 

Participant

Label                                     Definition
Sex                                       man or woman
Age                                       One capital letter: A (18-25); B (25-40); C (40-50); D (>60)
Education                                 One number: 1 (primary school or illiteracy); 2 (high school); 3 (graduates or university students)
Profession:                               Name of the profession
Role:                                     Role in the recorded event (even if it is the same as the profession)
Geographical origin/linguistic influence  Name of the region
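The coded values of the participant sub-field lend themselves to simple validation. The rules give the order and the codes but not the exact serialization, so the entry format used here, `MAR: Maria (woman, B, 3, teacher, customer, Toscana)`, is purely an assumption for illustration, as are the speaker and all the names in the code. The age bands and education levels are copied from the rules as written.

```python
import re
from dataclasses import dataclass

# Codes copied from the participant rules.
AGE_BANDS = {"A": "18-25", "B": "25-40", "C": "40-50", "D": ">60"}
EDUCATION = {
    "1": "primary school or illiteracy",
    "2": "high school",
    "3": "graduates or university students",
}
# Assumed layout: three capital letters, first name, sub-field in brackets.
ENTRY = re.compile(r"^([A-Z]{3}):\s*(.+?)\s*\((.*)\)\s*$")


@dataclass
class Participant:
    code: str
    name: str
    sex: str
    age: str
    education: str
    profession: str
    role: str
    origin: str

    @property
    def age_band(self):
        return AGE_BANDS[self.age]


def parse_participant(entry):
    """Parse one participant entry and check its coded values."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised participant entry: {entry!r}")
    code, name, info = match.groups()
    parts = [part.strip() for part in info.split(",")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 sub-field values, got {len(parts)}")
    sex, age, education = parts[:3]
    if sex not in ("man", "woman"):
        raise ValueError(f"invalid sex: {sex!r}")
    if age not in AGE_BANDS:
        raise ValueError(f"invalid age code: {age!r}")
    if education not in EDUCATION:
        raise ValueError(f"invalid education code: {education!r}")
    return Participant(code, name, *parts)


speaker = parse_participant("MAR: Maria (woman, B, 3, teacher, customer, Toscana)")
print(speaker.code, speaker.age_band, EDUCATION[speaker.education])
```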

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Rules for the Situation field

1 (Genre)

Information that helps to define the genre of the activity that characterizes the linguistic event (e.g. gossip, chat, quarrel, discussion, narration, claim, etc.). The neutral case is "talk". The information given in the Class field (dialogue, conversation, etc.) should not be repeated.

2 (Role of participants in the situation)

The reciprocal roles of the participants (e.g. friends, colleagues, relatives)

3 (Place of recording)

The place where the recording takes place (e.g. in the silent studio, in the street, at home, in a shop, at school, in the office, etc.)

4 (Main action performed)

Main action performed (if any)

5 (Recording conditions)

Status of the recording with respect to the “observer's paradox” in spontaneous speech resources. One label is chosen from each of the following two sets, separated by a comma: hidden / not hidden; researcher participant / researcher observer / researcher not present.

e.g. gossip between friends at home during dinner, not hidden, researcher participant

 

For media corpora the situation field is filled with  the name of the program.
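The recording-condition labels can be separated from the rest of the Situation field with a short routine. This sketch assumes, as in the example above, that the two labels are the last two comma-separated items and that the preceding text is a free description; a value without these labels (such as a program name in the media corpora) is returned unchanged. The function name and the program name `Evening News` are hypothetical.

```python
# Closed label sets from rule 5 (recording conditions).
VISIBILITY = {"hidden", "not hidden"}
RESEARCHER = {"researcher participant", "researcher observer", "researcher not present"}


def split_situation(value):
    """Split a Situation value into description and recording conditions."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) >= 3 and parts[-2] in VISIBILITY and parts[-1] in RESEARCHER:
        return {
            "description": ", ".join(parts[:-2]),
            "visibility": parts[-2],
            "researcher": parts[-1],
        }
    # No recognisable labels, e.g. a program name in the media corpora.
    return {"description": value.strip(), "visibility": None, "researcher": None}


spontaneous = split_situation(
    "gossip between friends at home during dinner, not hidden, researcher participant"
)
media = split_situation("Evening News")  # hypothetical program name
print(spontaneous)
print(media)
```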

 

 

Rules for the sub field acoustic quality

Texts in which F0 analysis is not significant are excluded from sampling

 

Label   Definition
A       Digital recordings (unidirectional microphone)
B       Analogue recordings: good microphone response; low background noise; low percentage of overlapped utterances; F0 computing possible in most of the file despite possible disturbing factors
C       Low-quality analogue recordings: mediocre microphone response; medium percentage of overlapped utterances; F0 computing possible in many parts of the file

 

Rules for the Sub field Class

 

 

Ordered set of sub-fields for text classification in the catalogue (separated by commas)

 

Class: informal
    Type: family/private
        Sub-type: monologue, dialogue, conversation
    Type: public
        Sub-type: monologue, dialogue, conversation

Class: formal
    Type: formal in natural context
        Sub-type: political speech, political debate, preaching, teaching, professional explanation, conference, business, law (through media)
    Type: media
        Sub-type: news, sport, interviews, meteo, scientific press, reportage, talk shows (political debate, thematic discussions, culture, science)
    Type: telephone
        Sub-type: private conversation, phone calls to call services

Sub-sub-type: monologue, dialogue, conversation
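Because the Class field is an ordered, comma-separated path through this classification, a value can be checked against a nested lookup. The sketch below is one possible reading, not the project's own tooling: the labels are copied from the rules, the kinds of talk show are not modelled separately, and treating the sub-sub-type as an optional fourth item on any class is an assumption, since the rules do not state where it applies. All names are illustrative.

```python
INTERACTION = {"monologue", "dialogue", "conversation"}

# Class -> type -> allowed sub-types, copied from the Class rules.
TAXONOMY = {
    "informal": {
        "family/private": INTERACTION,
        "public": INTERACTION,
    },
    "formal": {
        "formal in natural context": {
            "political speech", "political debate", "preaching", "teaching",
            "professional explanation", "conference", "business",
            "law (through media)",
        },
        "media": {
            "news", "sport", "interviews", "meteo", "scientific press",
            "reportage", "talk shows",
        },
        "telephone": {"private conversation", "phone calls to call services"},
    },
}


def validate_class(value):
    """Return a list of problems found in a Class value (empty if none)."""
    parts = [part.strip().lower() for part in value.split(",")]
    if len(parts) not in (3, 4):
        return [f"expected 3 or 4 sub-fields, got {len(parts)}"]
    cls, typ, sub = parts[:3]
    if cls not in TAXONOMY:
        return [f"unknown class: {cls!r}"]
    if typ not in TAXONOMY[cls]:
        return [f"type {typ!r} is not defined for class {cls!r}"]
    problems = []
    if sub not in TAXONOMY[cls][typ]:
        problems.append(f"sub-type {sub!r} is not defined for type {typ!r}")
    # Assumption: an optional fourth item carries the sub-sub-type.
    if len(parts) == 4 and parts[3] not in INTERACTION:
        problems.append(f"unknown sub-sub-type: {parts[3]!r}")
    return problems


print(validate_class("informal, family/private, dialogue"))
print(validate_class("informal, media, news"))
```

Returning a list of problems rather than raising makes it easy to report every inconsistent header in a batch of transcriptions at once.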

 

 

 

 

 
