Corpus assessment and metadata
The consistency of the Italian, French, Portuguese and Spanish sampling delivered in the first year with the C-ORAL-ROM corpus design has been the object of internal and external assessment (respectively by the C-ORAL-ROM partners and by the Advisory and Assessment board).
The goal of this work is to ensure the comparability of the linguistic data in the multilingual set of corpora and to allow their validation.
The following modifications to the C-ORAL-ROM corpus presented in the first annual report have been made:
• The corpus target of the Italian, Spanish and Portuguese corpora delivered in year 1 has been the object of minor revisions
• The metadata of sessions have been made consistent with present best practices
The corpora have been delivered internally but are already public in a demo version at http://lablita/~cromdemo/
Corpus target of the C-ORAL-ROM catalogue
Informal
Sampling strategy: strict definition of the variation through social contexts of use and structure of the communication event. Open set of domains of application.
150.000 words - at least 74 texts (64 x 1.500 w. + 10 x 4.500 w.)
| Family/Private context: 124.500 words | Public context: 25.500 words |
| - public - scripted | + public - scripted - public + partially scripted |
| Monologues: 42.000 w.; Dialogues/Conversations: 82.500 w. | Monologues: 6.000 w.; Dialogues/Conversations: 19.500 w. |
Formal
Sampling strategy: definition of a closed set of canonical domains of application.
150.000 words - at least 42 texts of 3.000 w. + a sampling of phone calls
| Formal in natural context: 65.000 w. | Formal in media context: 60.000 w. | Telephone: 25.000 w. |
| + public + partially scripted | + public + partially scripted | |
| political speech; political debate; preaching; teaching; professional explanation; conference; business; law | news; meteo; interviews; reportage; scientific press; sport; talk shows (political debate, thematic discussions, culture, science) | private conversations; phone calls to call services (man-machine interaction) |
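The word budgets of the two corpus targets can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check: the figures are copied from the tables above, while the variable and function names are illustrative and not part of the C-ORAL-ROM specification.

```python
# Word budgets copied from the informal and formal corpus targets.
# All names below are illustrative, not part of C-ORAL-ROM.
INFORMAL = {
    "family/private": {"monologues": 42_000, "dialogues/conversations": 82_500},
    "public": {"monologues": 6_000, "dialogues/conversations": 19_500},
}
FORMAL = {
    "natural context": 65_000,
    "media": 60_000,
    "telephone": 25_000,
}


def context_totals(design):
    """Sum the word budget of each context in the informal design."""
    return {context: sum(parts.values()) for context, parts in design.items()}


informal_totals = context_totals(INFORMAL)
informal_total = sum(informal_totals.values())
formal_total = sum(FORMAL.values())

print(informal_totals)               # per-context budgets of the informal part
print(informal_total, formal_total)  # overall budgets of the two parts
```

Both parts add up to the declared 150.000 words, and the informal sub-totals match the 124.500 and 25.500 words given for the two contexts.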
C-ORAL-ROM metadata.
The metadata of the C-ORAL-ROM resource are provided in both txt and xml format.
• The metadata in the header of each transcription can be mapped onto the IMDI Metadata for Session Description: http://www.mpi.nl/ISLE/documents/draft/ISLE_MetaData_2.5.pdf
• All data regarding the C-ORAL-ROM resource as a whole, as detailed in the corpus structure, can be mapped onto the IMDI Metadata for Catalogue Description: http://www.mpi.nl/ISLE/documents/draft/IMDI_Catalogue_2.1.pdf
• The C-ORAL-ROM collection of metadata is detailed in a set of definitions and rules
Definitions and rules for Metadata.
| Type label | Definition |
| @Title | One or two words that help to identify the text (in the object language) |
| @File: | Name of the file. The names of the audio file and the text file differ only in extension. |
| @Participants: | Three capital letters identifying each speaker, followed by the corresponding proper name (first name) plus a sub-field with an ordered set of information about the speaker. |
| @Date: | Day, month and year separated by slashes; e.g. 20/06/2001 |
| @Place: | City of the recording |
| @Situation: | Ordered set of information separated by commas: genre, role of participants in the situation, place of recording, main action performed, recording conditions |
| @Topic | The main subject dealt with in the speech event (max 50 characters); e.g. problems with traffic |
| @Source: | Name of the collection leading to a copyright holder; e.g. CORPAIX; LABLITACORPUS; TECHE-RAI |
| @Class: | The set of fields about the text class, in accordance with the C-ORAL-ROM corpus target (separated by commas) |
| @Length: | Length of the transcribed audio file in minutes (’) and seconds (”); e.g. 12’ 15” |
| @Words: | Number of words in the text file |
| @Acoustic_quality | A, B or C in accordance with general criteria |
| @Transcriber: | Name of the person responsible for the text, who can provide further information |
| @Revisors | Names of the revisers |
| @Comments: | Transcriber's comments about the text |
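As an illustration of how a header following these definitions could be read by a program, here is a minimal sketch. It assumes that each field sits on its own line as `@Label: value`, which is an assumption about the file layout rather than something stated above. The sample header is invented, and `parse_header`, `length_in_seconds` and `check_header` are hypothetical helper names.

```python
import re
from datetime import datetime

# Hypothetical header, invented for illustration; only the field labels
# and the value formats come from the definitions above.
SAMPLE_HEADER = """\
@Title: Traffic
@File: example01
@Date: 20/06/2001
@Place: Firenze
@Topic: problems with traffic
@Length: 12' 15"
@Words: 1500
@Acoustic_quality: A
"""

# Assumed layout: '@Label', an optional colon, then the value.
FIELD_LINE = re.compile(r"^@(\w+):?\s*(.*)$")


def parse_header(text):
    """Return a dict mapping field labels (without '@') to raw string values."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def length_in_seconds(value):
    """Convert a @Length value such as 12' 15" to a number of seconds."""
    # Accept both straight (' ") and typographic (’ ”) marks.
    match = re.fullmatch(r"""(\d+)\s*['’]\s*(\d+)\s*["”]""", value.strip())
    if match is None:
        raise ValueError(f"unrecognised length: {value!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def check_header(fields):
    """Return a list of problems found in the parsed fields."""
    problems = []
    try:
        datetime.strptime(fields.get("Date", ""), "%d/%m/%Y")
    except ValueError:
        problems.append("Date is not in dd/mm/yyyy form")
    if len(fields.get("Topic", "")) > 50:
        problems.append("Topic is longer than 50 characters")
    if fields.get("Acoustic_quality") not in {"A", "B", "C"}:
        problems.append("Acoustic_quality is not A, B or C")
    return problems


fields = parse_header(SAMPLE_HEADER)
print(fields["Title"], length_in_seconds(fields["Length"]), check_header(fields))
```

The checks cover only the constraints that the definitions state explicitly (date format, topic length, quality label); everything else is kept as a raw string.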
Rules for sub-fields
Participant
| Label | Definition |
| Sex | man or woman |
| Age | One capital letter: A (18-25); B (25-40); C (40-50); D (>60) |
| Education | One number: 1 (primary school or illiteracy); 2 (high school); 3 (graduates or university students) |
| Profession: | Name of the profession |
| Role: | Role in the recorded event (even if it is the same as the profession) |
| Geographical origin/linguistic influence | Name of the region |
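The participant sub-fields lend themselves to a simple validity check. The sketch below assumes an entry of the form `ABC, Name, (sex, age, education, profession, role, origin)`. The exact punctuation is an assumption, as the rules above only fix the order of the information. The sample entry is invented, and `Participant` and `parse_participant` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Code tables copied from the rules above.
AGE_CLASSES = {"A": "18-25", "B": "25-40", "C": "40-50", "D": ">60"}
EDUCATION_LEVELS = {
    "1": "primary school or illiteracy",
    "2": "high school",
    "3": "graduated or university students",
}


@dataclass
class Participant:
    code: str
    name: str
    sex: str
    age: str
    education: str
    profession: str
    role: str
    origin: str


# Assumed layout: three capitals, first name, parenthesised sub-fields.
ENTRY = re.compile(r"^([A-Z]{3}),?\s*([^,(]+?),?\s*\((.*)\)$")


def parse_participant(entry):
    """Parse one participant entry and validate its coded sub-fields."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised participant entry: {entry!r}")
    code, name, subfields = match.groups()
    parts = [part.strip() for part in subfields.split(",")]
    if len(parts) != 6:
        raise ValueError("expected six sub-fields")
    sex, age, education, profession, role, origin = parts
    if sex not in {"man", "woman"}:
        raise ValueError(f"unknown sex label: {sex!r}")
    if age not in AGE_CLASSES:
        raise ValueError(f"unknown age class: {age!r}")
    if education not in EDUCATION_LEVELS:
        raise ValueError(f"unknown education level: {education!r}")
    return Participant(code, name.strip(), sex, age, education,
                       profession, role, origin)


# Invented example entry.
speaker = parse_participant("MAR, Maria, (woman, B, 3, teacher, participant, Toscana)")
print(speaker.code, AGE_CLASSES[speaker.age], EDUCATION_LEVELS[speaker.education])
```

Profession, role and origin are free text in the rules, so the sketch validates only the three coded sub-fields.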
Rules for the Situation field
| 1 (Genre) | Information that helps to define the genre of activity that characterizes the linguistic event, e.g. gossip, chat, quarrel, discussion, narration, claim, etc. The neutral case is "talk". The information of the Class field (dialogue, conversation, etc.) should not be repeated. |
| 2 (Role of participants in the situation) | The reciprocal roles of the participants (e.g. friends, colleagues, relatives) |
| 3 (Place of recording) | The place where the recording takes place (e.g. in the silent studio, in the street, at home, in a shop, at school, in the office, etc.) |
| 4 (Main action performed) | Main action performed (if any) |
| 5 (Recording conditions) | Status of the recording with respect to the "observer's paradox" in spontaneous speech resources. One choice from each of the following two sets of labels, separated by a comma: hidden / not hidden; researcher participant / researcher observer / researcher not present. E.g. gossip between friends at home during dinner, not hidden, researcher participant. For media corpora the Situation field is filled with the name of the program. |
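Since the recording conditions are always the last two comma-separated labels of the Situation field, they can be separated from the descriptive part by splitting from the right. The sketch below shows one way to do this for non-media texts. `split_situation` is a hypothetical helper, and the sample value is an adaptation of the example above with every sub-field separated by a comma.

```python
# Closed label sets copied from the rule for recording conditions.
VISIBILITY = {"hidden", "not hidden"}
RESEARCHER = {
    "researcher participant",
    "researcher observer",
    "researcher not present",
}


def split_situation(value):
    """Split a @Situation value into its description and recording conditions.

    Not meant for media corpora, where the field holds the program name.
    """
    # The last two comma-separated items are the recording conditions.
    parts = [part.strip() for part in value.rsplit(",", 2)]
    if len(parts) != 3:
        raise ValueError("expected a description and two recording-condition labels")
    description, visibility, researcher = parts
    if visibility not in VISIBILITY:
        raise ValueError(f"unknown visibility label: {visibility!r}")
    if researcher not in RESEARCHER:
        raise ValueError(f"unknown researcher label: {researcher!r}")
    return {
        "description": [item.strip() for item in description.split(",")],
        "visibility": visibility,
        "researcher": researcher,
    }


# Adapted example: genre, roles, place, action, then the two condition labels.
situation = split_situation(
    "gossip, friends, at home, dinner, not hidden, researcher participant"
)
print(situation)
```

Splitting from the right keeps the check usable even when the descriptive part is written as running text rather than as four comma-separated items.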
Texts in which F0 analysis is not significant are excluded from sampling.
| Label | Definition |
| A | Digital recordings (unidirectional microphone) |
| B | Analogue recordings: good microphone response; low background noise; low percentage of overlapped utterances; F0 computing possible in most of the file despite possible disturbing factors |
| C | Low-quality analogue recordings: mediocre microphone response; mid percentage of overlapped utterances; F0 computing possible in many parts of the files |
Ordered set of sub-fields for text classification in the catalogue (separated by commas)
| Class | Type | Sub-type | Sub-sub-type |
| informal | family/private; public | monologue; dialogue; conversation | |
| formal | formal in natural context | political speech; political debate; preaching; teaching; professional explanation; conference; business; law (through media) | monologue; dialogue; conversation |
| formal | media | news; sport; interviews; meteo; scientific press; reportage; talk shows; political debate; thematic discussions; culture; science | |
| formal | telephone | private conversation; phone calls to call services | |
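The classification above is a closed taxonomy, so a @Class value can be checked mechanically. The following sketch assumes the value is written as `class, type, sub-type` with an optional fourth item. The label spellings are taken from the table, but the nesting chosen for `TAXONOMY` and the name `validate_class` are illustrative assumptions.

```python
# Labels shared by the informal sub-types and the sub-sub-type level.
INTERACTION = {"monologue", "dialogue", "conversation"}

# Assumed nesting: class -> type -> allowed sub-types.
TAXONOMY = {
    "informal": {
        "family/private": INTERACTION,
        "public": INTERACTION,
    },
    "formal": {
        "formal in natural context": {
            "political speech", "political debate", "preaching", "teaching",
            "professional explanation", "conference", "business",
            "law",  # the "(through media)" note in the table is dropped here
        },
        "media": {
            "news", "sport", "interviews", "meteo", "scientific press",
            "reportage", "talk shows", "political debate",
            "thematic discussions", "culture", "science",
        },
        "telephone": {"private conversation", "phone calls to call services"},
    },
}


def validate_class(value):
    """Return the sub-fields of a @Class value, or raise ValueError."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) not in (3, 4):
        raise ValueError("expected three or four comma-separated sub-fields")
    text_class, text_type, sub_type = parts[:3]
    if text_class not in TAXONOMY:
        raise ValueError(f"unknown class: {text_class!r}")
    types = TAXONOMY[text_class]
    if text_type not in types:
        raise ValueError(f"unknown type for {text_class}: {text_type!r}")
    if sub_type not in types[text_type]:
        raise ValueError(f"unknown sub-type for {text_type}: {sub_type!r}")
    # Assumption: an optional fourth item is a sub-sub-type label.
    if len(parts) == 4 and parts[3] not in INTERACTION:
        raise ValueError(f"unknown sub-sub-type: {parts[3]!r}")
    return parts


print(validate_class("informal, family/private, dialogue"))
print(validate_class("formal, media, news"))
```

A value such as `informal, media, news` is rejected, because `media` is a type of the formal class only.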