UNIVERSITÀ DEGLI STUDI DI FIRENZE (Italian Department) - UNIFI
UNIVERSITÀ DEGLI STUDI DI SIENA (Communication Sciences Department) - UNISI
CNR (Pisa, Istituto di Linguistica Computazionale “Antonio Zampolli”) - CNR(ILC)
Prof. Massimo Moneglia, Università di Firenze - LABLITA, P.za Savonarola, 1
The semantic problem of action oriented verbs
In all language modalities, action verbs bear the basic information that must be processed in order to make sense of a sentence. Especially in speech, they are the most frequent structuring elements of discourse, but their semantics does not specify the action referred to. The most frequent action verbs are “general”, i.e. they can extend to actions belonging to different ontological types. Moreover, each language categorizes action in its own way, and cross-linguistic reference to everyday activities is therefore puzzling. For instance, in all occurrences of to take in events of the type “John takes the cat (from Mary)”, the English verb will be translated into Italian with prendere, while prendere will be translated into English with the verb to catch in all events of the type “Mario ha preso il gatto” (where the cat was running away), which are also in the extension of the Italian verb. See the table below, which identifies the variation of action types falling within the extension of the Italian, Spanish, French and English verbs that are considered to be in translation relation.
Action verbs are also puzzling for bilingual dictionaries, and they cause major problems for automatic translation and second language acquisition. No one-to-one correspondence can be established between action predicates in different languages, since the ontological entities referred to by action verbs are not identified and there is no guarantee that two predicates paired in a bilingual dictionary pick out the same entity. This problem is serious because action verbs are highly frequent both in speech and in all basic translation tasks, yet the above semantic relations cannot be predicted, since they require general ontological knowledge that is not available. Current approaches to translation, based on collocations and parallel surface language strings, lack this information. In other words, the semantic variation of general verbs is not due to language-specific phraseology, but is rather a consequence of the peculiar way each natural language categorizes events; i.e. it is a consequence of semantic factors. Moreover, both traditional dictionaries and modern ontologies contain only fragmentary information where the semantic variation of verbs is concerned, and there is no guarantee that users can find the action type they are interested in among the examples and definitions given there. More generally, the presence of general verbs in the lexicon of natural languages is one of the crucial reasons why acquiring the basic verbal lexicon is problematic in second language acquisition, especially in its early phases.
The cross-linguistic ontology of Action
Nevertheless, the application of general verbs to the action types in their extension is productive and should in principle be predictable. For instance, prendere and to catch will be applied to all occurrences of the above type, independently of the arguments involved in the event. The examples in the following table, which are derived from corpus occurrences, clearly show that all occurrences belonging to one type show a translation relation among the verbs applying to that type in the four languages. Translation is not a problem once the action types are identified and once a verb occurrence is assigned to one type. But the ontology of action is not available in any existing repository, and the actual variation of general verbs is unknown at both the intra- and cross-linguistic levels. Existing ontologies (WordNet, FrameNet, PropBank) can hardly be used for disambiguation. Where action concepts are concerned, they provide information that is even more fragmentary than for individuals. As a consequence, they have a low impact on processing and translation systems, since even common verbs cannot be processed.
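The point that translation reduces to a lookup once an occurrence is assigned to an action type can be illustrated with a minimal sketch. The type identifiers (`take_from_someone`, `catch_moving_entity`) and the function names are hypothetical and are not part of IMAGACT; only the take / prendere / catch pairings come from the examples above.

```python
# Hypothetical action-type identifiers; the verb pairings are the ones
# given in the text for English ("en") and Italian ("it").
ACTION_TYPES = {
    "take_from_someone": {"en": "take", "it": "prendere"},
    "catch_moving_entity": {"en": "catch", "it": "prendere"},
}


def translate(action_type, target_lang):
    """Return the verb that encodes the given action type in the target language."""
    return ACTION_TYPES[action_type][target_lang]


def types_in_extension(verb, lang):
    """Return, in alphabetical order, the action types a verb extends to in one language."""
    return sorted(
        type_id
        for type_id, verbs in ACTION_TYPES.items()
        if verbs.get(lang) == verb
    )


# The Italian verb covers both types, so it has no single English equivalent;
# each type taken on its own does.
print(types_in_extension("prendere", "it"))
print(translate("catch_moving_entity", "en"))
```

In this toy table the ambiguity lives at the level of the verb, not of the action type, which is why assigning an occurrence to a type is the step that makes the translation predictable.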
IMAGACT aims to set up an infrastructure that can strongly reduce the present theoretical limits to Natural Language Understanding as regards reference to action at both the intra- and cross-linguistic levels, by providing a Cross-linguistic Action Ontology that specifies unambiguously the range of ontological variation of action verbs in different languages, thereby allowing their productive translation. The information concerning the relationship between actions as ontological entities and their cross-linguistic lexical encoding can be derived from available resources. Spontaneous speech corpora contain references to the most frequent actions in everyday life and their lexical encoding. The actions most frequently performed in our everyday environment can therefore be identified by observing the reference of high-frequency verbs in spontaneous speech corpora.
IMAGACT will use corpus-based and competence-based methodologies to bootstrap simultaneously both the action types and their linguistic encoding from Italian and English spoken corpora. The key IMAGACT strategy focuses on the semantic annotation of verb occurrences. The annotation simultaneously identifies the action types referred to and their linguistic encoding in each implemented language, with the full range of syntactic and statistical information derived from corpora. According to preliminary studies, this multilingual corpus will make it possible to bootstrap, from the actual use of language represented in corpora, an Ontology Data Base of roughly 3000 high-frequency action types mapped onto around 500 lexical entries per language.
Action type entries as prototypic scenes
Experience in ontology building has shown that the level of consensus that can be reached in defining entities is very low. Definitions are highly underdetermined, since they depend on granularity. Crucially, in common approaches the identification of an ontological entry and its definition coincide, since identification relies on definition. Identifying an interlinguistic set of action types through definitions agreed upon by annotators working on corpora of different languages could therefore be considered hopeless.
The key innovation for the simultaneous bootstrapping of a language-independent ontology from resources in different languages is to identify action entries in the ontology by non-linguistic means. An innovative, language-independent process of information extraction will identify the action types in terms of prototypic scenes. In IMAGACT, each action type is identified through a prototypic scene in a Wittgenstein-like scenario. Competent speakers can easily judge whether verb occurrences are instances of that type, independently of their language, thus reconciling in a single ontological entry all possible definitions, regardless of their different levels of granularity and precision.
From scenes to language
Corpus annotation and scene production will result in an Inter-linguistic Action Ontology DB, which will allow the action types found in corpora to be mapped onto the predicates identifying those entries in English and Italian. By selecting the scene for an action type, users will find out which predicate is the right one for an action in a second language without going through a process of translation from their own language. The DB will allow the IMAGACT infrastructure to be extended through competence-based judgments. Once the relevant action types are identified, competent speakers of languages other than English and Italian will recognize which predicate in their language is the right one for the action type in question. IMAGACT will implement Chinese (Mandarin) and Spanish (as, in principle, any other language) in the DB without going through the expensive process of corpus annotation.
The Ontology derives from speech and contains a huge amount of information on actions that is relevant for ambient intelligence and that, from a different perspective, will ground the modeling of artificial systems designed to interact in natural environments on the basis of natural language instructions. IMAGACT, however, will be tested in language acquisition and assisted translation scenarios. The DB will be delivered as an Internet service providing the translation of action concepts. Given a verb in the source text, the service will return, in order of probability, the set of ontological entities representing the possible variation of the original verb, each one associated with the lemma corresponding to that entity in the selected language.
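The lookup described above (a verb in, a probability-ordered list of action types with their target-language lemmas out) could be sketched as follows. This is only an illustration under stated assumptions, not the project's service: the record layout, the type identifiers, the function name `lookup` and, above all, the counts are invented placeholders, and probability is assumed here to be a simple relative frequency over the candidate types.

```python
# Toy ontology records. The identifiers are hypothetical and the "count"
# values are made-up placeholders, NOT corpus figures from IMAGACT.
ONTOLOGY = [
    {"id": "take_from_someone", "lemmas": {"en": "take", "it": "prendere"}, "count": 30},
    {"id": "catch_moving_entity", "lemmas": {"en": "catch", "it": "prendere"}, "count": 10},
]


def lookup(verb, source_lang, target_lang):
    """Return (type id, target lemma, probability) tuples, most probable first.

    Probability is assumed to be the relative frequency of each action type
    among all types in the extension of the source verb.
    """
    candidates = [e for e in ONTOLOGY if e["lemmas"].get(source_lang) == verb]
    total = sum(e["count"] for e in candidates)
    ranked = sorted(candidates, key=lambda e: e["count"], reverse=True)
    return [
        (e["id"], e["lemmas"][target_lang], e["count"] / total)
        for e in ranked
    ]


for type_id, lemma, probability in lookup("prendere", "it", "en"):
    print(f"{type_id}: {lemma} ({probability:.2f})")
```

A verb that is not in the toy ontology simply yields an empty list, so the caller can distinguish "no known action type" from a ranked set of candidates.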
Last modified 10 February 2012, 13:16