Start year 2011 End year 2014

Institutions

Implementing institutions

Persons

Project leaders + contact persons Staff
Country code Austria Language code German, English
Keywords (German) Mehrwortkombinationen (multi-word combinations)
Keywords (English) lexicology, corpus linguistics, computational linguistics, semantics and cognitive semantics, natural language processing
Abstract

A recurring problem in modern lexicography, as well as in the development of machine translation tools, information retrieval, information extraction, intelligent human-computer interfaces, question answering, bioinformatics and applied linguistics, is the identification of the prototypical senses of polysemous words, the degree of sense distinctiveness and the structure of the lexical network. The goal of the project is to explore a method of solving such problems by combining lexical semantics and computational linguistics. The emphasis is on the cognitive approach to sense identification and on the computational processing of language.

With the help of a comprehensive corpus-based analysis, the first stage of the project has already completed a cognitive and sociosemantic breakdown of the many senses of the polysemous English word look (as a noun, verb and multi-word expression), building on the work of various authors, most notably Gries (2006). Stage two of the project envisages a system based on the computational linguistic technology developed so far (namely the CASIS 1.1 corpus tool), aimed at a solution to word sense disambiguation. The key elements of this study, ID tags, are sets of semantic and syntactic patterns surrounding a given sense of a given word. An ID tag is defined as “syntactic or lexical markers in the citations which point to a particular dictionary sense of the word” (Atkins, 1987). ID tags are distinguished according to whether they are categorically or probabilistically linked to a particular sense, and whether the link with the given word is direct or indirect (that is, through other words). The research done so far indicates that the predictive power of some ID tags is fairly high.
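The distinction between categorical and probabilistic links can be illustrated with a minimal sketch. It assumes that each annotated citation is available as a pair of a sense label and a list of ID tags, and it estimates how strongly a tag points to a sense by simple relative frequency. The function names, the tag labels (such as prep:for) and the sense labels (search, appear) are invented for illustration; they are not taken from the project or from CASIS.

```python
from collections import Counter, defaultdict


def tag_sense_probabilities(citations):
    """Estimate P(sense | tag) by relative frequency.

    `citations` is an iterable of (sense, tags) pairs; this input format
    is an assumption made for the sketch.
    """
    tag_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sense, tags in citations:
        for tag in set(tags):  # count each tag once per citation
            tag_counts[tag] += 1
            pair_counts[tag][sense] += 1
    return {
        tag: {sense: n / tag_counts[tag] for sense, n in senses.items()}
        for tag, senses in pair_counts.items()
    }


def link_type(distribution):
    """Label a tag 'categorical' if it occurs with one sense only,
    otherwise 'probabilistic'."""
    return "categorical" if len(distribution) == 1 else "probabilistic"


# Toy annotations, invented for illustration only.
citations = [
    ("search", ["prep:for", "subj:human"]),
    ("search", ["prep:for", "subj:human"]),
    ("appear", ["comp:adjective", "subj:human"]),
    ("appear", ["comp:adjective", "subj:concrete"]),
]

probs = tag_sense_probabilities(citations)
for tag, distribution in sorted(probs.items()):
    print(tag, link_type(distribution), distribution)
```

In this toy data the tag prep:for occurs with one sense only and is therefore labelled categorical, while subj:human is shared between two senses and is labelled probabilistic. With real data, a tag seen in only a handful of citations would look categorical by chance, so some minimum frequency threshold would be needed.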

The corpus analysis completed during stage one of the project is based on 18,000 random citations containing look, extracted from the Corpus of Contemporary American English (400 million words) and the Google Books Corpus (155 billion words), which were chosen as the basis of the analysis for their representativeness, both in sources and in the number of lexical items (Dobrić, 2009b). Within this stage, all of the extracted citations have also been analyzed for the ID tags accompanying each of the previously identified senses. The ID tags examined are as follows:

(1) morphological features of the given word form:

(a) for verbs: tense, aspect, and voice;

(b) for nouns: singular, plural; countable, uncountable, possessive form; abstract vs. concrete; animate vs. inanimate;

(2) the syntactic properties of:

(a) the given word;

(b) the clause the given word form occurs in: intransitive vs. transitive vs. complex transitive use of verbs, declarative vs. interrogative vs. imperative sentence form, main clause vs. subordinate clause;

(3) semantic characteristics of:

(a) the given word;

(b) the referents of the elements co-occurring with the given word: its subjects/heads, objects and complements (i.e. as human, animate, concrete countable objects, concrete mass nouns, machines, abstract entities, organizations/institutions, locations, quantities, events, processes etc.);

*(4) Twitter knowledge base data: still in the development stage, information from a Twitter account is meant to be used as an ID tag in order to address the problem of data sparseness;

(5) the instance’s L1-L2 and R1-R2 collocates in the same clause;

(6) the given word’s meaning in the citation.
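Item (5) in the list above can be sketched as follows. The sketch assumes that the clause containing the given word has already been segmented and tokenized, so that the clause boundary is simply the boundary of the token list. The function name clause_collocates and the example clause are hypothetical.

```python
def clause_collocates(clause_tokens, index, window=2):
    """Return the L1..Ln and R1..Rn collocates of clause_tokens[index].

    Positions that would fall outside the clause are reported as None,
    so collocates are never taken from a neighbouring clause.
    """
    result = {}
    for offset in range(1, window + 1):
        left = index - offset
        right = index + offset
        result[f"L{offset}"] = clause_tokens[left] if left >= 0 else None
        result[f"R{offset}"] = (
            clause_tokens[right] if right < len(clause_tokens) else None
        )
    return result


# Invented example clause; the target word is "look".
clause = ["she", "did", "not", "look", "at", "him"]
print(clause_collocates(clause, clause.index("look")))
```

Returning None for missing positions keeps the four slots aligned across citations, which makes the collocates easier to compare as ID tags.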

The next, and main, part of the project is to pair the identified senses with their respective ID tags and then match them against a tagged corpus (which is also still to be constructed), whose tagging is meant to reflect the structure of the ID tags. The final version of the CASIS program should be a small computer system able to identify a particular sense of a given word on the basis of the previously determined ID tags that characterize it, and to do so in any corpus with relatively basic tagging.
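The matching step described above might look like the following minimal sketch. It is not the CASIS implementation, whose design is not described here. It assumes that every sense has a profile of ID tags with numeric weights (for instance, estimated predictive power) and that the ID tags observed in a citation are available as a set. All names, tags and weights are invented for illustration.

```python
def identify_sense(observed_tags, sense_profiles):
    """Return the sense whose profile best matches the observed ID tags.

    Each sense is scored by summing the weights of its profile tags that
    occur in the citation. Returns None if no profile tag is observed.
    """
    scores = {
        sense: sum(w for tag, w in profile.items() if tag in observed_tags)
        for sense, profile in sense_profiles.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    return best


# Hypothetical sense profiles for "look"; the weights are made up.
sense_profiles = {
    "search": {"R1:for": 0.9, "subj:human": 0.4},
    "appear": {"comp:adjective": 0.8, "subj:human": 0.3},
}

print(identify_sense({"R1:for", "subj:human"}, sense_profiles))
print(identify_sense({"comp:adjective"}, sense_profiles))
```

Summing weights is only one possible scoring rule. A categorical ID tag could instead decide the sense outright, with probabilistic tags consulted only when no categorical tag is present.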

Main category(ies) Educational content (subject area)
Information, communication, statistics
Teaching and learning (processes and methods)