Compact open information extraction: methods, corpora, analysis

Gashteovski, Kiril


URN: urn:nbn:de:bsz:180-madoc-598136
Document Type: Doctoral dissertation
Year of publication: 2020
Place of publication: Mannheim
University: Universität Mannheim
Evaluator: Gemulla, Rainer
Date of oral examination: 17 March 2021
Publication language: English
Institution: School of Business Informatics and Mathematics > Praktische Informatik I (Gemulla 2014-)
Subject: 004 Computer science, internet
Keywords (English): Computer science, artificial intelligence, natural language processing, information extraction
Abstract: Most existing data is stored in unstructured textual formats, which makes its subsequent processing by computers more difficult. The Open Information Extraction (OpenIE) paradigm aims at structuring the knowledge contained in text into more machine-readable formats. An OpenIE system (usually) extracts triples of the form (“subject”; “relation”; “object”) from natural language text in an unsupervised manner, without having predefined relations. OpenIE extractions are used for improving deeper language-understanding tasks, including knowledge base (KB) population, link prediction and text comprehension. A common problem for such systems is that they often extract triples which contain unnecessarily detailed constituents. For instance, the phrases “the great Richard Feynman” and “Richard Feynman” have the same meaning, but the first phrase contains redundant words (“the” and “great”) that do not alter the meaning of the head phrase “Richard Feynman”. Such redundant words pose difficulties for using OpenIE in downstream tasks, such as linking entities for KB population. In this thesis, we propose MinIE, an OpenIE system which aims to remove words from the triples that are considered to be overly specific without damaging the triple’s semantics. The methods proposed in MinIE are domain independent and could in principle be integrated into any other OpenIE system. OpenIE extractions are most useful when they are available in large quantities. Our second contribution, therefore, is OPIEC, which is the largest publicly available OpenIE corpus to date (containing 341M triples). OPIEC was constructed from the entire English Wikipedia and it contains the links found in the Wikipedia articles, thus reducing ambiguity in certain cases. Such OpenIE triples with unambiguous arguments are useful for bootstrapping OpenIE extractors as well as for downstream tasks such as KB population. Our final contribution is an analysis of OPIEC. Such analysis is difficult to perform due to the openness and ambiguity of OpenIE extractions. Therefore, we compared the content of OPIEC with reference KBs (DBpedia and YAGO), which are not ambiguous and are also constructed from Wikipedia. Our analysis is (mostly) manual and reveals findings about semantic relatedness between OpenIE corpora and KBs, which are important for downstream tasks such as KB population (e.g., the study suggests that most knowledge found in OpenIE triples is relevant for the current KBs but is not present in them).
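Illustration (not part of the thesis): the abstract's “the great Richard Feynman” example can be made concrete with a toy sketch of a triple and a word-dropping step. This is not MinIE's actual method. The `Triple` class, the two word lists, and both function names are hypothetical and were chosen only for this illustration; a real system would decide what to drop from linguistic analysis rather than from fixed lists.

```python
from dataclasses import dataclass

# Hypothetical word lists, assumed only for this illustration.
DETERMINERS = {"the", "a", "an"}
DROPPABLE_MODIFIERS = {"great", "famous"}


@dataclass(frozen=True)
class Triple:
    """An OpenIE-style triple: (subject; relation; object)."""
    subject: str
    relation: str
    obj: str


def minimize_phrase(phrase: str) -> str:
    """Drop words found in the toy lists; keep the phrase if nothing would remain."""
    droppable = DETERMINERS | DROPPABLE_MODIFIERS
    kept = [w for w in phrase.split() if w.lower() not in droppable]
    return " ".join(kept) if kept else phrase


def minimize_triple(triple: Triple) -> Triple:
    """Minimize the subject and object; the relation is left untouched here."""
    return Triple(
        minimize_phrase(triple.subject),
        triple.relation,
        minimize_phrase(triple.obj),
    )


example = Triple("the great Richard Feynman", "was born in", "New York City")
minimized = minimize_triple(example)
print(minimized)
```

In this sketch the subject shrinks to the head phrase “Richard Feynman”, which is the form an entity linker would more readily match against a KB entry.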

This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.
