Compact open information extraction: methods, corpora, analysis


Gashteovski, Kiril


PDF: thesis-kiril-gashteovski-final.pdf - Published version (4MB)

URL: https://madoc.bib.uni-mannheim.de/59813
URN: urn:nbn:de:bsz:180-madoc-598136
Document type: Dissertation
Year of publication: 2020
Place of publication: Mannheim
University: Universität Mannheim
Reviewer: Gemulla, Rainer
Date of oral examination: 17 March 2021
Language of publication: English
Institution: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Practical Computer Science I: Data Analytics (Gemulla 2014-)
Subject: 004 Computer science
Keywords (English): computer science, artificial intelligence, natural language processing, information extraction
Abstract: Most existing data is stored in unstructured textual formats, which makes its subsequent processing by computers more difficult. The Open Information Extraction (OpenIE) paradigm aims at structuring the knowledge that is contained in text into more machine-readable formats. An OpenIE system (usually) extracts triples of the form (“subject”; “relation”; “object”) from natural language text in an unsupervised manner, without having predefined relations. OpenIE extractions are used for improving deeper language-understanding tasks, including KB population, link prediction and text comprehension. A common problem for such systems is that they often extract triples which contain unnecessarily detailed constituents. For instance, the phrases “the great Richard Feynman” and “Richard Feynman” have the same meaning, but the first phrase contains redundant words (“the” and “great”) that do not alter the meaning of the head phrase “Richard Feynman”. Such redundant words pose difficulties for using OpenIE in downstream tasks, such as linking entities for KB population. In this thesis, we propose MinIE, an OpenIE system which aims to remove words from the triples that are considered to be overly specific without damaging the triple’s semantics. The methods proposed in MinIE are domain-independent and could in principle be integrated into any other OpenIE system. OpenIE extractions are most useful when they are available in large quantities. Our second contribution, therefore, is OPIEC, which is the largest publicly available OpenIE corpus to date (containing 341M triples). OPIEC was constructed from the entire English Wikipedia and it contains the links found in the Wikipedia articles, thus reducing ambiguity in certain cases. Such OpenIE triples with unambiguous arguments are useful for bootstrapping OpenIE extractors as well as for downstream tasks such as KB population. Our final contribution is an analysis of OPIEC. Such an analysis is difficult to perform due to the openness and ambiguity of OpenIE extractions. Therefore, we compared the content of OPIEC with reference KBs (DBpedia and YAGO), which are not ambiguous and are also constructed from Wikipedia. Our analysis is (mostly) manual and reveals findings about semantic relatedness between OpenIE corpora and KBs, which are important for downstream tasks such as KB population (e.g., the study suggests that most knowledge found in OpenIE triples is relevant for the current KBs and is not present in the KBs).
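
Illustration: the abstract's Richard Feynman example can be made concrete with a toy sketch. The code below is not MinIE's method. It is a minimal illustration that assumes a hypothetical `Triple` type and two small hand-picked word lists (`DETERMINERS`, `DROPPABLE_MODIFIERS`), and it shows only the general idea of dropping overly specific words from a triple's arguments while leaving the head phrase intact.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for an OpenIE triple (subject; relation; object)."""
    subject: str
    relation: str
    obj: str


# Assumption: tiny word lists chosen only for this illustration.
# A real system would not rely on fixed lists like these.
DETERMINERS = {"the", "a", "an"}
DROPPABLE_MODIFIERS = {"great", "famous"}


def minimize_phrase(phrase: str) -> str:
    """Drop listed determiners and modifiers; keep the phrase if nothing would remain."""
    droppable = DETERMINERS | DROPPABLE_MODIFIERS
    kept = [word for word in phrase.split() if word.lower() not in droppable]
    return " ".join(kept) if kept else phrase


def minimize(triple: Triple) -> Triple:
    """Minimize the subject and object; the relation is left unchanged here."""
    return Triple(
        minimize_phrase(triple.subject),
        triple.relation,
        minimize_phrase(triple.obj),
    )


# Illustrative input, not taken from the thesis or from OPIEC.
verbose = Triple("the great Richard Feynman", "worked at", "Caltech")
compact = minimize(verbose)
print(compact)
```

The compact subject "Richard Feynman" is easier to match against a KB entity than the verbose phrase, which is the motivation the abstract gives for minimization.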




This entry is part of the university bibliography.

The document is provided by the publication server of the Universitätsbibliothek Mannheim.



