Augmenting cross-domain knowledge bases using web tables
Oulabi, Yaser
URL:
|
https://madoc.bib.uni-mannheim.de/55962
|
URN:
|
urn:nbn:de:bsz:180-madoc-559627
|
Document Type:
|
Doctoral dissertation
|
Year of publication:
|
2020
|
Place of publication:
|
Mannheim
|
University:
|
Universität Mannheim
|
Evaluator:
|
Bizer, Christian
|
Date of oral examination:
|
3 August 2020
|
Publication language:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
|
Subject:
|
004 Computer science, internet
|
Subject headings (SWD):
|
Erweiterung ; Wissensbasis ; Tabelle ; HTML ; Datenintegration
|
Keywords (English):
|
Data Integration , Knowledge Base Augmentation , Web Tables
|
Abstract:
|
Cross-domain knowledge bases are increasingly used for a large variety of applications. As the usefulness of a knowledge base for many of these applications increases with its completeness, augmenting knowledge bases with new knowledge is an important task. A source for this new knowledge could be in the form of web tables, which are relational HTML tables extracted from the Web.
This thesis researches data integration methods for cross-domain knowledge base augmentation from web tables. Existing methods have focused on the task of slot filling static data. We research methods that additionally enable augmentation in the form of slot filling time-dependent data and entity expansion.
When augmenting knowledge bases using time-dependent web table data, we require time-aware fusion methods. They identify from a set of conflicting web table values the one that is valid given a certain temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks. We introduce two time-aware fusion approaches. In the first, we extract timestamps from the table and its context to exploit as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps. We introduce a second time-aware fusion method that exploits a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps.
Entity expansion enriches a knowledge base with previously unknown long-tail entities. It is a task that to our knowledge has not been researched before. We introduce the Long-Tail Entity Extraction Pipeline, the first system that can perform entity expansion from web table data. The pipeline works by employing identity resolution twice, once to disambiguate between entity occurrences within web tables, and once between entities created from web tables and existing entities in the knowledge base. In addition to identifying new long-tail entities, the pipeline also creates their descriptions according to the knowledge base schema.
By running the pipeline on a large-scale web table corpus, we profile the potential of web tables for the task of entity expansion. We find, that given certain classes, we can enrich a knowledge base with tens and even hundreds of thousands new entities and corresponding facts.
Finally, we introduce a weak supervision approach for long-tail entity extraction, where supervision in the form of a large number of manually labeled matching and non-matching pairs is substituted with a small set of bold matching rules build using the knowledge base schema. Using this, we can reduce the supervision effort required to train our pipeline to enable cross-domain entity expansion at web-scale.
In the context of this research, we created and published two datasets. The Time-Dependent Ground Truth contains time-dependent knowledge with more than one million temporal facts and corresponding temporal scope annotations. It could potentially be employed for a large variety of tasks that consider the temporal aspect of data. We also built the Web Tables for Long-Tail Entity Extraction gold standard, the first benchmark for the task of entity expansion from web tables.
|
| Dieser Eintrag ist Teil der Universitätsbibliographie. |
| Das Dokument wird vom Publikationsserver der Universitätsbibliothek Mannheim bereitgestellt. |
Search Authors in
You have found an error? Please let us know about your desired correction here: E-Mail
Actions (login required)
|
Show item |
|
|