Web table integration and profiling for knowledge base augmentation

Lehmberg, Oliver

URL: https://madoc.bib.uni-mannheim.de/52346
URN: urn:nbn:de:bsz:180-madoc-523468
Document Type: Doctoral dissertation
Year of publication: 2019
Place of publication: Mannheim
University: Universität Mannheim
Evaluator: Bizer, Christian
Date of oral examination: 26 September 2019
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): web tables, data integration, knowledge base augmentation, data profiling
Abstract: HTML tables on web pages ("web tables") have been used successfully as a data source for several applications. They can be extracted from web pages at large scale, resulting in corpora of millions of web tables. Until now, however, little has been known about the general distribution of topics and the specific types of data contained in the tables found on the Web. This knowledge is essential to understanding the potential application areas and topical coverage of web tables as a data source. It can be obtained by integrating web tables with a knowledge base, which enables the semantic interpretation of their content and allows for their topical profiling. In turn, the knowledge base can be augmented with new statements from the web tables. This is challenging because the data volume and variety are much larger than in traditional data integration scenarios, in which only a small number of data sources are integrated.

The contributions of this thesis are methods for the integration of web tables with a knowledge base and the profiling of large-scale web table corpora through the application of these methods. For this profiling, two corpora of 147 million and 233 million web tables, respectively, are created and made publicly available. These corpora are two of only three that are openly available for research on web tables. Their data profile reveals that most web tables have very few rows, with a median of 6 rows per web table, and that between 35% and 52% of all columns contain non-textual values, such as numbers or dates. These two characteristics have been mostly ignored in the literature about web tables and are addressed by the methods presented in this thesis.

The first method, T2K Match, is an algorithm for semantic table interpretation that annotates web tables with classes, properties, and entities from a knowledge base. Unlike most algorithms for these tasks, it is not limited to the annotation of columns that contain the names of entities. Its application to a large-scale web table corpus results in the most fine-grained topical data profile of web tables at the time of writing, but also reveals that small web tables cannot be processed with high quality. For such small web tables, a method that stitches them into larger tables is presented and shown to drastically improve the quality of the results.

The data profile further shows that the majority of the columns in web tables in which classes and entities can be recognised have no corresponding properties in the knowledge base. This makes them candidates for new properties that can be added to the knowledge base. The current methods for this task, however, suffer from the oversimplified assumption that web tables contain only binary relations. This results in the extraction of incomplete relations from the web tables as new properties and makes their correct interpretation impossible. To increase completeness, a method is presented that generates additional data from the context of the web tables and synthesises n-ary relations from all web tables of a web site. The application of this method to the second large-scale web table corpus shows that web tables contain a large number of n-ary relations. This means that the data contained in web tables is of higher complexity than previously assumed.
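
The two profile statistics quoted in the abstract (median rows per table and share of non-textual columns) can be illustrated with a minimal Python sketch. This is not the thesis's profiling code: the toy tables, the regular expressions, the 0.5 threshold, and the function names `is_non_textual` and `profile` are all assumptions made for illustration.

```python
import re
from statistics import median

# Assumed heuristics for "non-textual" cells; the thesis may use different rules.
NUMERIC = re.compile(r"^[+-]?\d+([.,]\d+)*%?$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_non_textual(column, threshold=0.5):
    """Treat a column as non-textual if most non-empty cells look like numbers or dates."""
    cells = [c.strip() for c in column if c.strip()]
    if not cells:
        return False
    hits = sum(1 for c in cells if NUMERIC.match(c) or DATE.match(c))
    return hits / len(cells) > threshold


def profile(tables):
    """Return (median row count, share of non-textual columns) for a list of tables.

    Each table is a list of columns; each column is a list of cell strings.
    """
    row_counts = [len(t[0]) if t else 0 for t in tables]
    columns = [col for t in tables for col in t]
    share = sum(is_non_textual(c) for c in columns) / len(columns) if columns else 0.0
    return median(row_counts), share


# Hypothetical toy corpus of two small tables (header rows omitted).
tables = [
    [["Berlin", "Paris", "Rome"], ["3645000", "2161000", "2873000"]],
    [["Alice", "Bob"], ["1990-01-01", "1985-06-30"], ["Engineer", "Chef"]],
]
median_rows, non_textual_share = profile(tables)
```

On a real corpus the same two aggregates would be computed over millions of tables, so the per-column check would probably need to be streamed or distributed rather than held in memory as it is here.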

This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.
