Web-scale web table to knowledge base matching

Ritze, Dominique

URL:	https://ub-madoc.bib.uni-mannheim.de/43123
URN:	urn:nbn:de:bsz:180-madoc-431233
Document type:	Doctoral dissertation
Year of publication:	2017
Place of publication:	Mannheim
University:	Universität Mannheim
Reviewer:	Bizer, Christian
Date of oral examination:	6 November 2017
Publication language:	English
Institution:	Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Subject:	004 Computer science
Controlled keywords (SWD):	Matching
Free keywords (English):	Web Table, Matching, Knowledge Base
Abstract:	Millions of relational HTML tables are found on the World Wide Web. In contrast to unstructured text, relational web tables provide a compact representation of entities described by attributes. The data within these tables covers a broad topical range. Web table data is used for question answering, augmentation of search results, and knowledge base completion. Until a few years ago, only search engine companies such as Google and Microsoft owned large web crawls from which web tables could be extracted. Thus, researchers outside these companies were not able to work with web tables. In this thesis, the first publicly available web table corpus containing millions of web tables is introduced. The corpus enables interested researchers to experiment with web tables. A profile of the corpus is created to give insight into its characteristics and topics. Further, the potential of web tables for augmenting cross-domain knowledge bases is investigated. For the use case of knowledge base augmentation, it is necessary to understand the web table content. For this reason, web tables are matched to a knowledge base. The matching comprises three tasks: instance, property, and class matching. Existing web table to knowledge base matching systems either focus on a subset of these tasks or are evaluated using gold standards that cover only a subset of the challenges that arise when matching web tables to knowledge bases. This thesis systematically evaluates the utility of a wide range of features for the web table to knowledge base matching task using a single gold standard. The results of the evaluation are then used to design a holistic matching method that covers all matching tasks and outperforms state-of-the-art web table to knowledge base matching systems. To achieve these goals, we first propose the T2K Match algorithm, which addresses all three matching tasks in an integrated fashion.
In addition, we introduce the T2D gold standard, which covers a wide variety of challenges. By evaluating T2K Match against the T2D gold standard, we find that considering only the table content is insufficient. Hence, we include features of three categories: features found in the table, features found in the table context such as the page title, and features based on external resources such as a synonym dictionary. We analyze the utility of the features for each matching task. The analysis shows that certain problems cannot be overcome by matching each table to the knowledge base in isolation. In addition, relying on these features alone is not enough for the property matching task. Based on these findings, we extend T2K Match into T2K Match++, which exploits indirect matches to web tables about the same topic and uses knowledge derived from the knowledge base. We show that T2K Match++ outperforms all state-of-the-art web table to knowledge base matching approaches on the T2D and Limaye gold standards. Most systems show good results on one matching task, but T2K Match++ is the only system that achieves F-measure scores above 0.8 for all tasks. Compared to the results of the best-performing system, TableMiner+, the F-measure for the difficult property matching task is increased by 0.08, and for the class and instance matching tasks by 0.05 and 0.03, respectively.
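
The F-measure scores quoted in the abstract combine precision and recall of the correspondences a system produces against those in a gold standard. The following is a minimal sketch of that computation, assuming the common balanced F1 definition; the abstract does not state which variant the thesis uses. The function name `f_measure`, the pair representation, and the example identifiers are hypothetical and are not taken from T2K Match or the T2D gold standard.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F1) for two collections of correspondences.

    Each correspondence is a hashable pair, e.g. (table row id, KB entity).
    Assumption: balanced F1, the harmonic mean of precision and recall.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical instance correspondences, for illustration only.
gold = {
    ("row1", "kb:Mannheim"),
    ("row2", "kb:Berlin"),
    ("row3", "kb:Hamburg"),
    ("row4", "kb:Munich"),
}
predicted = {
    ("row1", "kb:Mannheim"),
    ("row2", "kb:Berlin"),
    ("row3", "kb:Bremen"),  # wrong entity, counts against precision
}

precision, recall, f1 = f_measure(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same function could be applied separately to instance, property, and class correspondences to obtain one score per matching task.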