Stitching web tables for improving matching quality


Lehmberg, Oliver ; Bizer, Christian



DOI: https://doi.org/10.14778/3137628.3137657
URL: https://www.researchgate.net/publication/319599651...
Additional URL: http://www.vldb.org/pvldb/vol10/p1502-lehmberg.pdf
Document Type: Article
Year of publication: 2017
Journal or publication series: Proceedings of the VLDB Endowment
Volume: 10
Issue number: 11
Page range: 1502-1513
Place of publication: New York, NY [u.a.]
Publishing house: Association for Computing Machinery
ISSN: 2150-8097
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): Data Integration, Matching, Web Tables, Knowledge Bases, DBpedia
Abstract: HTML tables on web pages (“web tables”) cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables. Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
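
Illustration (not part of the original record): the idea of stitching tables from the same web site can be sketched in Python as follows. This is a minimal sketch under strong assumptions, not the authors' method: exact equality of normalized column headers stands in for the schema matching and holistic correspondence refinement mentioned in the abstract, and the function names, the dictionary keys ("url", "header", "rows"), and the sample data are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def normalize_header(header):
    """Lowercase and strip column labels so trivially different headers compare equal."""
    return tuple(label.strip().lower() for label in header)


def stitch_tables(tables):
    """Union the rows of tables that share a web site and a normalized header.

    `tables` is a list of dicts with the hypothetical keys "url", "header", "rows".
    Assumption: exact header equality replaces a real schema matching step.
    """
    groups = defaultdict(list)
    for table in tables:
        site = urlparse(table["url"]).netloc  # restrict stitching to one web site
        key = (site, normalize_header(table["header"]))
        groups[key].extend(table["rows"])
    return [
        {"site": site, "header": list(header), "rows": rows}
        for (site, header), rows in groups.items()
    ]


# Made-up sample input: two small tables from one site, one from another.
tables = [
    {"url": "http://example.org/a", "header": ["Name", "Value"], "rows": [["x", "1"]]},
    {"url": "http://example.org/b", "header": ["name ", "value"], "rows": [["y", "2"]]},
    {"url": "http://example.com/c", "header": ["Name", "Value"], "rows": [["z", "3"]]},
]
stitched = stitch_tables(tables)
```

In this sketch the two example.org tables are merged into one stitched table, while the example.com table stays separate because stitching never crosses web sites. A realistic pipeline would replace the header comparison with a proper schema matcher before passing the stitched tables to a system such as T2K Match or COMA.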




This entry is part of the university bibliography.



