Stitching web tables for improving matching quality
Lehmberg, Oliver; Bizer, Christian
DOI: https://doi.org/10.14778/3137628.3137657
URL: https://www.researchgate.net/publication/319599651...
Additional URL: http://www.vldb.org/pvldb/vol10/p1502-lehmberg.pdf
Document type: Journal article
Year of publication: 2017
Journal or series title: Proceedings of the VLDB Endowment
Volume: 10
Issue: 11
Page range: 1502-1513
Place of publication: New York, NY [et al.]
Publisher: Association for Computing Machinery
ISSN: 2150-8097
Language of publication: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science
Keywords (English): Data Integration, Matching, Web Tables, Knowledge Bases, DBpedia
Abstract:
HTML tables on web pages (“web tables”) cover a wide variety of topics. Data from web tables can thus be useful for tasks such as knowledge base completion or ad hoc table extension. Before table data can be used for these tasks, the tables must be matched to the respective knowledge base or base table. The challenges of web table matching are the high heterogeneity and the small size of the tables. Though it is known that the majority of web tables are very small, the gold standards that are used to compare web table matching systems mostly consist of larger tables. In this experimental paper, we evaluate T2K Match, a web table to knowledge base matching system, and COMA, a standard schema matching tool, using a sample of web tables that is more realistic than the gold standards that were previously used. We find that both systems fail to produce correct results for many of the very small tables in the sample. As a remedy, we propose to stitch (combine) the tables from each web site into larger ones and match these enlarged tables to the knowledge base or base table afterwards. For this stitching process, we evaluate different schema matching methods in combination with holistic correspondence refinement. Limiting the stitching procedure to web tables from the same web site decreases the heterogeneity and allows us to stitch tables with very high precision. Our experiments show that applying table stitching before running the actual matching method improves the matching results by 0.38 in F1-measure for T2K Match and by 0.14 for COMA. Also, stitching the tables allows us to reduce the number of tables in our corpus from 5 million original web tables to as few as 100,000 stitched tables.
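The stitching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper evaluates schema matching methods with holistic correspondence refinement, whereas this sketch swaps in a much simpler rule and only unions tables from the same web site whose normalised column headers are identical. The function name `stitch_tables` and the dictionary keys `url`, `header`, and `rows` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlparse


def stitch_tables(tables):
    """Union tables that share a web site and an identical set of column headers.

    `tables` is a list of dicts with hypothetical keys:
    "url" (source page), "header" (list of column names), "rows" (list of lists).
    Exact header equality is an assumption standing in for real schema matching.
    """
    groups = defaultdict(list)
    for table in tables:
        site = urlparse(table["url"]).netloc
        # Normalise headers so differences in case, spacing, or column order
        # do not split tables that plainly share a schema.
        key = (site, tuple(sorted(h.strip().lower() for h in table["header"])))
        groups[key].append(table)

    stitched = []
    for (site, header), members in groups.items():
        rows = []
        for table in members:
            normalised = [h.strip().lower() for h in table["header"]]
            # Reorder each row into the shared (sorted) column order.
            index = [normalised.index(h) for h in header]
            rows.extend([row[i] for i in index] for row in table["rows"])
        stitched.append({"site": site, "header": list(header), "rows": rows})
    return stitched
```

Restricting each group to a single site mirrors the abstract's point that same-site tables are less heterogeneous; a realistic system would replace the exact-header key with learned or similarity-based column correspondences.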
This entry is part of the university bibliography.