The WDC training dataset and gold standard for large-scale product matching
Primpeli, Anna; Peeters, Ralph; Bizer, Christian
DOI: https://doi.org/10.1145/3308560.3316609
URL: https://dl.acm.org/citation.cfm?doid=3308560.33166...
Additional URL: https://www2019.thewebconf.org/media/TheWebConf201...
Document type: Conference publication
Year of publication: 2019
Book title: Companion Proceedings of The 2019 World Wide Web Conference
Page range: 381-386
Event title: Workshop on e-Commerce and NLP at WWW 2019
Event location: San Francisco, CA
Event date: May 13-17, 2019
Editor: Liu, Ling
Place of publication: New York, NY, USA
Publisher: ACM
ISBN: 978-1-4503-6675-5
Language of publication: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science
Keywords (English): deep matching, embeddings, entity resolution, evaluation data, product matching, schema.org annotations
Abstract: A current research question in the area of entity resolution (also called link discovery or duplicate detection) is whether and in which cases embeddings and deep neural network based matching methods outperform traditional symbolic matching methods. The problem with answering this question is that deep learning based matchers need large amounts of training data. The entity resolution benchmark datasets that are currently available to the public are too small to properly evaluate this new family of matching methods. The WDC Training Dataset for Large-Scale Product Matching fills this gap. The English language subset of the training dataset consists of 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops which provide schema.org annotations including some form of product ID such as a GTIN or MPN. We also created a gold standard by manually verifying 2,200 pairs of offers belonging to four product categories. Using a subset of our training dataset together with this gold standard, we are able to publicly replicate the recent result of Mudgal et al. that embeddings and deep neural network based matching methods outperform traditional symbolic matching methods on less structured data.
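The abstract states that offers carry a product ID such as a GTIN or MPN and that the dataset consists of pairs of offers referring to the same product. As a rough illustration of that idea only, and not the authors' actual pipeline, the sketch below groups offers by a shared identifier and enumerates the resulting matching pairs. The field names `offer_id`, `gtin`, and `mpn`, the function `positive_pairs`, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def positive_pairs(offers):
    """Return sorted pairs of offer IDs that share a product identifier.

    Assumption: each offer is a dict with a hypothetical 'offer_id' key and
    optional 'gtin' / 'mpn' keys. Offers without any identifier are skipped.
    """
    clusters = defaultdict(list)
    for offer in offers:
        # Keep GTIN and MPN values in separate key spaces so that a GTIN
        # never collides with an identical-looking MPN string.
        for id_type in ("gtin", "mpn"):
            value = offer.get(id_type)
            if value:
                clusters[(id_type, value)].append(offer["offer_id"])
                break  # use the first identifier that is present

    pairs = set()
    for offer_ids in clusters.values():
        # Every two offers in the same cluster form one matching pair.
        pairs.update(combinations(sorted(offer_ids), 2))
    return sorted(pairs)


# Made-up sample offers, for illustration only.
sample_offers = [
    {"offer_id": "a", "gtin": "4006381333931"},
    {"offer_id": "b", "gtin": "4006381333931"},
    {"offer_id": "c", "gtin": "4006381333931"},
    {"offer_id": "d", "mpn": "X-100"},
    {"offer_id": "e", "mpn": "X-100"},
    {"offer_id": "f"},  # no identifier, so it yields no pairs
]

if __name__ == "__main__":
    for pair in positive_pairs(sample_offers):
        print(pair)
```

A cluster of n offers yields n(n-1)/2 pairs, which is one reason a large pool of identifier-annotated offers can produce a pair count in the millions.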
This entry is part of the university bibliography.
ORCID: Primpeli, Anna: https://orcid.org/0000-0002-1783-2482; Peeters, Ralph: https://orcid.org/0000-0003-3174-2616; Bizer, Christian: https://orcid.org/0000-0003-2367-0237