Using schema.org annotations for training and maintaining product matchers
Peeters, Ralph; Primpeli, Anna; Wichtlhuber, Benedikt; Bizer, Christian
DOI:
|
https://doi.org/10.1145/3405962.3405964
|
URL:
|
https://dl.acm.org/doi/10.1145/3405962.3405964
|
Additional URL:
|
https://dl.acm.org/doi/proceedings/10.1145/3405962
|
Document Type:
|
Conference or workshop publication
|
Year of publication:
|
2020
|
Book title:
|
WIMS 2020: proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, Biarritz, France, June 30 - July 3, 2020
|
Page range:
|
195-204
|
Conference title:
|
WIMS 2020
|
Location of the conference venue:
|
Online
|
Date of the conference:
|
30.06.-03.07.2020
|
Editor:
|
Chbeir, Richard
|
Place of publication:
|
New York, NY
|
Publishing house:
|
ACM
|
ISBN:
|
978-1-4503-7542-9
|
Publication language:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
|
Subject:
|
004 Computer science, internet
|
Keywords (English):
|
e-commerce, schema.org, product matching, semantic web, neural networks
|
Abstract:
|
Product matching is a central task within e-commerce applications such as price comparison portals and online marketplaces. State-of-the-art product matching methods achieve F1 scores above 0.90 using deep learning techniques combined with huge amounts of training data (e.g., more than 100K pairs of offers). Gathering and maintaining such large training corpora is costly, as it implies labeling pairs of offers as matches or non-matches. Becoming good at product matching thus requires a major investment from an e-commerce company. This paper shows that the manual labeling of training data for product matching can be replaced by relying exclusively on schema.org annotations gathered from the public Web. We show that, using only schema.org data for training, we are able to achieve F1 scores between 0.92 and 0.95, depending on the product category. As new products appear every day, it is important that matching models can be maintained with justifiable effort. In order to give practical advice on how to maintain matching models, we compare the performance of deep learning and traditional matching models on unseen products and experiment with different fine-tuning and re-training strategies for model maintenance, again using only schema.org annotations as training data. Finally, as using the public Web as distant supervision carries inherent noise, we evaluate deep learning and traditional matching models with regard to their label-noise resistance and show that deep learning is able to deal with the amounts of identifier noise found in schema.org annotations.
|
Additional information:
|
Online resource
|
| This entry is part of the university bibliography. |
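The abstract's idea of replacing manual labels with schema.org annotations can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so the following is an assumption: offers that carry the same product identifier are treated as matches, and offers with different identifiers as non-matches. The function name `build_training_pairs` and the keys `id`, `title`, and `product_id` are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_training_pairs(offers):
    """Derive weakly labeled offer pairs from shared product identifiers.

    `offers` is a list of dicts with the hypothetical keys 'id', 'title',
    and 'product_id' (for example a GTIN or MPN taken from a schema.org
    annotation). Returns a list of (offer_id_a, offer_id_b, label) tuples,
    where label 1 means match and 0 means non-match.
    """
    # Group offers by normalized identifier; offers without one are skipped.
    clusters = defaultdict(list)
    for offer in offers:
        product_id = offer.get("product_id")
        if product_id:
            clusters[product_id.strip().lower()].append(offer)

    pairs = []

    # Assumption: offers sharing an identifier describe the same product.
    for members in clusters.values():
        for a, b in combinations(members, 2):
            pairs.append((a["id"], b["id"], 1))

    # Assumption: offers with different identifiers describe different products.
    for (_, members_a), (_, members_b) in combinations(clusters.items(), 2):
        for a in members_a:
            for b in members_b:
                pairs.append((a["id"], b["id"], 0))

    return pairs
```

Because identifiers on the public Web can be wrong or reused, labels derived this way are noisy, which is the identifier noise the abstract refers to. Enumerating every cross-cluster pair grows quadratically, so a practical setup would likely sample the non-matches rather than keep all of them.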
ORCID:
Peeters, Ralph (https://orcid.org/0000-0003-3174-2616); Primpeli, Anna (https://orcid.org/0000-0002-1783-2482); Wichtlhuber, Benedikt; Bizer, Christian (https://orcid.org/0000-0003-2367-0237)