Using schema.org annotations for training and maintaining product matchers


Peeters, Ralph ; Primpeli, Anna ; Wichtlhuber, Benedikt ; Bizer, Christian



DOI: https://doi.org/10.1145/3405962.3405964
URL: https://dl.acm.org/doi/10.1145/3405962.3405964
Additional URL: https://dl.acm.org/doi/proceedings/10.1145/3405962
Document Type: Conference or workshop publication
Year of publication: 2020
Book title: WIMS 2020: proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, Biarritz, France, June 30 - July 3, 2020
Page range: 195-204
Conference title: WIMS 2020
Location of the conference venue: Online
Date of the conference: June 30 - July 3, 2020
Editor: Chbeir, Richard
Place of publication: New York, NY
Publishing house: ACM
ISBN: 978-1-4503-7542-9
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): e-commerce, schema.org, product matching, semantic web, neural networks
Abstract: Product matching is a central task within e-commerce applications such as price comparison portals and online marketplaces. State-of-the-art product matching methods achieve F1 scores above 0.90 using deep learning techniques combined with huge amounts of training data (e.g., more than 100K pairs of offers). Gathering and maintaining such large training corpora is costly, as it implies labeling pairs of offers as matches or non-matches. Becoming good at product matching thus requires a major investment from an e-commerce company. This paper shows that the manual labeling of training data for product matching can be replaced by relying exclusively on schema.org annotations gathered from the public Web. We show that using only schema.org data for training, we are able to achieve F1 scores between 0.92 and 0.95, depending on the product category. As new products appear every day, it is important that matching models can be maintained with justifiable effort. In order to give practical advice on how to maintain matching models, we compare the performance of deep learning and traditional matching models on unseen products and experiment with different fine-tuning and re-training strategies for model maintenance, again using only schema.org annotations as training data. Finally, as using the public Web as distant supervision carries inherent noise, we evaluate deep learning and traditional matching models with regard to their label-noise resistance and show that deep learning is able to deal with the amounts of identifier noise found in schema.org annotations.
Additional information: Online resource




This entry is part of the university bibliography.



