Using schema.org annotations for training and maintaining product matchers


Peeters, Ralph ; Primpeli, Anna ; Wichtlhuber, Benedikt ; Bizer, Christian



DOI: https://doi.org/10.1145/3405962.3405964
URL: https://dl.acm.org/doi/10.1145/3405962.3405964
Weitere URL: https://dl.acm.org/doi/proceedings/10.1145/3405962
Dokumenttyp: Konferenzveröffentlichung
Erscheinungsjahr: 2020
Buchtitel: WIMS 2020: proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, Biarritz, France, June 30 - July 3, 2020
Seitenbereich: 195-204
Veranstaltungstitel: WIMS 2020
Veranstaltungsort: Online
Veranstaltungsdatum: 30.06.-03.07.2020
Herausgeber: Chbeir, Richard
Ort der Veröffentlichung: New York, NY
Verlag: ACM
ISBN: 978-1-4503-7542-9
Sprache der Veröffentlichung: Englisch
Einrichtung: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Fachgebiet: 004 Informatik
Freie Schlagwörter (Englisch): e-commerce , schema.org , product matching , semantic web , neural networks
Abstract: Product matching is a central task within e-commerce applications such as price comparison portals and online market places. State-of-the-art product matching methods achieve F1 scores above 0.90 using deep learning techniques combined with huge amounts of training data (e.g > 100K pairs of offers). Gathering and maintaining such large training corpora is costly, as it implies labeling pairs of offers as matches or non-matches. Acquiring the ability to be good at product matching thus means a major investment for an e-commerce company. This paper shows that the manual labeling of training data for product matching can be replaced by relying exclusively on schema.org annotations gathered from the public Web. We show that using only schema.org data for training, we are able to achieve F1 scores between 0.92 and 0.95 depending on the product category. As new products appear everyday, it is important that matching models can be maintained with justifiable effort. In order to give practical advice on how to maintain matching models, we compare the performance of deep learning and traditional matching models on unseen products and experiment with different fine-tuning and re-training strategies for model maintenance, again using only schema.org annotations as training data. Finally, as using the public Web as distant supervision carries inherent noise, we evaluate deep learning and traditional matching models with regards to their label-noise resistance and show that deep learning is able to deal with the amounts of identifier-noise found in schema.org annotations.
Zusätzliche Informationen: Online-Ressource




Dieser Eintrag ist Teil der Universitätsbibliographie.




Metadaten-Export


Zitation


+ Suche Autoren in

+ Aufruf-Statistik

Aufrufe im letzten Jahr

Detaillierte Angaben



Sie haben einen Fehler gefunden? Teilen Sie uns Ihren Korrekturwunsch bitte hier mit: E-Mail


Actions (login required)

Eintrag anzeigen Eintrag anzeigen