Using LLMs for the extraction and normalization of product attribute values


Brinkmann, Alexander ; Baumann, Nick ; Bizer, Christian



DOI: https://doi.org/10.1007/978-3-031-70626-4_15
URL: https://link.springer.com/chapter/10.1007/978-3-03...
Document type: Conference publication
Year of publication: 2024
Book title: Advances in databases and information systems : 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024 ; Proceedings
Journal or series title: Lecture Notes in Computer Science
Volume: 14918
Page range: 217-230
Event title: 28th European Conference on Advances in Databases and Information Systems (ADBIS 2024)
Event location: Bayonne, France
Event date: 28-31 August 2024
Editors: Tekli, Joe ; Gamper, Johann ; Chbeir, Richard ; Manolopoulos, Yannis
Place of publication: Berlin [et al.]
Publisher: Springer
ISBN: 978-3-031-70628-8 , 978-3-031-70626-4
ISSN: 0302-9743 , 1611-3349
Language of publication: English
Institution: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik (School of Business Informatics and Mathematics) > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science
Keywords (English): information extraction , value normalization , Large Language Models
Abstract: Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.
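
The workflow described in the abstract (prompting an LLM to extract attribute values, then normalizing them) could be sketched roughly as follows. This is a minimal illustration and not the authors' prompt templates or code: the attribute names, the prompt wording, the "n/a" convention, and the helpers `build_zero_shot_prompt`, `parse_response`, and `to_gigabytes` are all assumptions made for this example, and the LLM call itself is left out.

```python
import json
import re

# Hypothetical target attributes; not taken from the WDC-PAVE schema.
ATTRIBUTES = ["Brand", "Capacity"]


def build_zero_shot_prompt(title, description, attributes):
    """Assemble a zero-shot instruction asking for one JSON object."""
    attr_list = ", ".join(attributes)
    return (
        "Extract the following attributes from the product offer and "
        'normalize each value. Use "n/a" if an attribute is not present.\n'
        f"Attributes: {attr_list}\n"
        "Respond with a single JSON object.\n\n"
        f"Title: {title}\n"
        f"Description: {description}"
    )


def parse_response(raw, attributes):
    """Parse the model's answer, keeping only the requested attributes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: str(data.get(name, "n/a")) for name in attributes}


# Assumed decimal conversion factors for a unit-of-measurement normalization.
_UNIT_TO_GB = {"MB": 0.001, "GB": 1.0, "TB": 1000.0}


def to_gigabytes(value):
    """Convert strings such as '2 TB' to a number of gigabytes, else None."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", value, flags=re.IGNORECASE
    )
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * _UNIT_TO_GB[unit.upper()]
```

In this sketch the prompt text would be sent to a model of choice, the raw answer passed to `parse_response`, and unit-bearing values passed through a converter such as `to_gigabytes`. The other normalization types named in the abstract (name expansion, generalization, string wrangling) would need their own rules or further instructions in the prompt.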




This entry is part of the university bibliography.



