Using LLMs for the extraction and normalization of product attribute values

Brinkmann, Alexander ; Baumann, Nick ; Bizer, Christian

DOI:	https://doi.org/10.1007/978-3-031-70626-4_15
URL:	https://link.springer.com/chapter/10.1007/978-3-03...
Document type:	Conference publication
Year of publication:	2024
Book title:	Advances in databases and information systems : 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024 ; Proceedings
Title of a journal or series:	Lecture Notes in Computer Science
Volume:	14918
Page range:	217-230
Event title:	28th European Conference on Advances in Databases and Information Systems (ADBIS 2024)
Event location:	Bayonne, France
Event date:	August 28-31, 2024
Editors:	Tekli, Joe ; Gamper, Johann ; Chbeir, Richard ; Manolopoulos, Yannis
Place of publication:	Berlin [et al.]
Publisher:	Springer
ISBN:	978-3-031-70628-8 , 978-3-031-70626-4
ISSN:	0302-9743 , 1611-3349
Language of publication:	English
Institution:	School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject:	004 Computer science
Keywords (English):	information extraction , value normalization , Large Language Models
Abstract:	Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.

This entry is part of the university bibliography.
