Using LLMs for the extraction and normalization of product attribute values


Brinkmann, Alexander ; Baumann, Nick ; Bizer, Christian



DOI: https://doi.org/10.1007/978-3-031-70626-4_15
URL: https://link.springer.com/chapter/10.1007/978-3-03...
Document Type: Conference or workshop publication
Year of publication: 2024
Book title: Advances in databases and information systems : 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024 ; Proceedings
Journal or publication series title: Lecture Notes in Computer Science
Volume: 14918
Page range: 217-230
Conference title: 28th European Conference on Advances in Databases and Information Systems (ADBIS 2024)
Conference location: Bayonne, France
Date of the conference: 28-31 August 2024
Editors: Tekli, Joe ; Gamper, Johann ; Chbeir, Richard ; Manolopoulos, Yannis
Place of publication: Berlin [et al.]
Publisher: Springer
ISBN: 978-3-031-70628-8 , 978-3-031-70626-4
ISSN: 0302-9743 , 1611-3349
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): information extraction , value normalization , Large Language Models
Abstract: Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.




This entry is part of the university bibliography.



