Using LLMs for the extraction and normalization of product attribute values
Brinkmann, Alexander; Baumann, Nick; Bizer, Christian
DOI: https://doi.org/10.1007/978-3-031-70626-4_15
URL: https://link.springer.com/chapter/10.1007/978-3-03...
Document Type: Conference or workshop publication
Year of publication: 2024
Book title: Advances in databases and information systems : 28th European Conference, ADBIS 2024, Bayonne, France, August 28-31, 2024 ; Proceedings
Journal or publication series title: Lecture Notes in Computer Science
Volume: 14918
Page range: 217-230
Conference title: 28th European Conference on Advances in Databases and Information Systems (ADBIS 2024)
Location of the conference venue: Bayonne, France
Date of the conference: August 28-31, 2024
Editors: Tekli, Joe; Gamper, Johann; Chbeir, Richard; Manolopoulos, Yannis
Place of publication: Berlin [u.a.]
Publishing house: Springer
ISBN: 978-3-031-70628-8, 978-3-031-70626-4
ISSN: 0302-9743, 1611-3349
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): information extraction, value normalization, Large Language Models
Abstract:
Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.
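The four normalization operation types named in the abstract can be illustrated with a small rule-based sketch. This is not the paper's method: the paper instructs an LLM to perform these operations, whereas the functions below only show what each operation type might look like on a single value. All function names, lookup tables, and example values are hypothetical and are not taken from WDC-PAVE.

```python
import re

# Hypothetical lookup tables for illustration only; not taken from WDC-PAVE.
NAME_EXPANSIONS = {"HP": "Hewlett-Packard", "MS": "Microsoft"}
GENERALIZATIONS = {"navy": "blue", "crimson": "red"}
CM_PER_UNIT = {"mm": 0.1, "cm": 1.0, "m": 100.0, "in": 2.54}


def expand_name(value: str) -> str:
    """Name expansion: replace an abbreviation with its full form."""
    cleaned = value.strip()
    return NAME_EXPANSIONS.get(cleaned, cleaned)


def generalize(value: str) -> str:
    """Generalization: map a specific value to a broader category."""
    key = value.strip().lower()
    return GENERALIZATIONS.get(key, key)


def to_centimetres(value: str) -> float:
    """Unit conversion: parse '<number> <unit>' and return centimetres."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse length: {value!r}")
    unit = match.group(2).lower()
    if unit not in CM_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return round(float(match.group(1)) * CM_PER_UNIT[unit], 2)


def wrangle(value: str) -> str:
    """String wrangling: drop spaces and hyphens, then uppercase."""
    return re.sub(r"[\s-]+", "", value).upper()
```

Under these assumptions, `expand_name("HP")` yields the expanded brand name and `to_centimetres("10 in")` converts a length to a single unified scale, which is the kind of per-attribute normalization the abstract describes.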
This entry is part of the university bibliography.