Column type annotation using ChatGPT


Korini, Keti ; Bizer, Christian



URL: https://ceur-ws.org/Vol-3462/TADA1.pdf
URN: urn:nbn:de:bsz:180-madoc-651328
Document Type: Conference or workshop publication
Year of publication: 2023
Book title: Joint proceedings of workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023, VLDBW 2023
Journal or publication series: CEUR Workshop Proceedings
Volume: 3462
Page range: 1-12
Conference title: Workshop on Tabular Data Analysis (TaDA) @ VLDB 2023
Location of the conference venue: Vancouver, Canada
Date of the conference: 01.09.2023
Editors: Bordawekar, Rajesh ; Cappiello, Cinzia ; Efthymiou, Vasilis
Place of publication: Aachen, Germany
Publishing house: RWTH Aachen
ISSN: 1613-0073
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Pre-existing license: Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject: 004 Computer science, internet
Keywords (English): table annotation, column type annotation, LLMs, ChatGPT
Abstract: Column type annotation is the task of annotating the columns of a relational table with the semantic type of the values contained in each column. Column type annotation is an important pre-processing step for data search and data integration in the context of data lakes. State-of-the-art column type annotation methods either rely on matching table columns to properties of a knowledge graph or fine-tune pre-trained language models such as BERT for column type annotation. In this work, we take a different approach and explore using ChatGPT for column type annotation. We evaluate different prompt designs in zero- and few-shot settings and experiment with providing task definitions and detailed instructions to the model. We further implement a two-step table annotation pipeline which first determines the class of the entities described in the table and, depending on this class, asks ChatGPT to annotate columns using only the relevant subset of the overall vocabulary. Using instructions as well as the two-step pipeline, ChatGPT reaches F1 scores of over 85% in zero- and one-shot setups. To reach a similar F1 score, a RoBERTa model needs to be fine-tuned with 356 examples. This comparison shows that ChatGPT is able to deliver competitive results for the column type annotation task given no or only a minimal amount of task-specific demonstrations.
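
The two-step pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the vocabulary, the prompt wording, the table serialization, and the `ask_llm` callable (a stand-in for whatever chat model client is used) are all assumptions made for illustration, and the fallback to the full vocabulary when the class is not recognized is a design choice of this sketch only.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical vocabulary: table class -> candidate column labels (not taken from the paper).
VOCABULARY: Dict[str, List[str]] = {
    "Hotel": ["name", "address", "telephone", "priceRange"],
    "Event": ["name", "startDate", "location", "organizer"],
}


def serialize_table(columns: List[List[str]], max_rows: int = 5) -> str:
    """Render the first rows of a column-oriented table as pipe-separated text."""
    rows = list(zip(*columns))[:max_rows]
    return "\n".join(" | ".join(row) for row in rows)


def build_class_prompt(table_text: str, classes: List[str]) -> str:
    """Step 1 prompt (assumed wording): ask for the class of the described entities."""
    return (
        "Classify the table below into exactly one of these classes: "
        + ", ".join(classes)
        + ".\nAnswer with the class name only.\n\n"
        + table_text
    )


def build_column_prompt(table_text: str, labels: List[str], n_columns: int) -> str:
    """Step 2 prompt (assumed wording): annotate columns with a restricted label set."""
    return (
        f"Annotate each of the {n_columns} columns of the table below "
        "with one label from: "
        + ", ".join(labels)
        + ".\nAnswer with one label per line, in column order.\n\n"
        + table_text
    )


def annotate_table(
    columns: List[List[str]], ask_llm: Callable[[str], str]
) -> List[Optional[str]]:
    """Return one label (or None) per column using the two-step scheme."""
    table_text = serialize_table(columns)

    # Step 1: determine the table class.
    answer = ask_llm(build_class_prompt(table_text, list(VOCABULARY))).strip()

    # Step 2: restrict the label set to the predicted class; fall back to the
    # full vocabulary if the answer is not a known class (assumption of this sketch).
    if answer in VOCABULARY:
        labels = VOCABULARY[answer]
    else:
        labels = sorted({label for group in VOCABULARY.values() for label in group})

    reply = ask_llm(build_column_prompt(table_text, labels, len(columns)))
    predicted = [line.strip() for line in reply.splitlines() if line.strip()]

    # Discard out-of-vocabulary answers and pad missing ones with None.
    result: List[Optional[str]] = [
        p if p in labels else None for p in predicted[: len(columns)]
    ]
    result += [None] * (len(columns) - len(result))
    return result


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a chat model, used only to exercise the sketch."""
    if prompt.startswith("Classify"):
        return "Hotel"
    return "name\ntelephone"
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; in practice `fake_llm` would be replaced by a function that sends the prompt to a chat model and returns its text reply.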




This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.




ORCID (Bizer, Christian): 0000-0003-2367-0237