LLM4DDC: Adopting Large Language Models for research data classification using Dewey Decimal Classification


Shahi, Gautam Kishore ; Shigapov, Renat ; Hummel, Oliver



DOI: https://doi.org/10.11588/heibooks.1652.c23948
URL: https://books.ub.uni-heidelberg.de/heibooks/catalo...
URN: urn:nbn:de:bsz:180-madoc-711513
Document Type: Conference or workshop publication
Year of publication: 2025
Book title: E-Science-Tage 2025: research data management: challenges in a changing world
Page range: 476-484
Conference title: E-Science-Tage 2025
Location of the conference venue: Heidelberg, Germany
Date of the conference: 12-14 March 2025
Editors: Heuveline, Vincent ; Kling, Philipp ; Heuschkel, Florian ; Habinger, Sophie G. ; Krömer, Cora F.
Place of publication: Heidelberg
Publishing house: heiBOOKS
Publication language: English
Institution: Zentrale Einrichtungen > University Library
Pre-existing license: Creative Commons Attribution, Share Alike 4.0 International (CC BY-SA 4.0)
Subject: 004 Computer science, internet
Abstract: Classifying research data in institutional repositories is time-consuming and challenging. While the Dewey Decimal Classification (DDC) system is widely used for subject classification of texts, its application to research data metadata has been limited so far. This study explores the possible use of large language models (LLMs) and small language models (SLMs) for the automatic classification of research data in the context of DDC. It uses sample data from an existing dataset compiled from different institutions, mainly in Germany. We use a prompt engineering approach for LLMs and fine-tuning for SLMs, with RoBERTa as a baseline. Our results show that LLMs with prompt engineering are currently not able to classify metadata of research data into DDC classes as well as SLMs with fine-tuning. To foster adoption, we openly release our models, code, and datasets on GitHub for integration into research data infrastructures.


SDG 9: Industry, Innovation and Infrastructure


This document is provided by the publication server of the Mannheim University Library.

This record was not published during employment at the University of Mannheim; it is an external publication.



