Transformer-based subject entity detection in Wikipedia listings


Heist, Nicolas ; Paulheim, Heiko



URL: https://ceur-ws.org/Vol-3342/paper-2.pdf
Additional URL: https://ceur-ws.org/Vol-3342/
URN: urn:nbn:de:bsz:180-madoc-647880
Document Type: Conference or workshop publication
Year of publication: 2022
Book title: Proceedings of the 5th Workshop on Deep Learning for Knowledge Graphs (DL4KG 2022) co-located with the 21st International Semantic Web Conference (ISWC 2022)
Journal or publication series: CEUR Workshop Proceedings
Volume: 3342
Page range: 1-16
Conference title: Deep Learning for Knowledge Graphs Workshop (DL4KG) @ ISWC 2022
Location of the conference venue: Online
Date of the conference: 24.10.2022
Editors: Alam, Mehwish ; Buscaldi, Davide ; Cochez, Michael ; Osborne, Francesco ; Reforgiato Recupero, Diego
Place of publication: Aachen, Germany
Publishing house: RWTH Aachen
ISSN: 1613-0073
Publication language: English
Institution: School of Business Informatics and Mathematics > Data Science (Paulheim 2018-)
Pre-existing license: Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject: 004 Computer science, internet
Abstract: In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities (in particular, about long-tail or emerging entities) in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extract 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph.




This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.



