Transformer-based subject entity detection in Wikipedia listings

PDF:	paper-2.pdf (published version, 483 kB)

URL:	https://ceur-ws.org/Vol-3342/paper-2.pdf
Additional URL:	https://ceur-ws.org/Vol-3342/
URN:	urn:nbn:de:bsz:180-madoc-647880
Document type:	Conference publication
Year of publication:	2022
Book title:	Proceedings of the 5th Workshop on Deep Learning for Knowledge Graphs (DL4KG 2022) co-located with the 21st International Semantic Web Conference (ISWC 2022)
Journal or series title:	CEUR Workshop Proceedings
Volume:	3342
Page range:	1-16
Event title:	Deep Learning for Knowledge Graphs Workshop (DL4KG) @ ISWC 2022
Event location:	Online
Event date:	24 October 2022
Editors:	Alam, Mehwish ; Buscaldi, Davide ; Cochez, Michael ; Osborne, Francesco ; Reforgiato Recupero, Diego
Place of publication:	Aachen, Germany
Publisher:	RWTH Aachen
ISSN:	1613-0073
Related URLs:	https://alammehwish.github.io/dl4kg2022/
Language of publication:	English
Institution:	School of Business Informatics and Mathematics > Data Science (Paulheim 2018-)
Existing license:	Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject:	004 Computer science
Abstract:	In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities, and in particular about long-tail or emerging entities, in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and, unlike prior work, does not depend on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus, extracting 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph.
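
Illustration (not part of the record): the abstract describes identifying subject entities at the token level, without entity boundaries as input. The sketch below shows one common way per-token labels can be grouped into mentions. It assumes a simple B/I/O labelling scheme, which the abstract does not confirm. The function name `decode_mentions`, the example listing item, and its labels are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def decode_mentions(tokens: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group per-token B/I/O labels into (start, end, text) mentions.

    Assumption: "B" opens a mention, "I" continues it, and any other
    label is treated as outside. `end` is exclusive.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    mentions: List[Tuple[int, int, str]] = []
    start = None  # index where the currently open mention began, if any
    for i, label in enumerate(labels):
        if label == "I" and start is not None:
            continue  # the open mention simply grows by one token
        if start is not None:
            # Any other label closes the open mention.
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        if label in ("B", "I"):
            # "B" opens a mention; a stray "I" is tolerated as an opener.
            start = i
    if start is not None:
        # Close a mention that runs to the end of the listing item.
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Hypothetical listing item (one enumeration entry) with made-up labels.
tokens = ["*", "Alice", "Smith", "(", "born", "1970", ")", ",", "singer"]
labels = ["O", "B", "I", "O", "O", "O", "O", "O", "O"]
print(decode_mentions(tokens, labels))
```

In this sketch the mention boundaries come entirely from the predicted labels, which mirrors the point in the abstract that no entity boundaries are required as input. How the paper encodes listings and labels may differ.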