Entity extraction from Wikipedia list pages


Heist, Nicolas; Paulheim, Heiko



DOI: https://doi.org/10.1007/978-3-030-49461-2_19
URL: https://link.springer.com/chapter/10.1007%2F978-3-...
Additional URL: https://arxiv.org/abs/2003.05146
Document Type: Conference or workshop publication
Year of publication: 2020
Book title: The Semantic Web : 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31-June 4, 2020, Proceedings
Journal or series title: Lecture Notes in Computer Science
Volume: 12123
Page range: 327-342
Conference title: ESWC 2020
Location of the conference venue: Online
Date of the conference: 31.05.-04.06.2020
Editor: Harth, Andreas
Place of publication: Berlin [et al.]
Publishing house: Springer
ISBN: 978-3-030-49460-5, 978-3-030-49462-9, 978-3-030-49461-2
ISSN: 0302-9743, 1611-3349
Publication language: English
Institution: School of Business Informatics and Mathematics > Data Science (Paulheim 2018-)
Subject: 004 Computer science, internet
Abstract: When it comes to factual knowledge about a wide range of domains, Wikipedia is often the prime source of information on the web. DBpedia and YAGO, as large cross-domain knowledge graphs, encode a subset of that knowledge by creating an entity for each page in Wikipedia, and connecting them through edges. It is well known, however, that Wikipedia-based knowledge graphs are far from complete. Especially, as Wikipedia’s policies permit pages about subjects only if they have a certain popularity, such graphs tend to lack information about less well-known entities. Information about these entities is oftentimes available in the encyclopedia, but not represented as an individual page. In this paper, we present a two-phased approach for the extraction of entities from Wikipedia’s list pages, which have proven to serve as a valuable source of information. In the first phase, we build a large taxonomy from categories and list pages with DBpedia as a backbone. With distant supervision, we extract training data for the identification of new entities in list pages that we use in the second phase to train a classification model. With this approach we extract over 700k new entities and extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
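The abstract outlines a two-phase pipeline: a taxonomy built from categories and list pages with DBpedia as a backbone provides the expected type for each list page, distant supervision uses already-linked entities to label list-page entries, and a classification model then identifies new entities among the remaining entries. The following minimal Python sketch only illustrates that general idea; the data records, feature set, example list page, and classifier choice are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): distant supervision over
# list-page entries, then a classifier that decides which entries denote
# entities of the list page's expected type. All data structures, features,
# and the classifier are hypothetical assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical entries from one Wikipedia list page, with the DBpedia type of
# the entity they link to (if any) and simple layout features.
entries = [
    {"list_page": "List of German physicists", "mention": "Albert Einstein",
     "linked_type": "Physicist", "position": 1, "has_link": True},
    {"list_page": "List of German physicists", "mention": "Physics in Germany",
     "linked_type": "Concept", "position": 2, "has_link": True},
    {"list_page": "List of German physicists", "mention": "Hildegard Stücklen",
     "linked_type": None, "position": 3, "has_link": False},  # not yet in DBpedia
]

# Expected type per list page, as the taxonomy phase would provide it.
EXPECTED_TYPE = {"List of German physicists": "Physicist"}

def distant_label(entry):
    """Phase 1 (distant supervision): entries already linked to an entity of
    the expected type are positives, entries linked to an entity of another
    type are negatives, and unlinked entries remain unlabelled."""
    if entry["linked_type"] is None:
        return None
    return int(entry["linked_type"] == EXPECTED_TYPE[entry["list_page"]])

def features(entry):
    # Toy feature set; the paper uses far richer positional and lexical features.
    return {"position": entry["position"],
            "has_link": int(entry["has_link"]),
            "mention_tokens": len(entry["mention"].split())}

labelled = [(features(e), distant_label(e)) for e in entries
            if distant_label(e) is not None]
X, y = zip(*labelled)

# Phase 2: train on the distantly labelled entries and apply the model to the
# unlabelled ones to find new entities.
vec = DictVectorizer()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(vec.fit_transform(X), list(y))

unlabelled = [features(e) for e in entries if distant_label(e) is None]
print(clf.predict(vec.transform(unlabelled)))  # 1 = likely a new entity of the expected type
```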




This entry is part of the university bibliography.



