Classification of Named Entities in a large multilingual resource using the Wikipedia category system

Knopp, Johannes

Document Type: Final Thesis, Magister
Year of publication: 2010
Place of publication: Heidelberg
University: Universität Heidelberg
Evaluator: Frank, Anette; Riezler, Stefan
Date of oral examination: 25 January 2010
Publication language: English
Institution: School of Business Informatics and Mathematics > Praktische Informatik II (Stuckenschmidt 2009-)
Subject: 004 Computer science, internet
Subject headings (SWD): Computerlinguistik
Keywords (English): Named Entity recognition, Lexicon, lexical database, Multilinguality
Abstract: Over the last 15 years, named entities have become increasingly important in natural language processing (NLP). Their information is crucial for information extraction tasks such as coreference resolution or relationship extraction. As recent systems mostly rely on machine learning techniques, their performance depends on the size and quality of the available training data. This data is expensive and cumbersome to create because experts usually annotate corpora manually to achieve high quality. As a result, these data sets often lack coverage, are not up to date and are not available in many languages. To overcome this problem, semi-automatic methods for resource construction from other available sources have been deployed. One of these sources is Wikipedia, a free, collaboratively created online encyclopedia, which has been explored for several NLP tasks in recent years. Although it is not created by linguists, meta information about articles, such as translations, disambiguations or categorisations, is available. In addition, Wikipedia is growing fast: it is available in more than 260 languages and contains more than three million articles in the English version. Its structural features, size and multilingual availability provide a suitable base from which to derive specialised resources that can be used as training data for machine learning. One of them is HeiNER, the Heidelberg Named Entity Resource (Wentland et al., 2008). HeiNER contains a huge multilingual collection of named entities including their contexts taken from Wikipedia. However, it has one disadvantage: it has no knowledge of the type of its named entities. Hence, the idea of this thesis is to add the named entity types Person, Organisation, Location and Miscellaneous to HeiNER’s entries. Wikipedia’s category system is utilised to solve this problem. We identify categories that unambiguously match a named entity type in order to classify all articles found in them automatically. Counting the categories of these newly classified articles results in named entity type vectors that are used to classify the as yet unlabelled named entities that are members of HeiNER.
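
The two-step idea in the abstract (seed categories label articles, then category counts yield type vectors) might look roughly like the sketch below. It is an illustration only, not the thesis implementation: the function names, the type abbreviations, the rule that an article is labelled only when its seed categories agree on one type, and the sum-then-argmax decision are all assumptions, since the abstract does not state how the vectors are combined.

```python
from collections import Counter, defaultdict

# Hypothetical abbreviations for the four types named in the abstract.
NE_TYPES = ("PER", "ORG", "LOC", "MISC")


def build_category_vectors(seed_categories, article_categories):
    """Label articles via seed categories, then count types per category.

    seed_categories: dict mapping a category name to one NE type
        (the manually chosen, unambiguous categories).
    article_categories: dict mapping an article title to its categories.
    """
    # Step 1: an article is labelled only if its seed categories agree
    # on exactly one type (assumption made for this sketch).
    labelled = {}
    for article, cats in article_categories.items():
        types = {seed_categories[c] for c in cats if c in seed_categories}
        if len(types) == 1:
            labelled[article] = types.pop()

    # Step 2: for every category of a labelled article, count how often
    # the category co-occurs with each type. This gives one type vector
    # per category.
    vectors = defaultdict(Counter)
    for article, ne_type in labelled.items():
        for cat in article_categories[article]:
            vectors[cat][ne_type] += 1
    return labelled, vectors


def classify(categories, vectors):
    """Sum the type vectors of an entity's categories and pick the
    highest-scoring type; return None when no category is known."""
    totals = Counter()
    for cat in categories:
        totals.update(vectors.get(cat, {}))
    if not totals:
        return None
    return max(NE_TYPES, key=lambda t: totals[t])


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    seeds = {"Living people": "PER", "Cities in Germany": "LOC"}
    articles = {
        "Angela Merkel": ["Living people", "German politicians"],
        "Heidelberg": ["Cities in Germany", "University towns"],
        "Helmut Kohl": ["German politicians"],
    }
    labelled, vectors = build_category_vectors(seeds, articles)
    print(labelled)
    print(classify(articles["Helmut Kohl"], vectors))
```

In this toy run, "Helmut Kohl" belongs to no seed category, so it stays unlabelled in the first step and receives its type only through the vector of the shared category "German politicians".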

This record was not published during an affiliation with the University of Mannheim; it is an external publication.
