Classification of Named Entities in a large multilingual resource using the Wikipedia category system
Knopp, Johannes
URL:
|
http://www.cl.uni-heidelberg.de/~knopp/pub/thesis....
|
Document Type:
|
Final Thesis, Magister
|
Year of publication:
|
2010
|
Place of publication:
|
Heidelberg
|
University:
|
Universität Heidelberg
|
Evaluator:
|
Frank, Anette; Riezler, Stefan
|
Date of oral examination:
|
25 January 2010
|
Publication language:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Practical Computer Science II: Artificial Intelligence (Stuckenschmidt 2009-)
|
Subject:
|
004 Computer science, internet
|
Subject headings (SWD):
|
Computerlinguistik
|
Keywords (English):
|
Named Entity recognition, Lexicon, lexical database, Multilinguality
|
Abstract:
|
Over the last 15 years, named entities have become increasingly important in natural language processing (NLP). Information about them is crucial for information extraction tasks such as coreference resolution and relationship extraction. As recent systems mostly rely on machine learning techniques, their performance depends on the size and quality of the available training data. This data is expensive and cumbersome to create because experts usually annotate corpora manually to achieve high quality. As a result, these data sets often lack coverage, are out of date and are unavailable in many languages.
To overcome this problem, semi-automatic methods have been used to construct resources from other available sources. One of these sources is Wikipedia, a free, collaboratively created online encyclopedia, which has been explored for several NLP tasks in recent years. Although it is not created by linguists, meta information about articles, such as translations, disambiguations and categorisations, is available. In addition, Wikipedia is growing fast: it is available in more than 260 languages, and the English version contains more than three million articles.
Its structural features, size and multilingual availability make it a suitable base for deriving specialised resources that can be used as training data for machine learning. One of them is HeiNER, the Heidelberg Named Entity Resource (Wentland et al., 2008). HeiNER contains a huge multilingual collection of named entities, including their contexts, taken from Wikipedia. However, it has one disadvantage: it does not know the types of its named entities. Hence, the idea of this thesis is to add the named entity types Person, Organisation, Location and Miscellaneous to HeiNER's entries.
Wikipedia's category system is used to solve this problem. We identify categories that unambiguously match a named entity type in order to automatically classify all articles found in them. Counting the categories of these newly classified articles yields named entity type vectors, which are then used to classify the as yet unlabelled named entities in HeiNER.
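The classification idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the type labels, the toy article and category data, and the decision rule (summing the type vectors of an entity's categories and picking the most frequent type) are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Assumed short labels for Person, Organisation, Location, Miscellaneous.
NE_TYPES = ("PER", "ORG", "LOC", "MISC")


def build_type_vectors(article_categories, seed_categories):
    """Label articles via seed categories and count category/type co-occurrences.

    article_categories: dict mapping article title -> set of category names
    seed_categories:    dict mapping category name -> NE type (hypothetical
                        hand-picked categories that match exactly one type)
    """
    # Step 1: an article is labelled if its seed categories agree on one type.
    labelled = {}
    for article, cats in article_categories.items():
        types = {seed_categories[c] for c in cats if c in seed_categories}
        if len(types) == 1:
            labelled[article] = next(iter(types))

    # Step 2: for every category of a labelled article, count the article's type.
    vectors = defaultdict(Counter)
    for article, ne_type in labelled.items():
        for cat in article_categories[article]:
            vectors[cat][ne_type] += 1
    return labelled, vectors


def classify(categories, vectors):
    """Sum the type vectors of the given categories and return the top type.

    Returns None when none of the categories has been seen before.
    Ties are broken by the order of NE_TYPES (an arbitrary assumption).
    """
    total = Counter()
    for cat in categories:
        total.update(vectors.get(cat, {}))
    if not total:
        return None
    return max(NE_TYPES, key=lambda t: total[t])


# Toy data, invented for this example.
article_categories = {
    "Albert Einstein": {"German physicists", "1879 births"},
    "Max Planck": {"German physicists", "1858 births"},
    "Heidelberg": {"Cities in Germany", "University towns in Germany"},
}
seed_categories = {"German physicists": "PER", "Cities in Germany": "LOC"}

labelled, vectors = build_type_vectors(article_categories, seed_categories)
guess_person = classify({"German physicists", "1901 births"}, vectors)
guess_place = classify({"University towns in Germany"}, vectors)
guess_unknown = classify({"Some unseen category"}, vectors)
```

In this toy run, non-seed categories such as "University towns in Germany" acquire a type count only because they co-occur with a seed category on a labelled article, which is what lets an entity with no seed category still receive a type.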
|
| This record was not published during employment at the University of Mannheim; it is an external publication. |