Classification of Named Entities in a large multilingual resource using the Wikipedia category system


Knopp, Johannes



URL: http://www.cl.uni-heidelberg.de/~knopp/pub/thesis....
Document type: Thesis, Magister
Year of publication: 2010
Place of publication: Heidelberg
University: Universität Heidelberg
Reviewers: Frank, Anette; Riezler, Stefan
Date of oral examination: 25 January 2010
Publication language: English
Institution: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Practical Computer Science II: Artificial Intelligence (Stuckenschmidt 2009-)
Subject: 004 Computer science
Controlled keywords (SWD): Computerlinguistik (computational linguistics)
Free keywords (English): Named Entity recognition, Lexicon, lexical database, Multilinguality
Abstract: Over the last 15 years, named entities have become increasingly important in natural language processing (NLP). The information they carry is crucial for information extraction tasks such as coreference resolution or relationship extraction. As recent systems mostly rely on machine learning techniques, their performance depends on the size and quality of the available training data. This data is expensive and cumbersome to create because experts usually annotate corpora manually to achieve high quality. As a result, these data sets often lack coverage, are not up to date and are not available in many languages. To overcome this problem, semi-automatic methods for constructing resources from other available sources have been deployed. One of these sources is Wikipedia, a free, collaboratively created online encyclopedia, which has been explored for several NLP tasks in recent years. Although it is not created by linguists, meta information about articles, such as translations, disambiguations or categorisations, is available. In addition, Wikipedia is growing fast: it is available in more than 260 languages and the English version contains more than three million articles. Its structural features, size and multilingual availability provide a suitable base from which to derive specialised resources that can be used as training data for machine learning. One of them is HeiNER, the Heidelberg Named Entity Resource (Wentland et al., 2008). HeiNER contains a huge multilingual collection of named entities, including their contexts taken from Wikipedia. However, it has one disadvantage: it does not know the type of its named entities. Hence, the idea of this thesis is to add the named entity types Person, Organisation, Location and Miscellaneous to HeiNER’s entries. Wikipedia’s category system is used to solve this problem. We identify categories that unambiguously match a named entity type in order to automatically classify all articles found in them. Counting the categories of these newly classified articles yields named entity type vectors, which are used to classify the still unlabelled named entities in HeiNER.
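
The two-step procedure outlined at the end of the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy articles, the seed categories, the type abbreviations (PER, LOC) and all function names are hypothetical. Summing the type vectors of an entity's categories and taking the most frequent type is also an assumption, since the abstract does not say how the vectors are combined.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: article title -> Wikipedia categories.
ARTICLE_CATEGORIES = {
    "Angela Merkel": ["Living people", "German politicians"],
    "Helmut Kohl": ["1930 births", "German politicians"],
    "Heidelberg": ["Cities in Germany", "University towns"],
    "Mannheim": ["Cities in Germany", "University towns"],
    "Gerhard Schroeder": ["German politicians"],  # no seed category
    "Marburg": ["University towns"],              # no seed category
}

# Hypothetical seed categories assumed to match exactly one type.
SEED_CATEGORIES = {
    "Living people": "PER",
    "1930 births": "PER",
    "Cities in Germany": "LOC",
}


def label_from_seeds(article_categories, seed_categories):
    """Step 1: label articles whose seed categories agree on one type."""
    labels = {}
    for article, categories in article_categories.items():
        types = {seed_categories[c] for c in categories if c in seed_categories}
        if len(types) == 1:
            labels[article] = types.pop()
    return labels


def build_type_vectors(article_categories, labels):
    """Step 2: for every category, count the types of its labelled articles."""
    vectors = defaultdict(Counter)
    for article, ne_type in labels.items():
        for category in article_categories[article]:
            vectors[category][ne_type] += 1
    return vectors


def classify(categories, vectors):
    """Sum the type vectors of the given categories; return the top type.

    Returns None when none of the categories has a type vector.
    """
    total = Counter()
    for category in categories:
        total.update(vectors.get(category, Counter()))
    if not total:
        return None
    return total.most_common(1)[0][0]


labels = label_from_seeds(ARTICLE_CATEGORIES, SEED_CATEGORIES)
vectors = build_type_vectors(ARTICLE_CATEGORIES, labels)
for article, categories in ARTICLE_CATEGORIES.items():
    if article not in labels:
        print(article, classify(categories, vectors))
```

In this toy setting, articles without a seed category inherit a type from the non-seed categories they share with already labelled articles.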




This record was not published during employment at the Universität Mannheim; it is an external publication.



