New data-driven approaches to text simplification
Štajner, Sanja
URL:
|
http://wlv.openrepository.com/wlv/handle/2436/5544...
|
URN:
|
http://hdl.handle.net/2436/554413
|
Document type:
|
Doctoral dissertation
|
Year of publication:
|
2015
|
Place of publication:
|
Wolverhampton, United Kingdom
|
University:
|
University of Wolverhampton
|
Reviewer:
|
Mitkov, Ruslan
|
Date of oral examination:
|
25 March 2015
|
Language of publication:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Practical Computer Science II: Artificial Intelligence (Stuckenschmidt 2009-); School of Business Informatics and Mathematics > Semantic Web (Junior Professorship) (Ponzetto 2013-2015)
|
Subject:
|
004 Computer science
|
Controlled keywords (SWD):
|
Text understanding system, Automatic language analysis, Natural language
|
Free keywords (English):
|
text simplification, natural language processing
|
Abstract:
|
Many texts we encounter in our everyday lives are lexically and syntactically very complex.
This makes them difficult to understand for people with intellectual or reading
impairments, and difficult for various natural language processing systems to process.
This motivates the need for text simplification (TS), which transforms texts into their
simpler variants. Given that this is still a relatively new research area, many challenges
remain. The focus of this thesis is on better understanding the current problems
in automatic text simplification (ATS) and proposing new data-driven approaches
to solving them.
We propose methods for learning sentence splitting and deletion decisions, built
upon parallel corpora of original and manually simplified Spanish texts, which outperform
the existing similar systems. Our experiments in adapting those methods to
different text genres and target populations yield promising results, thus offering one
possible solution for dealing with the scarcity of parallel corpora for text simplification
aimed at specific target populations, which is currently one of the main issues in ATS.
The results of our extensive analysis of the phrase-based statistical machine translation
(PB-SMT) approach to ATS reject the widespread assumption that the success
of that approach largely depends on the size of the training and development datasets.
They indicate more influential factors for the success of the PB-SMT approach to ATS,
and reveal some important differences between cross-lingual MT and the monolingual
MT used in ATS.
Our event-based system for simplifying news stories in English (EventSimplify)
overcomes some of the main problems in ATS. It requires neither a large number
of handcrafted simplification rules nor parallel data, and it performs significant content
reduction. The automatic and human evaluations conducted show that it produces grammatical
text and increases readability, preserving and simplifying relevant content and
reducing irrelevant content.
Finally, this thesis addresses another important issue in TS: how to automatically
evaluate the performance of TS systems given that access to the target users
might be difficult. Our experiments indicate that existing readability metrics can successfully
be used for this task when enriched with human evaluation of grammaticality
and preservation of meaning.
|
| This record was not published during employment at the University of Mannheim; it is an external publication. |