Extracting research data from historical documents with eScriptorium and Python
Kamlah, Jan; Schmidt, Thomas; Shigapov, Renat
|
PDF
NFDI-Workshop-Research-Data-Maschinenindustrie-DE.pdf
- Published version
|
DOI:
|
https://doi.org/10.5281/zenodo.7373134
|
URL:
|
https://zenodo.org/record/7373134
|
URN:
|
urn:nbn:de:bsz:180-madoc-638217
|
Document type:
|
Conference presentation
|
Year of publication:
|
2022
|
Event title:
|
Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects, Use Cases from Text+, NFDI4Culture and BERD@NFDI
|
Event location:
|
Mannheim, Germany
|
Event date:
|
24-25 November 2022
|
|
Publication language:
|
English
|
Institution:
|
Central facilities > University Library (UB)
|
Existing license:
|
Creative Commons Attribution 4.0 International (CC BY 4.0)
|
Subject:
|
004 Computer science
|
Abstract:
|
This talk presents a workflow based on eScriptorium and Python for extracting research data from historical documents. eScriptorium is a relatively new transcription tool that uses the OCR engine Kraken. The software allows not only the text recognition but also the layout recognition to be adapted to the source material through training. Because research data must meet high quality requirements, this step is necessary in many cases. Using existing base models can drastically reduce the training effort. The text recognition results can then be exported in PAGE-XML format for further processing. For this purpose, the Python tool “blatt” was developed within the project. It can parse the PAGE-XML exports, sort and extract the contents using algorithms and templates, and convert them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material, and the research question. We then show how a training process that starts from a base model and needs only minimal training data can be carried out in eScriptorium, and which problems to watch for. The last section presents the Python tool “blatt” together with its underlying ideas and algorithms.
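The export-and-convert step described in the abstract can be illustrated with a minimal sketch. It does not use “blatt” or reproduce its API. It relies only on the Python standard library and assumes that each PAGE-XML `TextLine` carries a `Coords` element with a `points` attribute and a `TextEquiv/Unicode` child. The sample XML, the function names, and the sorting rule (top to bottom, then left to right) are illustrative assumptions, not the project's algorithms or templates.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a PAGE-XML export (illustration only).
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l2">
        <Coords points="100,300 900,300 900,340 100,340"/>
        <TextEquiv><Unicode>Maschinenfabrik Beispiel, Mannheim</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,240 100,240"/>
        <TextEquiv><Unicode>Verzeichnis der Firmen</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'points' string such as '1,2 3,4' into [(1, 2), (3, 4)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def extract_lines(page_xml):
    """Collect text lines with their top-left position, sorted in reading order."""
    root = ET.fromstring(page_xml)
    rows = []
    # '{*}' matches any namespace, so the PAGE schema version does not matter
    # (requires Python 3.8 or later).
    for line in root.findall(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        text = line.find("{*}TextEquiv/{*}Unicode")
        if coords is None or text is None:
            continue
        pts = parse_points(coords.get("points", ""))
        if not pts:
            continue
        rows.append({
            "line_id": line.get("id", ""),
            "x": min(x for x, _ in pts),
            "y": min(y for _, y in pts),
            "text": (text.text or "").strip(),
        })
    # Assumed reading order: top to bottom, then left to right.
    rows.sort(key=lambda r: (r["y"], r["x"]))
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["line_id", "x", "y", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_lines(SAMPLE_PAGE_XML)))
```

Sorting by the top-left corner of each line polygon is the simplest possible ordering; multi-column or tabular layouts would need region-aware rules of the kind the abstract attributes to templates.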
|
| This document is provided by the publication server of the Mannheim University Library. |