Extracting research data from historical documents with eScriptorium and Python

Kamlah, Jan ; Schmidt, Thomas ; Shigapov, Renat

DOI:	https://doi.org/10.5281/zenodo.7373134
URL:	https://zenodo.org/record/7373134
URN:	urn:nbn:de:bsz:180-madoc-638217
Document type:	Conference presentation
Year of publication:	2022
Event title:	Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects, Use Cases from Text+, NFDI4Culture and BERD@NFDI
Event location:	Mannheim, Germany
Event date:	24-25 November 2022
Related URLs:	https://www.berd-nfdi.de/focused-tutoria...
Publication language:	English
Institution:	Central facilities > University Library (UB)
Existing license:	Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject area:	004 Computer science
Abstract:	This talk presents a workflow based on eScriptorium and Python for extracting research data from historical documents. eScriptorium is a relatively young transcription tool that uses the OCR engine Kraken. The software makes it possible to adapt not only the text recognition but also the layout recognition to the source material through training. Because research data must meet high quality requirements, this step is necessary in many cases. Starting from existing base models drastically reduces the training effort. The text recognition results can then be exported in PAGE-XML format for further processing. For this purpose, the Python tool “blatt” was developed within the project. It parses the PAGE-XML exports, sorts and extracts their contents using algorithms and templates, and converts them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material and the research question. We then show how to train from a base model with minimal training data in eScriptorium, and which problems to watch for. The last section presents the Python tool “blatt” together with its underlying ideas and algorithms.
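
The PAGE-XML to CSV step described in the abstract can be illustrated with a minimal sketch. It does not use “blatt” and does not reflect its API. It relies only on the Python standard library and the standard PAGE element names (TextRegion, TextLine, Coords, TextEquiv/Unicode). The sample document, the function names parse_lines and to_csv, the CSV column names and the simple top-to-bottom, left-to-right sort are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-XML sample; a real eScriptorium export
# contains more attributes and elements. Lines are deliberately out of order.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l2">
        <Coords points="100,300 900,300 900,340 100,340"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,240 100,240"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

FIELDS = ["region", "line", "x", "y", "text"]


def parse_lines(xml_text):
    """Return one dict per TextLine, sorted top-to-bottom, then left-to-right."""
    root = ET.fromstring(xml_text)
    # Read the namespace from the root tag so other PAGE schema versions work too.
    ns = {"pc": root.tag[1:root.tag.index("}")]}
    rows = []
    for region in root.iterfind(".//pc:TextRegion", ns):
        for line in region.iterfind("pc:TextLine", ns):
            coords = line.find("pc:Coords", ns)
            # "points" is a space-separated list of "x,y" pairs (integers assumed).
            points = [
                tuple(int(v) for v in pair.split(","))
                for pair in coords.get("points").split()
            ]
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
            has_text = unicode_el is not None and unicode_el.text
            rows.append({
                "region": region.get("id"),
                "line": line.get("id"),
                "x": min(x for x, _ in points),  # left edge of the line polygon
                "y": min(y for _, y in points),  # top edge of the line polygon
                "text": unicode_el.text if has_text else "",
            })
    # Naive reading order (an assumption): sort by top edge, then by left edge.
    rows.sort(key=lambda r: (r["y"], r["x"]))
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_lines(SAMPLE_PAGE_XML)
csv_text = to_csv(rows)
print(csv_text)
```

A purely positional sort like this one breaks down on multi-column or tabular layouts, which is presumably where the templates mentioned in the abstract come in.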