Extracting research data from historical documents with eScriptorium and Python


Kamlah, Jan ; Schmidt, Thomas ; Shigapov, Renat


PDF: NFDI-Workshop-Research-Data-Maschinenindustrie-DE.pdf (Published, 2 MB)

DOI: https://doi.org/10.5281/zenodo.7373134
URL: https://zenodo.org/record/7373134
URN: urn:nbn:de:bsz:180-madoc-638217
Document Type: Conference presentation
Year of publication: 2022
Conference title: Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects, Use Cases from Text+, NFDI4Culture and BERD@NFDI
Location of the conference venue: Mannheim, Germany
Date of the conference: 24.-25.11.2022
Publication language: English
Institution: Zentrale Einrichtungen > University Library
Pre-existing license: Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject: 004 Computer science, internet
Abstract: This talk presents a workflow based on eScriptorium and Python for extracting research data from historical documents. eScriptorium is a relatively young transcription tool that uses the OCR engine Kraken. It allows not only the text recognition but also the layout recognition to be adapted to the source material through training. Because research data must meet high quality requirements, this step is necessary in many cases. Starting from existing base models drastically reduces the training effort. The text recognition results can then be exported in PAGE-XML format for further processing. For this purpose, the Python tool “blatt” was developed within the project. It parses the PAGE-XML exports, sorts and extracts the contents using algorithms and templates, and converts them into a structured table format such as CSV. The first part of the presentation gives a short introduction to the topic, the source material, and the research question. We then show how a training process based on a base model with minimal training data can be carried out in eScriptorium and which problems to watch for. The last section presents the Python tool “blatt” together with its underlying ideas and algorithms.
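
The parse, sort, and export steps described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. This is not the API of “blatt”, whose interface is not described here. The function names, the embedded sample page, and the simple top-to-bottom sort are assumptions made for illustration only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened PAGE-XML export used only for illustration.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l2">
        <Coords points="100,300 900,300 900,340 100,340"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,240 100,240"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text):
    """Return one dict per TextLine, sorted top-to-bottom, then left-to-right."""
    root = ET.fromstring(xml_text)
    rows = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        points, text = [], ""
        # Look only at direct children so nested Word elements are ignored.
        for child in line:
            name = local_name(child.tag)
            if name == "Coords":
                points = [
                    tuple(int(float(v)) for v in pair.split(","))
                    for pair in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        if not points:
            continue  # a line without coordinates cannot be sorted by position
        rows.append({
            "id": line.get("id", ""),
            "x": min(p[0] for p in points),
            "y": min(p[1] for p in points),
            "text": text,
        })
    # Assumption: a plain positional sort; real layouts may need templates.
    rows.sort(key=lambda r: (r["y"], r["x"]))
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "x", "y", "text"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_lines(SAMPLE_PAGE_XML)))
```

A positional sort like this is only adequate for single-column pages. Multi-column or tabular sources would need region-aware ordering, which is presumably where the templates mentioned in the abstract come in.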




This document is provided by the publication server of the Mannheim University Library.



