Establishing standards for human-annotated samples applied in supervised machine learning - evidence from a Monte Carlo simulation


Oschatz, Corinna ; Sältzer, Marius ; Stier, Sebastian


PDF: 2192-4007-2023-4-289.pdf - Published (1MB)

DOI: https://doi.org/10.5771/2192-4007-2023-4-289
URL: https://www.nomos-elibrary.de/10.5771/2192-4007-20...
Additional URL: https://www.researchgate.net/publication/376432118...
URN: urn:nbn:de:bsz:180-madoc-679226
Document Type: Article
Year of publication: 2023
The title of a journal, publication series: Studies in Communication and Media : SCM
Volume: 12
Issue number: 4
Page range: 289-304
Place of publication: Baden-Baden
Publishing house: Nomos
ISSN: 2192-4007
Publication language: English
Institution: School of Social Sciences > Computational Social Science (Stier 2023-)
Pre-existing license: Creative Commons Attribution, Non-Commercial, No Derivatives 4.0 International (CC BY-NC-ND 4.0)
Subject: 070 News media, journalism, publishing
300 Social sciences, sociology, anthropology
Keywords (English): supervised machine learning, prediction accuracy, impact of coder errors, impact of curation strategies, Monte Carlo simulation
Abstract: Automated content analyses have become a popular tool in communication science. While standard procedures for manual content analysis were established decades ago, it remains an open question whether these standards are sufficient for the use of human-annotated data to train supervised machine learning models. Scholars typically follow a two-stage procedure to obtain high prediction accuracy: manual content analysis followed by model training with human-annotated samples. We argue that a loss in prediction accuracy in supervised machine learning builds up over this two-stage procedure. In a Monte Carlo simulation, we tested (1) human coder errors (random, individual systematic, joint systematic) and (2) curation strategies for human-annotated datasets (one coder per document, majority rule, full agreement) as two sequential sources of accuracy loss in automated content analysis. Coder agreement prior to conducting manual content analysis remains an important quality criterion for automated content analyses. A Krippendorff's alpha of at least 0.8 is desirable to achieve satisfactory prediction results after machine learning. Systematic errors (individual and joint) must be avoided at all costs. The best training samples were obtained using one coder per document or the majority-coding curation strategy. Ultimately, this paper can help researchers produce trustworthy predictions when combining manual coding and machine learning.
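The abstract's random-error condition and curation strategies can be illustrated with a minimal simulation sketch. This is an illustrative simplification (binary labels, independent coders, a fixed per-document error rate), not the authors' simulation code; all function names and parameters here are hypothetical.

```python
import random

def simulate_coders(true_labels, n_coders=3, p_error=0.1, rng=None):
    """Random-error condition: each coder independently flips the
    true binary label with probability p_error."""
    rng = rng or random.Random(0)
    return [[lab if rng.random() > p_error else 1 - lab for lab in true_labels]
            for _ in range(n_coders)]

def curate_majority(codings):
    """Majority rule: a document's label is the one most coders assigned."""
    n = len(codings)
    return [int(sum(col) > n / 2) for col in zip(*codings)]

def curate_full_agreement(codings):
    """Full agreement: keep a label only where all coders agree;
    disagreements are dropped (None)."""
    return [col[0] if len(set(col)) == 1 else None for col in zip(*codings)]

def label_accuracy(curated, true_labels):
    """Share of retained training labels that match the true label."""
    pairs = [(c, t) for c, t in zip(curated, true_labels) if c is not None]
    return sum(c == t for c, t in pairs) / len(pairs)

rng = random.Random(42)
truth = [rng.randint(0, 1) for _ in range(10_000)]
codings = simulate_coders(truth, n_coders=3, p_error=0.15, rng=rng)

acc_single = label_accuracy(codings[0], truth)                  # one coder per document
acc_majority = label_accuracy(curate_majority(codings), truth)  # majority rule
acc_full = label_accuracy(curate_full_agreement(codings), truth)
```

Under purely random coder error, majority voting over three coders yields cleaner training labels than a single coder, and full agreement yields the cleanest labels of all, but at the cost of discarding every document on which the coders disagree, which shrinks the training sample.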




This entry is part of the university bibliography.

The document is provided by the publication server of the University Library Mannheim.



