Establishing standards for human-annotated samples applied in supervised machine learning - evidence from a Monte Carlo simulation

Oschatz, Corinna ; Sältzer, Marius ; Stier, Sebastian

PDF:	2192-4007-2023-4-289.pdf - Published version (1MB)

DOI:	https://doi.org/10.5771/2192-4007-2023-4-289
URL:	https://www.nomos-elibrary.de/10.5771/2192-4007-20...
Additional URL:	https://www.researchgate.net/publication/376432118...
URN:	urn:nbn:de:bsz:180-madoc-679226
Document type:	Journal article
Year of publication:	2023
Journal or series title:	Studies in Communication and Media : SCM
Volume:	12
Issue:	4
Page range:	289-304
Place of publication:	Baden-Baden
Publisher:	Nomos
ISSN:	2192-4007
Language of publication:	English
Institution:	School of Social Sciences > Computational Social Science (Stier 2023-)
Existing license:	Creative Commons Attribution, Non-Commercial, No Derivatives 4.0 International (CC BY-NC-ND 4.0)
Subject:	070 News media, journalism, publishing; 300 Social sciences, sociology, anthropology
Keywords (English):	supervised machine learning, prediction accuracy, impact of coder errors, impact of curation strategies, Monte Carlo simulation
Abstract:	Automated content analyses have become a popular tool in communication science. While standard procedures for manual content analysis were established decades ago, it remains an open question whether these standards are sufficient for the use of human-annotated data to train supervised machine learning models. Scholars typically follow a two-stage procedure to obtain high prediction accuracy: manual content analysis followed by model training with human-annotated samples. We argue that a loss in prediction accuracy in supervised machine learning builds up over this two-stage procedure. In a Monte Carlo simulation, we tested (1) human coder errors (random, individual systematic, joint systematic) and (2) curation strategies for human-annotated datasets (one coder per document, majority rule, full agreement) as two sequential sources of accuracy loss of automated content analysis. Coder agreement prior to conducting manual content analysis remains an important quality criterion for automated content analyses. A Krippendorff’s alpha of at least 0.8 is desirable to achieve satisfying prediction results after machine learning. Systematic errors (individual and joint) must be avoided at all costs. The best training samples were obtained using one coder per document or the majority coding curation strategy. Ultimately, this paper can help researchers produce trustworthy predictions when combining manual coding and machine learning.
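
The two-stage setup described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation: it assumes binary labels, covers only random coder error (no individual or joint systematic error), and measures the quality of the curated training labels rather than the accuracy of a trained model. All names and parameter values (`simulate_coders`, `curate`, `run_simulation`, `error_rate`, three coders) are hypothetical choices made for illustration.

```python
import random

STRATEGIES = ("one_coder", "majority", "full_agreement")


def simulate_coders(truth, n_coders, error_rate, rng):
    """Each coder flips the true binary label independently with
    probability error_rate (random error only, an assumption)."""
    return [
        [(1 - y) if rng.random() < error_rate else y for y in truth]
        for _ in range(n_coders)
    ]


def curate(codings, strategy, rng):
    """Build a training set {doc_index: label} under one curation strategy."""
    curated = {}
    for i in range(len(codings[0])):
        votes = [coder[i] for coder in codings]
        if strategy == "one_coder":
            # Keep the label of a single randomly chosen coder.
            curated[i] = votes[rng.randrange(len(votes))]
        elif strategy == "majority":
            # Assumes an odd number of coders, so ties cannot occur.
            curated[i] = int(2 * sum(votes) > len(votes))
        elif strategy == "full_agreement":
            # Documents on which coders disagree are dropped.
            if len(set(votes)) == 1:
                curated[i] = votes[0]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return curated


def run_simulation(n_runs=200, n_docs=500, n_coders=3, error_rate=0.1, seed=0):
    """Return {strategy: (mean label accuracy, mean share of documents kept)}."""
    rng = random.Random(seed)
    totals = {s: [0.0, 0.0, 0] for s in STRATEGIES}
    for _ in range(n_runs):
        truth = [rng.randint(0, 1) for _ in range(n_docs)]
        codings = simulate_coders(truth, n_coders, error_rate, rng)
        for s in STRATEGIES:
            curated = curate(codings, s, rng)
            if not curated:
                continue  # skip runs where no document survived curation
            correct = sum(curated[i] == truth[i] for i in curated)
            totals[s][0] += correct / len(curated)
            totals[s][1] += len(curated) / n_docs
            totals[s][2] += 1
    return {
        s: (acc / n, kept / n) if n else (float("nan"), 0.0)
        for s, (acc, kept, n) in totals.items()
    }


if __name__ == "__main__":
    for strategy, (acc, kept) in run_simulation().items():
        print(f"{strategy:15s} label accuracy={acc:.3f} documents kept={kept:.3f}")
```

Under these assumptions the sketch only shows the trade-off between label quality and sample size: full agreement yields cleaner labels but discards documents, which may be one reason a stricter curation rule does not automatically produce the best training sample. Extending it to systematic errors or to an actual classifier would require further assumptions not stated in the abstract.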

	This entry is part of the university bibliography.
	The document is provided by the publication server of the Mannheim University Library.
