Establishing standards for human-annotated samples applied in supervised machine learning - evidence from a Monte Carlo simulation
Oschatz, Corinna; Sältzer, Marius; Stier, Sebastian
DOI: https://doi.org/10.5771/2192-4007-2023-4-289
URL: https://www.nomos-elibrary.de/10.5771/2192-4007-20...
Additional URL: https://www.researchgate.net/publication/376432118...
URN: urn:nbn:de:bsz:180-madoc-679226
Document type: Journal article
Year of publication: 2023
Journal or series title: Studies in Communication and Media : SCM
Volume: 12
Issue: 4
Page range: 289-304
Place of publication: Baden-Baden
Publisher: Nomos
ISSN: 2192-4007
Language of publication: English
Institution: Fakultät für Sozialwissenschaften > Computational Social Science (Stier 2023-)
Existing license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Subject: 070 News media, journalism, publishing; 300 Social sciences, sociology, anthropology
Keywords (English): supervised machine learning, prediction accuracy, impact of coder errors, impact of curation strategies, Monte Carlo simulation
Abstract:
Automated content analyses have become a popular tool in communication science. While standard procedures for manual content analysis were established decades ago, it remains an open question whether these standards are sufficient for the use of human-annotated data to train supervised machine learning models. Scholars typically follow a two-stage procedure to obtain high prediction accuracy: manual content analysis followed by model training with human-annotated samples. We argue that a loss in prediction accuracy in supervised machine learning builds up over this two-stage procedure. In a Monte Carlo simulation, we tested (1) human coder errors (random, individual systematic, joint systematic) and (2) curation strategies for human-annotated datasets (one coder per document, majority rule, full agreement) as two sequential sources of accuracy loss of automated content analysis. Coder agreement prior to conducting manual content analysis remains an important quality criterion for automated content analyses. A Krippendorff’s alpha of at least 0.8 is desirable to achieve satisfying prediction results after machine learning. Systematic errors (individual and joint) must be avoided at all costs. The best training samples were obtained using one coder per document or the majority coding curation strategy. Ultimately, this paper can help researchers produce trustworthy predictions when combining manual coding and machine learning.
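The two-stage setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' simulation design: it covers only random coder errors on a binary category, three coders who all code every document, and the quality of the curated labels (no classifier is trained). All function names and parameter values (`simulate_codings`, `curate`, an error rate of 0.1, 2,000 documents) are hypothetical and chosen for illustration. Krippendorff's alpha is computed for nominal data without missing values.

```python
import random
from collections import Counter
from itertools import permutations


def simulate_codings(n_docs, n_coders, error_rate, rng):
    """Draw binary true labels; each coder flips a label with probability error_rate."""
    truth = [rng.randint(0, 1) for _ in range(n_docs)]
    codings = [
        [t if rng.random() >= error_rate else 1 - t for _ in range(n_coders)]
        for t in truth
    ]
    return truth, codings


def nominal_alpha(codings):
    """Krippendorff's alpha for nominal data, assuming every coder coded every document."""
    coincidences = Counter()
    for values in codings:
        m = len(values)
        # Each ordered pair of codes from two different coders counts 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    # Undefined (division by zero) if only one category occurs in the data.
    return 1 - observed / expected


def curate(codings, strategy, rng):
    """Build a training sample of (document index, label) pairs."""
    sample = []
    for i, values in enumerate(codings):
        if strategy == "one_coder":
            # Use the code of a single randomly assigned coder.
            sample.append((i, rng.choice(values)))
        elif strategy == "majority":
            # No ties occur with binary labels and an odd number of coders.
            sample.append((i, Counter(values).most_common(1)[0][0]))
        elif strategy == "full_agreement":
            # Keep only documents on which all coders agree.
            if len(set(values)) == 1:
                sample.append((i, values[0]))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return sample


def label_accuracy(sample, truth):
    """Share of curated labels that match the true labels."""
    return sum(label == truth[i] for i, label in sample) / len(sample)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    truth, codings = simulate_codings(2000, 3, 0.1, rng)
    print(f"alpha = {nominal_alpha(codings):.3f}")
    for strategy in ("one_coder", "majority", "full_agreement"):
        sample = curate(codings, strategy, rng)
        print(strategy, len(sample), f"{label_accuracy(sample, truth):.3f}")
```

Repeating the run over many seeds and error rates would turn this into a Monte Carlo experiment. Systematic errors could be sketched by letting coders flip labels only for one class or only for a shared subset of documents, which again is an assumption about how such errors might be modelled rather than the procedure used in the paper.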
This entry is part of the university bibliography.
The document is provided by the publication server of the Mannheim University Library.