Developmentally plausible multimodal language models are highly modular


Klerings, Alina ; Bartelt, Christian ; Mueller, Aaron


PDF: 2024.conll-babylm.10.pdf - Published version (3 MB)

URL: https://aclanthology.org/2024.conll-babylm.10/
URN: urn:nbn:de:bsz:180-madoc-693784
Document Type: Conference or workshop publication
Year of publication: 2024
Book title: The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning : proceedings of the Second BabyLM Challenge : November 15-16, 2024
Page range: 118-139
Conference title: CoNLL 2024, The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Location of the conference venue: Miami, FL
Date of the conference: November 15-16, 2024
Editors: Hu, Michael Y. ; Mueller, Aaron ; Ross, Candace ; Williams, Adina ; Linzen, Tal ; Zhuang, Chengxu ; Choshen, Leshem ; Cotterell, Ryan ; Warstadt, Alex ; Wilcox, Ethan Gotlieb
Place of publication: Miami, FL, USA
Publishing house: Association for Computational Linguistics
ISBN: 979-8-89176-222-0
Publication language: English
Institution: Non-faculty institutions > Institut für Enterprise Systems (InES)
Pre-existing license: Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject: 004 Computer science, internet
Abstract: Large language models demonstrate emergent modularity, where functionally specialized components and circuits arise to handle specific tasks or task formats. If similar modules arise in models trained on more cognitively plausible datasets, it could inform debates surrounding what kinds of knowledge would be learnable given more human-like language learning signals. In this paper, we describe a multimodal vision-language model submitted to the BabyLM Challenge. Our model achieves similar performance to the best-performing architectures from last year, though visual information does not improve performance on text-only tasks over text-only models (in accordance with prior findings). To better understand how the model processes the evaluation tasks of the BabyLM Challenge, we leverage causal interpretability methods to locate the neurons that contribute to the model's final decisions. We find that the models we train are highly modular: distinct components arise to process related tasks. Furthermore, on text-and-image tasks, adding or removing visual inputs causes the model to use distinct components to process the same textual inputs. This suggests that modal and task-specific specialization is efficiently learned, and that a high degree of functional specialization arises in even small-scale language models.
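The neuron localization described in the abstract can be illustrated with a minimal zero-ablation sketch. This is a toy NumPy network standing in for one MLP block, not the paper's actual model or evaluation setup; the idea is only that a neuron's causal contribution can be estimated by zeroing it and measuring the change in the output logit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a single transformer MLP block.
W1 = rng.normal(size=(8, 4))   # input -> hidden
W2 = rng.normal(size=(4, 2))   # hidden -> logits

def forward(x, ablate=None):
    """Run the toy network, optionally zero-ablating one hidden neuron."""
    h = np.maximum(x @ W1, 0.0)        # ReLU hidden activations
    if ablate is not None:
        h = h.copy()
        h[ablate] = 0.0                # causal intervention: zero the neuron
    return h @ W2                      # logits

x = rng.normal(size=8)
base = forward(x)

# Causal effect of each neuron = change in the decision logit when ablated.
effects = [abs(forward(x, ablate=i)[0] - base[0]) for i in range(4)]
ranking = np.argsort(effects)[::-1]    # neurons ranked by contribution
```

Neurons with large ablation effects on one task but not another would be evidence of the task-specific modularity the paper reports.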




This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.





