The internet never forgets: A four-step scraping tutorial, codebase, and database for longitudinal organizational website data


Haans, Richard F.J. ; Mertens, Marc J.


PDF: haans-mertens-2024-the-internet-never-forgets-a-four-step-scraping-tutorial-codebase-and-database-for-longitudinal.pdf (published version, 1MB)

DOI: https://doi.org/10.1177/10944281241284941
URL: https://journals.sagepub.com/doi/10.1177/109442812...
Additional URL: https://www.researchgate.net/publication/385550928...
URN: urn:nbn:de:bsz:180-madoc-684725
Document Type: Article
Year of online publication: 2024
Date: 4 November 2024
Journal or publication series: Organizational Research Methods : ORM
Volume: tba
Issue number: tba
Page range: 1-29
Place of publication: Thousand Oaks, CA
Publishing house: Sage
ISSN: 1094-4281, 1552-7425
Publication language: English
Institution: Business School > Strategisches u. Internat. Management (Brauer 2014-)
Pre-existing license: Creative Commons Attribution, Non-Commercial 4.0 International (CC BY-NC 4.0)
Subject: 330 Economics
Keywords (English): websites, web scraping, Wayback Machine, textual data, Compustat
Abstract: Websites represent a crucial avenue for organizations to reach customers, attract talent, and disseminate information to stakeholders. Despite their importance, strikingly little work in the domain of organization and management research has tapped into this source of longitudinal big data. In this paper, we highlight the unique nature and profound potential of longitudinal website data and present novel open-source code- and databases that make these data accessible. Specifically, our codebase offers a general-purpose setup, building on four central steps to scrape historical websites using the Wayback Machine. Our open-access CompuCrawl database was built using this four-step approach. It contains websites of North American firms in the Compustat database between 1996 and 2020—covering 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages. We describe the coverage of our database and illustrate its use by applying word-embedding models to reveal the evolving meaning of the concept of “sustainability” over time. Finally, we outline several avenues for future research enabled by our step-by-step longitudinal web scraping approach and our CompuCrawl database.


Economic Sustainability, Environmental Sustainability, Social Sustainability


This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.

This publication has so far appeared online only.



