Web-scale profiling of semantic annotations in HTML pages

Meusel, Robert

PDF
thesis_final_rm_20170322-1.pdf - Published version

URL:	https://madoc.bib.uni-mannheim.de/41884
URN:	urn:nbn:de:bsz:180-madoc-418842
Document type:	Dissertation
Year of publication:	2017
Place of publication:	Mannheim
University:	Universität Mannheim
Reviewer:	Bizer, Christian
Date of oral examination:	10 March 2017
Language of publication:	English
Institution:	Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Subject:	004 Computer science
Controlled keywords (SWD):	Data-Profiling
Free keywords (English):	Dataspace Profiling, RDFa, Microformats, Microdata, Schema.org, Crawling
Abstract:	The vision of the Semantic Web was coined by Tim Berners-Lee almost two decades ago. The idea describes an extension of the existing Web in which “information is given well-defined meaning, better enabling computers and people to work in cooperation” [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one realization of this vision, and large numbers of web sites have adopted them in recent years. Semantic annotations are integrated into the code of HTML pages using one of the three markup languages Microformats, RDFa, or Microdata. Major consumers of semantic annotations are the search engine companies Bing, Google, Yahoo!, and Yandex. They use semantic annotations from crawled web pages to enrich the presentation of search results and to complement their knowledge bases. However, outside the large search engine companies, little is known about the deployment of semantic annotations: How many web sites deploy semantic annotations? What are the topics covered by semantic annotations? How detailed are the annotations? Do web sites use semantic annotations correctly? Are semantic annotations useful to parties other than the search engine companies? And how can semantic annotations be gathered from the Web in that case? The thesis answers these questions by profiling the web-wide deployment of semantic annotations. The topic is approached in three consecutive steps: In the first step, two approaches for extracting semantic annotations from the Web are discussed. The thesis first evaluates the technique of focused crawling for harvesting semantic annotations. Afterward, a framework to extract semantic annotations from existing web crawl corpora is described. The two extraction approaches are then compared for the purpose of analyzing the deployment of semantic annotations on the Web. In the second step, the thesis analyzes the overall and markup language-specific adoption of semantic annotations. 
This empirical investigation is based on the largest web corpus that is available to the public. Further, the topics covered by deployed semantic annotations and their evolution over time are analyzed. Subsequent studies examine common errors within semantic annotations. In addition, the thesis analyzes the data overlap of the entities that are described by semantic annotations within the same web site and across different web sites. The third step narrows the focus of the analysis towards use case-specific issues. Based on the requirements of a marketplace, a news aggregator, and a travel portal, the thesis empirically examines the utility of semantic annotations for these use cases. Additional experiments analyze whether product-related semantic annotations can be integrated into an existing product categorization schema. In particular, the thesis evaluates the potential of exploiting the diverse category information given by the web sites that provide semantic annotations.
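To illustrate what the abstract means by semantic annotations being "integrated into the code of HTML pages", the following is a minimal sketch of reading Microdata attributes out of a page. It is not the extraction framework described in the thesis; the class name `MicrodataSketch`, the function `extract_microdata`, and the sample markup are hypothetical, and the sketch assumes flat (non-nested) items whose property values are plain element text.

```python
from html.parser import HTMLParser

# Hypothetical sample page fragment annotated with schema.org Microdata.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Phone</span>
  <span itemprop="brand">ExampleBrand</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects (itemtype, itemprop, text) triples from flat Microdata markup.

    Assumption: items are not nested and each property value is the text
    directly inside the element carrying the itemprop attribute.
    """

    def __init__(self):
        super().__init__()
        self.itemtype = None      # type of the item currently being read
        self.current_prop = None  # property name awaiting its text value
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts here; remember its declared type (may be None).
            self.itemtype = attrs.get("itemtype")
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.current_prop and text:
            self.triples.append((self.itemtype, self.current_prop, text))
            self.current_prop = None


def extract_microdata(html):
    """Return the list of triples found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    return parser.triples


triples = extract_microdata(SAMPLE_HTML)
for itemtype, prop, value in triples:
    print(itemtype, prop, value)
```

A full extractor would also have to handle nested items, values carried in attributes such as `content` or `href`, and the RDFa and Microformats syntaxes, none of which this sketch attempts.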
Translation of the abstract:	More than 20 years ago, Tim Berners-Lee published his idea of the Semantic Web. According to his vision, the Semantic Web was to be an extension of the existing Web in which the information it contains is semantically defined, which would simplify cooperation between humans and machines [Berners-Lee et al., 2001]. Semantic annotations in HTML pages are one concrete realization of this idea, and they have been adopted by a very large number of web site operators in recent years. Semantic annotations are inserted directly into the HTML source code of a web page using the three HTML markup extensions Microformats, RDFa, and Microdata. Information annotated in this way is processed mainly by the large search engine companies Bing, Google, Yahoo!, and Yandex. These companies use the semantic annotations they find in the HTML source code of crawled web pages to improve the display of search results or to extend their internal knowledge graphs. Despite this heavy use by the search engine companies, little is known about the integration and spread of semantic annotations on the Web: How many web sites offer semantic annotations? Which topics are described? How detailed is the annotated information, and do web site operators use the annotations correctly? Is the information offered in this way useful, and how can it be collected efficiently? These questions are answered in the three consecutive parts of this dissertation through a comprehensive profiling of the dataspace spanned by semantic annotations. The first part discusses two ways of collecting semantic annotations. First, the dissertation evaluates a method based on the idea of focused crawling. 
Next, a framework is presented that can extract semantic annotations from existing web crawl datasets. Both approaches are compared and evaluated with respect to the representativeness of the data obtained. In the second part, the thesis empirically analyzes the general as well as the markup-specific spread of semantic annotations on the Web, based on the largest publicly accessible web crawl dataset. Beyond their spread, the topics covered and their change over time are examined. The thesis then investigates the degree to which web site operators use semantic annotations correctly. The final part of the thesis focuses on an application-oriented analysis of semantic annotations. Based on the requirements of an online shop, a news aggregation site, and a travel portal, the usefulness of semantic annotations is evaluated. Finally, the thesis examines to what extent site-specific product categorizations can be used to map product information onto an existing product classification and thus enable a fine-grained topic analysis. (translated from German)