Exploiting general-purpose background knowledge for automated schema matching

Portisch, Jan

URN:	urn:nbn:de:bsz:180-madoc-628036
Document type:	Doctoral dissertation
Year of publication:	2022
Place of publication:	Mannheim
University:	Universität Mannheim
Reviewer:	Paulheim, Heiko
Date of oral examination:	25 August 2022
Language of publication:	English
Institution:	Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Data Science (Paulheim 2018-)
License:	Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject:	004 Computer science
Keywords (German):	Datenintegration, Hintergrundwissen, Kontextwissen, Schema Matching, Wissensgraphen
Keywords (English):	ontology matching, schema matching, ontology alignment, background knowledge, context knowledge, knowledge graph matching
Abstract:	The schema matching task is an integral part of the data integration process and is usually its first step. Schema matching is typically very complex and time-consuming, and it is therefore carried out largely by humans. One reason for the low degree of automation is that schemas are often defined with deep background knowledge that is not itself present within the schemas. Overcoming the problem of missing background knowledge is a core challenge in automating the data integration process. In this dissertation, the task of matching semantic models, so-called ontologies, with the help of external background knowledge is investigated in depth in Part I. Throughout this thesis, the focus lies on large, general-purpose resources, since domain-specific resources are rarely available for most domains. Besides new knowledge resources, this thesis also explores new strategies to exploit such resources. A technical base for the development and comparison of matching systems is presented in Part II. The framework introduced here allows for simple and modularized matcher development (with background knowledge sources) and for extensive evaluations of matching systems. Knowledge graphs, which have grown significantly in size in recent years, are among the largest structured sources of general-purpose background knowledge. However, exploiting such graphs is not trivial. In Part III, knowledge graph embeddings are explored, analyzed, and compared, and multiple improvements to existing approaches are presented. In Part IV, numerous concrete matching systems that exploit general-purpose background knowledge are presented. Furthermore, exploitation strategies and resources are analyzed and compared. This dissertation closes with a perspective on real-world applications.