Neural data search for table augmentation

Brinkmann, Alexander

URL:	https://ceur-ws.org/Vol-3379/PhDWorkshop_2023_brin...
URN:	urn:nbn:de:bsz:180-madoc-643986
Document type:	Conference publication
Year of publication:	2023
Book title:	Proceedings of the Workshops of the EDBT/ICDT 2023 Joint Conference, Ioannina, Greece, March 28, 2023
Journal or series title:	CEUR Workshop Proceedings
Volume:	3379
Page range:	1-4
Event title:	EDBT/ICDT Workshops 2023
Event location:	Ioannina, Greece
Event date:	March 28, 2023
Editors:	Fletcher, George ; Kantere, Verena
Place of publication:	Aachen, Germany
Publisher:	RWTH Aachen
ISSN:	1613-0073
Related URLs:	https://ceur-ws.org/Vol-3379/
Language of publication:	English
Institution:	Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Existing license:	Creative Commons Attribution 4.0 International (CC BY 4.0)
Subject:	004 Computer science
Abstract:	Tabular data is widely available on the web and in private data lakes run by commercial companies or research institutes. However, data that is essential for a specific task at hand is often scattered throughout numerous tables in these data lakes. Accessing this data requires retrieving the relevant information for the task. One approach to retrieving this data is table augmentation. Table augmentation adds an additional attribute to a query table and populates the values of that attribute with data from the data lake. My research focuses on evaluating methods for augmenting a table with an additional attribute. Table augmentation presents a variety of challenges due to the heterogeneity of data sources and the multitude of possible combinations of methods. To successfully augment a query table based on tabular data from a data lake, several tasks such as data normalization, data search, schema matching, information extraction and data fusion must be performed. In my work, I empirically compare methods for data search, information extraction and data fusion as well as complete table augmentation pipelines using different datasets containing tabular data found in real-world data lakes. Methodologically, I plan to introduce new neural techniques for data search, information extraction and data fusion in the context of table augmentation. These new methods, as well as existing symbolic data search methods for table augmentation, will be empirically evaluated on two sets of benchmark query tables. The aim is to identify task- and dataset-specific challenges for data search, information extraction and data fusion methods. By profiling the datasets and analysing the errors made by the evaluated methods on the test query tables, the strengths and weaknesses of the methods can be systematically identified. Data search and information extraction methods should maximize recall while data fusion methods should achieve high accuracy. Pipelines built on the new methods should deliver their results quickly while keeping the accuracy of the augmented attribute values as high as possible.
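
The pipeline stages named in the abstract (data normalization, data search, schema matching, data fusion) can be illustrated with a toy sketch. This is not the author's method: it uses simple symbolic string matching and a majority vote rather than the neural techniques the abstract proposes. The function names `normalize` and `augment`, the example tables, and the assumption that every lake table shares the query table's key column name are all hypothetical.

```python
from collections import Counter


def normalize(value):
    """Data normalization: lowercase and collapse whitespace."""
    return " ".join(str(value).lower().split())


def augment(query_rows, key, target, lake_tables):
    """Add the attribute `target` to each query row using values from the lake.

    Assumption (for illustration only): lake tables use the same key column
    name as the query table, and target columns match by normalized name.
    """
    target_name = normalize(target)
    augmented = []
    for row in query_rows:
        entity = normalize(row[key])
        candidates = []
        for table in lake_tables:
            for lake_row in table:
                # Data search: keep only lake rows describing the same entity.
                if normalize(lake_row.get(key, "")) != entity:
                    continue
                # Schema matching: find the column that corresponds to `target`.
                for column, value in lake_row.items():
                    if normalize(column) == target_name and value not in (None, ""):
                        candidates.append(normalize(value))
        # Data fusion: resolve conflicting candidates by majority vote.
        fused = Counter(candidates).most_common(1)[0][0] if candidates else None
        augmented.append({**row, target: fused})
    return augmented


# Hypothetical query table and data lake.
query = [{"name": "Ioannina"}, {"name": "Aachen"}, {"name": "Atlantis"}]
lake = [
    [{"name": "ioannina", "country": "Greece"},
     {"name": "Aachen ", "country": "Germany"}],
    [{"name": "AACHEN", "Country": "germany"},
     {"name": "Aachen", "country": "Belgium"}],
]
result = augment(query, "name", "country", lake)
```

In this sketch the search step favours recall (any row with a matching normalized key contributes a candidate), while the fusion step is responsible for accuracy, mirroring the division of goals stated in the abstract. Entities with no match in the lake receive `None` for the new attribute.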