Profiling the semantics of n-ary web table data


Lehmberg, Oliver ; Bizer, Christian



DOI: https://doi.org/10.1145/3323878.3325806
URL: https://dl.acm.org/doi/10.1145/3323878.3325806
Document Type: Conference or workshop publication
Year of publication: 2019
Book title: Proceedings of the International Workshop on Semantic Big Data : SBD '19, Amsterdam, The Netherlands, June 30 - July 5, 2019
Page range: 5:1-5:6
Conference title: SBD '19
Location of the conference venue: Amsterdam, The Netherlands
Date of the conference: 05.07.2019
Editors: Groppe, Sven ; Gruenwald, Le
Place of publication: New York, NY, USA
Publishing house: ACM
ISBN: 978-1-4503-6766-0
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Individual keywords (German): information extraction, data profiling, key detection, n-ary relations, web tables
Abstract: The Web contains millions of relational HTML tables, which cover a multitude of different, often very specific topics. This rich pool of data has motivated a growing body of research on methods that use web table data to extend local tables with additional attributes or add missing facts to knowledge bases. Nearly all existing approaches for these tasks build upon the assumption that web table data consists of binary relations, meaning that an attribute value depends on a single key attribute, and that the key attribute value is contained in the HTML table. Inspecting randomly chosen tables on the Web, however, quickly reveals that both assumptions are wrong for a large fraction of the tables. In order to better understand the potential of non-binary web table data for downstream applications, this paper analyses a corpus of 5 million web tables originating from 80 thousand different web sites with respect to how many web table attributes are non-binary, what composite keys are required to correctly interpret the semantics of the non-binary attributes, and whether the values of these keys are found in the table itself or need to be extracted from the page surrounding the table. The profiling of the corpus shows that at least 38% of the relations are non-binary. Recognizing these relations requires information from the title or the URL of the web page in 50% of the cases. We find that different websites use keys of varying length for the same dependent attribute, e.g., one cluster of websites presents employment numbers depending on time, while another cluster presents them depending on time and profession. By identifying these clusters, we lay the foundation for selecting Web data sources according to the specificity of the keys that are used to determine specific attributes.




This entry is part of the university bibliography.



