SC-block: Supervised contrastive blocking within entity resolution pipelines


Brinkmann, Alexander ; Shraga, Roee ; Bizer, Christian



DOI: https://doi.org/10.1007/978-3-031-60626-7_7
URL: https://link.springer.com/chapter/10.1007/978-3-03...
Document Type: Conference or workshop publication
Year of publication: 2024
Book title: The Semantic Web : 21st International Conference, ESWC 2024, Hersonissos, Crete, Greece, May 26–30, 2024, Proceedings, Part I
Series title: Lecture Notes in Computer Science
Volume: 14664
Page range: 121-142
Conference title: ESWC 2024, Extended Semantic Web Conference
Location of the conference venue: Hersonissos, Crete, Greece
Date of the conference: May 26–30, 2024
Editors: Meroño Peñuela, Albert ; Dimou, Anastasia ; Troncy, Raphaël ; Hartig, Olaf ; Acosta, Maribel ; Alam, Mehwish ; Paulheim, Heiko ; Lisena, Pasquale
Place of publication: Berlin [et al.]
Publisher: Springer
ISBN: 978-3-031-60625-0 , 978-3-031-60626-7
ISSN: 0302-9743 , 1611-3349
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science, internet
Keywords (English): identity resolution , blocking , schema.org , benchmarking , supervised contrastive learning
Abstract: Millions of websites use the schema.org vocabulary to annotate structured data describing products, local businesses, or events within their HTML pages. Integrating schema.org data from the Semantic Web poses distinct requirements to entity resolution methods: (1) the methods must scale to millions of entity descriptions and (2) the methods must be able to deal with the heterogeneity that results from a large number of data sources. In order to scale to numerous entity descriptions, entity resolution methods combine a blocker for candidate pair selection and a matcher for the fine-grained comparison of the pairs in the candidate set. This paper introduces SC-Block, a blocking method that uses supervised contrastive learning to cluster entity descriptions in an embedding space. The embedding enables SC-Block to generate small candidate sets even for use cases that involve a large number of unique tokens within entity descriptions. To measure the effectiveness of blocking methods for Semantic Web use cases, we present a new benchmark, WDC-Block. WDC-Block requires blocking product offers from 3,259 e-shops that use the schema.org vocabulary. The benchmark has a maximum Cartesian product of 200 billion pairs of offers and a vocabulary size of 7 million unique tokens. Our experiments using WDC-Block and other blocking benchmarks demonstrate that SC-Block produces candidate sets that are on average 50% smaller than the candidate sets generated by competing blocking methods. Entity resolution pipelines that combine SC-Block with state-of-the-art matchers finish 1.5 to 4 times faster than pipelines using other blockers, without any loss in F1 score.




This entry is part of the university bibliography.





ORCID: Brinkmann, Alexander: 0000-0002-9379-2048 ; Paulheim, Heiko: 0000-0003-4386-8195
