SC-block: Supervised contrastive blocking within entity resolution pipelines


Brinkmann, Alexander ; Shraga, Roee ; Bizer, Christian



DOI: https://doi.org/10.1007/978-3-031-60626-7_7
URL: https://link.springer.com/chapter/10.1007/978-3-03...
Document type: Conference publication
Year of publication: 2024
Book title: The Semantic Web : 21st International Conference, ESWC 2024, Hersonissos, Crete, Greece, May 26–30, 2024, Proceedings, Part I
Journal or series title: Lecture Notes in Computer Science
Volume: 14664
Page range: 121-142
Event title: ESWC 2024, Extended Semantic Web Conference
Event location: Hersonissos, Crete, Greece
Event date: 26–30 May 2024
Editors: Meroño Peñuela, Albert ; Dimou, Anastasia ; Troncy, Raphaël ; Hartig, Olaf ; Acosta, Maribel ; Alam, Mehwish ; Paulheim, Heiko ; Lisena, Pasquale
Place of publication: Berlin [et al.]
Publisher: Springer
ISBN: 978-3-031-60625-0 , 978-3-031-60626-7
ISSN: 0302-9743 , 1611-3349
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Subject: 004 Computer science
Keywords (English): identity resolution , blocking , schema.org , benchmarking , supervised contrastive learning
Abstract: Millions of websites use the schema.org vocabulary to annotate structured data describing products, local businesses, or events within their HTML pages. Integrating schema.org data from the Semantic Web poses distinct requirements to entity resolution methods: (1) the methods must scale to millions of entity descriptions and (2) the methods must be able to deal with the heterogeneity that results from a large number of data sources. In order to scale to numerous entity descriptions, entity resolution methods combine a blocker for candidate pair selection and a matcher for the fine-grained comparison of the pairs in the candidate set. This paper introduces SC-Block, a blocking method that uses supervised contrastive learning to cluster entity descriptions in an embedding space. The embedding enables SC-Block to generate small candidate sets even for use cases that involve a large number of unique tokens within entity descriptions. To measure the effectiveness of blocking methods for Semantic Web use cases, we present a new benchmark, WDC-Block. WDC-Block requires blocking product offers from 3,259 e-shops that use the schema.org vocabulary. The benchmark has a maximum Cartesian product of 200 billion pairs of offers and a vocabulary size of 7 million unique tokens. Our experiments using WDC-Block and other blocking benchmarks demonstrate that SC-Block produces candidate sets that are on average 50% smaller than the candidate sets generated by competing blocking methods. Entity resolution pipelines that combine SC-Block with state-of-the-art matchers finish 1.5 to 4 times faster than pipelines using other blockers, without any loss in F1 score.
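
Illustration (not part of the original record): the abstract describes a blocker that selects candidate pairs from an embedding space before a matcher compares them. The sketch below shows one common way to do this, nearest-neighbour search by cosine similarity. It is a minimal sketch under stated assumptions, not the SC-Block implementation: the abstract does not say how candidates are retrieved, the function name `knn_block` is hypothetical, and the hand-made toy vectors stand in for embeddings that a trained encoder would produce.

```python
import numpy as np


def knn_block(emb_a, emb_b, k):
    """Hypothetical helper: for each row i of emb_a, pair it with the
    k rows j of emb_b that have the highest cosine similarity."""
    # Normalise rows so that a dot product equals cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T
    k = min(k, emb_b.shape[0])
    # Indices of the k most similar rows of emb_b for every row of emb_a.
    top = np.argsort(-sims, axis=1)[:, :k]
    return {(i, int(j)) for i in range(top.shape[0]) for j in top[i]}


# Toy embeddings (assumption: made up by hand for illustration only).
offers_a = np.array([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.1, 0.0, 1.0]])
offers_b = np.array([[0.9, 0.2, 0.0],
                     [0.0, 0.1, 1.0],
                     [0.1, 0.9, 0.0],
                     [0.5, 0.5, 0.5]])

candidates = knn_block(offers_a, offers_b, k=1)
cartesian = offers_a.shape[0] * offers_b.shape[0]

print(sorted(candidates))
print(len(candidates), "candidate pairs instead of", cartesian)
```

Only the pairs in `candidates` would be passed on to the matcher, which is why a smaller candidate set shortens the overall pipeline runtime. At the scale the abstract mentions, the exhaustive similarity matrix used here would normally be replaced by an approximate nearest-neighbour index.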




This entry is part of the university bibliography.




ORCID: Brinkmann, Alexander (0000-0002-9379-2048) ; Paulheim, Heiko (0000-0003-4386-8195)