SC-block: Supervised contrastive blocking within entity resolution pipelines
Brinkmann, Alexander
;
Shraga, Roee
;
Bizer, Christian
DOI:
|
https://doi.org/10.1007/978-3-031-60626-7_7
|
URL:
|
https://link.springer.com/chapter/10.1007/978-3-03...
|
Document type:
|
Conference publication
|
Year of publication:
|
2024
|
Book title:
|
The Semantic Web : 21st International Conference, ESWC 2024, Hersonissos, Crete, Greece, May 26–30, 2024, Proceedings, Part I
|
Journal or series title:
|
Lecture Notes in Computer Science
|
Volume:
|
14664
|
Page range:
|
121-142
|
Event title:
|
ESWC 2024, Extended Semantic Web Conference
|
Event location:
|
Hersonissos, Crete, Greece
|
Event date:
|
May 26–30, 2024
|
Editors:
|
Meroño Peñuela, Albert
;
Dimou, Anastasia
;
Troncy, Raphaël
;
Hartig, Olaf
;
Acosta, Maribel
;
Alam, Mehwish
;
Paulheim, Heiko
;
Lisena, Pasquale
|
Place of publication:
|
Berlin [et al.]
|
Publisher:
|
Springer
|
ISBN:
|
978-3-031-60625-0 , 978-3-031-60626-7
|
ISSN:
|
0302-9743 , 1611-3349
|
Publication language:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
|
Subject:
|
004 Computer science
|
Keywords (English):
|
identity resolution , blocking , schema.org , benchmarking , supervised contrastive learning
|
Abstract:
|
Millions of websites use the schema.org vocabulary to annotate structured data describing products, local businesses, or events within their HTML pages. Integrating schema.org data from the Semantic Web poses distinct requirements to entity resolution methods: (1) the methods must scale to millions of entity descriptions and (2) the methods must be able to deal with the heterogeneity that results from a large number of data sources. In order to scale to numerous entity descriptions, entity resolution methods combine a blocker for candidate pair selection and a matcher for the fine-grained comparison of the pairs in the candidate set. This paper introduces SC-Block, a blocking method that uses supervised contrastive learning to cluster entity descriptions in an embedding space. The embedding enables SC-Block to generate small candidate sets even for use cases that involve a large number of unique tokens within entity descriptions. To measure the effectiveness of blocking methods for Semantic Web use cases, we present a new benchmark, WDC-Block. WDC-Block requires blocking product offers from 3,259 e-shops that use the schema.org vocabulary. The benchmark has a maximum Cartesian product of 200 billion pairs of offers and a vocabulary size of 7 million unique tokens. Our experiments using WDC-Block and other blocking benchmarks demonstrate that SC-Block produces candidate sets that are on average 50% smaller than the candidate sets generated by competing blocking methods. Entity resolution pipelines that combine SC-Block with state-of-the-art matchers finish 1.5 to 4 times faster than pipelines using other blockers, without any loss in F1 score.
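The blocker-then-matcher pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: SC-Block embeds entity descriptions with an encoder trained by supervised contrastive learning, whereas the sketch below swaps in a plain bag-of-tokens vector so that it runs without any trained model. The function names (`build_vocab`, `embed`, `block`, `match`), the sample offers, the value of `k`, and the Jaccard threshold are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Map every distinct token in the given texts to a vector index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text, vocab):
    """Toy stand-in for a learned encoder: L2-normalised bag-of-tokens vector."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(text.lower().split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def block(left, right, k):
    """Blocker: for each left record, keep its k nearest right records
    by cosine similarity in the embedding space."""
    vocab = build_vocab(left + right)
    right_vecs = [embed(r, vocab) for r in right]
    candidates = []
    for i, record in enumerate(left):
        lv = embed(record, vocab)
        scored = [(sum(a * b for a, b in zip(lv, rv)), j)
                  for j, rv in enumerate(right_vecs)]
        # Highest similarity first; ties broken by the lower index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        candidates.extend((i, j) for _, j in scored[:k])
    return candidates


def match(a, b, threshold=0.5):
    """Placeholder matcher (assumption): Jaccard token overlap above a threshold."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= threshold


# Hypothetical product offers from two sources.
left = ["apple iphone 13 128gb black", "samsung galaxy s21 256gb"]
right = [
    "iphone 13 black 128gb apple smartphone",
    "galaxy s21 samsung 256gb phantom gray",
    "sony wh-1000xm4 headphones",
]

candidates = block(left, right, k=1)  # 2 pairs instead of the 6 in the Cartesian product
matches = [(i, j) for i, j in candidates if match(left[i], right[j])]
```

The matcher only sees the pairs the blocker lets through, so a smaller candidate set directly reduces the number of fine-grained comparisons. In a realistic setting the exhaustive similarity loop would presumably be replaced by an approximate nearest-neighbour index.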
|
| This entry is part of the university bibliography. |
ORCID:
Brinkmann, Alexander ORCID: 0000-0002-9379-2048 ; Shraga, Roee ; Bizer, Christian ORCID: 0000-0003-2367-0237
ORCID (editors):
Meroño Peñuela, Albert ; Dimou, Anastasia ; Troncy, Raphaël ; Hartig, Olaf ; Acosta, Maribel ; Alam, Mehwish ; Paulheim, Heiko ORCID: 0000-0003-4386-8195 ; Lisena, Pasquale