Dual-objective fine-tuning of BERT for entity matching
Peeters, Ralph; Bizer, Christian
DOI: https://doi.org/10.14778/3467861.3467878
URL: https://madoc.bib.uni-mannheim.de/59958
Additional URL: https://dl.acm.org/doi/10.14778/3467861.3467878
URN: urn:nbn:de:bsz:180-madoc-599585
Document type: Conference publication
Year of publication: 2021
Book title: 47th International Conference on Very Large Data Bases (VLDB 2021): Copenhagen, Denmark, August 16-20, 2021
Journal or series title: Proceedings of the VLDB Endowment
Volume: 14(10)
Page range: 1913-1921
Event title: VLDB 2021, 47th International Conference on Very Large Data Bases
Event location: København, Denmark, Hybrid
Event date: August 16-20, 2021
Place of publication: New York, NY
Publisher: Association for Computing Machinery
ISSN: 2150-8097
Language of publication: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Existing license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Subject: 004 Computer science
Keywords (English): Identity Resolution, Entity Matching, Deep Learning, BERT, Shared Identifiers, Data Integration
Abstract:
An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
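The dual-objective idea described in the abstract can be illustrated with a small loss computation. The following is only a sketch under stated assumptions, not the authors' implementation: it assumes the three objectives (match/non-match for the pair, plus one entity-identifier prediction for each of the two descriptions) are summed with equal weight, and the function and parameter names are hypothetical. In a real setup the logits would come from classification heads on top of a BERT encoder.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Multi-class loss: negative log-probability of the true entity identifier."""
    return -math.log(softmax(logits)[target_index])


def binary_cross_entropy(logit, label):
    """Binary loss on a single match logit; label is 1 (match) or 0 (non-match)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def dual_objective_loss(match_logit, is_match,
                        id_logits_left, id_left,
                        id_logits_right, id_right):
    """Sum of the pairwise matching loss and both entity-identifier losses.

    Assumption: equal weighting of the three terms (not stated in the abstract).
    """
    return (binary_cross_entropy(match_logit, is_match)
            + cross_entropy(id_logits_left, id_left)
            + cross_entropy(id_logits_right, id_right))


# Hypothetical training pair: both descriptions refer to entity 2 of 4 known entities.
uninformed = dual_objective_loss(0.0, 1, [0.0] * 4, 2, [0.0] * 4, 2)
confident = dual_objective_loss(4.0, 1, [0, 0, 5, 0], 2, [0, 0, 5, 0], 2)
print(uninformed, confident)
```

Because the identifier terms require a training label for each description, a setup like this only applies to entities that appear in the training data, which is consistent with the abstract's observation that the method helps for seen products and underperforms for unseen ones.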
Additional information: Online resource
This entry is part of the university bibliography.
The document is provided by the publication server of the Mannheim University Library.