Dual-objective fine-tuning of BERT for entity matching
Peeters, Ralph; Bizer, Christian
DOI:
|
https://doi.org/10.14778/3467861.3467878
|
URL:
|
https://madoc.bib.uni-mannheim.de/59958
|
Additional URL:
|
https://dl.acm.org/doi/10.14778/3467861.3467878
|
URN:
|
urn:nbn:de:bsz:180-madoc-599585
|
Document Type:
|
Conference or workshop publication
|
Year of publication:
|
2021
|
Book title:
|
47th International Conference on Very Large Data Bases (VLDB 2021) : Copenhagen, Denmark, August 16-20, 2021
|
Journal or publication series:
|
Proceedings of the VLDB Endowment
|
Volume:
|
14 (issue 10)
|
Page range:
|
1913-1921
|
Conference title:
|
VLDB 2021, 47th International Conference on Very Large Data Bases
|
Location of the conference venue:
|
København, Denmark, Hybrid
|
Date of the conference:
|
16-20 August 2021
|
Place of publication:
|
New York, NY
|
Publishing house:
|
Association for Computing Machinery
|
ISSN:
|
2150-8097
|
Publication language:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
|
Pre-existing license:
|
Creative Commons Attribution, Non-Commercial, No Derivatives 4.0 International (CC BY-NC-ND 4.0)
|
Subject:
|
004 Computer science, internet
|
Keywords (English):
|
Identity Resolution, Entity Matching, Deep Learning, BERT, Shared Identifiers, Data Integration
|
Abstract:
|
An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in their respective domains. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated but not for the others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the entity descriptions that contain identifiers as training data. The task can be approached by learning a binary classifier that distinguishes pairs of entity descriptions for the same real-world entity from pairs of descriptions of different entities. It can also be modeled as a multi-class classification problem by learning classifiers that identify descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. To gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes than RNN-based models.
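The dual-objective idea described in the abstract can be illustrated with a minimal sketch of a combined loss for one training pair. This is not the authors' implementation: the function names, the `weight` parameter, and the plain sum of the three terms are assumptions made for illustration, since the abstract does not state how the objectives are weighted. The logits would normally come from task heads on top of BERT; here they are plain Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Multi-class cross-entropy for one example (target is a class index)."""
    return -math.log(softmax(logits)[target])


def binary_cross_entropy(logit, label):
    """Stable binary cross-entropy on a raw logit (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(match_logit, is_match,
               id_logits_left, id_left,
               id_logits_right, id_right,
               weight=1.0):
    """Hypothetical combined loss: match/non-match term plus one
    entity-identifier term per description in the pair.
    The weighting scheme is an assumption, not taken from the paper."""
    match_term = binary_cross_entropy(match_logit, is_match)
    id_term = (cross_entropy(id_logits_left, id_left)
               + cross_entropy(id_logits_right, id_right))
    return match_term + weight * id_term


# Example: an uninformative model over two known entity identifiers.
loss = joint_loss(0.0, 1, [0.0, 0.0], 0, [0.0, 0.0], 1)
```

Setting `weight=0.0` reduces the sketch to single-objective binary matching, which makes the contribution of the identifier-prediction terms easy to isolate.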
|
Additional information:
|
Online resource
|
| This entry is part of the university bibliography. |
| The document is provided by the publication server of the Mannheim University Library. |