Dual-objective fine-tuning of BERT for entity matching

Peeters, Ralph ; Bizer, Christian

DOI: https://doi.org/10.14778/3467861.3467878
URL: https://madoc.bib.uni-mannheim.de/59958
Additional URL: https://dl.acm.org/doi/10.14778/3467861.3467878
URN: urn:nbn:de:bsz:180-madoc-599585
Document Type: Conference or workshop publication
Year of publication: 2021
Book title: 47th International Conference on Very Large Data Bases (VLDB 2021) : Copenhagen, Denmark, August 16-20, 2021
Journal or publication series: Proceedings of the VLDB Endowment
Volume: 14 (10)
Page range: 1913-1921
Conference title: VLDB 2021, 47th International Conference on Very Large Data Bases
Location of the conference venue: Copenhagen, Denmark (hybrid)
Date of the conference: August 16-20, 2021
Place of publication: New York, NY
Publisher: Association for Computing Machinery
ISSN: 2150-8097
Publication language: English
Institution: School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Pre-existing license: Creative Commons Attribution, Non-Commercial, No Derivatives 4.0 International (CC BY-NC-ND 4.0)
Subject: 004 Computer science, internet
Keywords (English): Identity Resolution, Entity Matching, Deep Learning, BERT, Shared Identifiers, Data Integration
Abstract: An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
Additional information: Online resource
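
The dual-objective idea summarized in the abstract (a match/non-match decision plus an entity-identifier prediction for each description in a pair) can be illustrated with a small loss computation. This is only a minimal sketch, not the authors' implementation: the function names are hypothetical, the logits would in practice come from classification heads on top of BERT, and combining the three terms as an unweighted sum is an assumption made here for illustration.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against class index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def binary_cross_entropy(logit, label):
    """Cross-entropy of a sigmoid over `logit` against a 0/1 `label`."""
    # softplus(x) = log(1 + exp(x)), computed in a numerically stable way
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def dual_objective_loss(match_logit, match_label,
                        id_logits_left, id_left,
                        id_logits_right, id_right):
    """Hypothetical combined loss for one training pair.

    Assumption: the binary matching term and the two multi-class
    identifier terms are simply added without weights.
    """
    binary = binary_cross_entropy(match_logit, match_label)
    multi = (softmax_cross_entropy(id_logits_left, id_left)
             + softmax_cross_entropy(id_logits_right, id_right))
    return binary + multi


# Example: a matching pair whose two descriptions both belong to entity 2
# (out of three known identifiers); all logits are made-up numbers.
loss = dual_objective_loss(
    match_logit=2.0, match_label=1,
    id_logits_left=[0.1, -0.3, 1.5], id_left=2,
    id_logits_right=[0.0, 0.2, 1.1], id_right=2,
)
print(round(loss, 4))
```

Because the identifier terms require every training description to carry a known identifier, a loss of this shape only applies to entities seen during training, which is consistent with the abstract's distinction between seen and unseen products.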

This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.
