Dual-objective fine-tuning of BERT for entity matching

PDF:	p1913-peeters.pdf (published version, 960 kB)

DOI:	https://doi.org/10.14778/3467861.3467878
URL:	https://madoc.bib.uni-mannheim.de/59958
Additional URL:	https://dl.acm.org/doi/10.14778/3467861.3467878
URN:	urn:nbn:de:bsz:180-madoc-599585
Document type:	Conference publication
Year of publication:	2021
Book title:	47th International Conference on Very Large Data Bases (VLDB 2021) : Copenhagen, Denmark, August 16-20, 2021
Journal or series title:	Proceedings of the VLDB Endowment
Volume:	14, no. 10
Page range:	1913-1921
Event title:	VLDB 2021, 47th International Conference on Very Large Data Bases
Event location:	København, Denmark, Hybrid
Event date:	August 16-20, 2021
Place of publication:	New York, NY
Publisher:	Association for Computing Machinery
ISSN:	2150-8097
Related URLs:	http://www.vldb.org/pvldb/vol14/p1913-pe...
Language of publication:	English
Institution:	School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
Existing license:	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Subject:	004 Computer science
Keywords (English):	Identity Resolution, Entity Matching, Deep Learning, BERT, Shared Identifiers, Data Integration
Abstract:	An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
Additional information:	Online resource
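
The dual-objective idea described in the abstract (a match/non-match decision combined with predicting the entity identifier of each description in a training pair) can be illustrated with a small loss computation. This is only a minimal sketch, not the authors' implementation: the function names are hypothetical, the inputs stand in for the outputs of a BERT encoder with three prediction heads, and the unweighted sum of the three terms is an assumption, since the abstract does not state how the objectives are combined.

```python
import math


def binary_cross_entropy(match_prob, is_match):
    """Loss for the match/non-match decision (is_match is 0 or 1)."""
    eps = 1e-12  # clamp the probability to avoid log(0)
    p = min(max(match_prob, eps), 1.0 - eps)
    return -(is_match * math.log(p) + (1 - is_match) * math.log(1.0 - p))


def softmax_cross_entropy(logits, target_index):
    """Loss for predicting one entity identifier among len(logits) classes."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def dual_objective_loss(match_prob, is_match,
                        left_logits, left_id,
                        right_logits, right_id):
    """Combine the binary matching loss with two identifier-prediction losses.

    Assumption: the three terms are summed without weights.
    """
    binary = binary_cross_entropy(match_prob, is_match)
    multi_class = (softmax_cross_entropy(left_logits, left_id)
                   + softmax_cross_entropy(right_logits, right_id))
    return binary + multi_class


# Hypothetical example: a matching pair whose two descriptions both refer to
# the entity with identifier index 2, out of four known entities.
loss = dual_objective_loss(
    match_prob=0.8, is_match=1,
    left_logits=[0.1, 0.2, 2.5, -1.0], left_id=2,
    right_logits=[0.0, 0.3, 1.9, -0.5], right_id=2,
)
print(f"dual-objective loss: {loss:.4f}")
```

Because the multi-class terms need a label for every entity, a setup like this can only be trained on identifiers that occur in the training data, which is consistent with the abstract's observation that the method helps for seen products and underperforms for unseen ones.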