Zaturrawiah Ali Omar and Zamira Hasanah Zamzuri and Noratiqah Mohd Ariff and Mohd Aftar Abu Bakar (2023) Training data selection for record linkage classification. Symmetry, 15. pp. 1-17.
Abstract
This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs an unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, and both balanced (symmetric) and imbalanced (asymmetric) class distributions were tested. The top and imbalanced construction was found to be the most effective, producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F1-score comparable to that of probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
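The abstract reports its results as F1-score and recall. As a reading aid, the following is a minimal sketch of how these metrics are commonly computed for record linkage, treating each classifier output as a set of record pairs declared to be matches. The function name `linkage_metrics` and the record-pair identifiers are hypothetical and are not taken from the paper.

```python
def linkage_metrics(predicted, actual):
    """Return (precision, recall, F1) for two sets of record pairs.

    predicted: pairs the classifier labelled as matches
    actual:    pairs that are true matches (ground truth)
    """
    true_pos = len(predicted & actual)  # pairs correctly labelled as matches
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical record-pair identifiers, for illustration only
actual = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

precision, recall, f1 = linkage_metrics(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted pairs are true matches and two of the four true matches are found, so recall is lower than precision. Recall rises when fewer true matches are missed, which is the quantity the abstract reports improving by 6.45%.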
Item Type: | Article |
---|---|
Keyword: | Record linkage, Unsupervised random forest, Similarity measure, Training data |
Subjects: | Q Science > QA Mathematics > QA1-939 Mathematics; Q Science > QA Mathematics > QA1-939 Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science |
Department: | FACULTY > Faculty of Science and Natural Resources |
Depositing User: | SITI AZIZAH BINTI IDRIS |
Date Deposited: | 10 Dec 2024 14:57 |
Last Modified: | 10 Dec 2024 14:57 |
URI: | https://eprints.ums.edu.my/id/eprint/42203 |