The Open Automation and Control Systems Journal
2014, 6: 1277-1286
Published online 2014 December 31. DOI: 10.2174/1874444301406011277
Publisher ID: TOAUTOCJ-6-1277
TA-DRD: A Three-step Automatic Duplicate Record Detection
ABSTRACT
Duplicate record detection is a key step in Deep Web data integration, but existing approaches do not adapt to its large scale. This paper proposes a three-step automatic approach to duplicate record detection in the Deep Web. First, it uses a cluster ensemble to select the initial training instances. Second, it uses tri-training classification to construct a classification model. Finally, it uses evidence theory to combine the results of multiple classification models into a domain-level duplicate record detection model, which can be used for large-scale duplicate record detection within the same domain. Experimental results show that the proposed approach outperforms previous work and that the domain-level duplicate record detection model achieves high performance.
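As a rough illustration of the third step, the sketch below fuses the outputs of two classifiers for a single record pair using Dempster's rule of combination, the standard combination rule in evidence theory. The abstract does not specify the exact combination scheme, so this is an assumption rather than the authors' implementation; the names `DUP`, `NON`, `EITHER`, and `combine`, and all mass values, are hypothetical.

```python
from itertools import product

# Frame of discernment for one record pair (hypothetical labels).
DUP = frozenset({"dup"})      # the pair is a duplicate
NON = frozenset({"non"})      # the pair is not a duplicate
EITHER = DUP | NON            # the classifier is undecided


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset (a subset of the frame) to a
    belief mass; the masses of one function are assumed to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence supports the intersection of the two sets.
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            # Contradictory evidence is accumulated as conflict.
            conflict += wa * wb
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; cannot combine")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


# Illustrative (made-up) outputs of two classification models.
m1 = {DUP: 0.7, NON: 0.1, EITHER: 0.2}
m2 = {DUP: 0.6, NON: 0.2, EITHER: 0.2}

fused = combine(m1, m2)
for label, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(label), round(mass, 2))
```

With these illustrative inputs, the conflicting mass is 0.20, and after renormalization the fused belief is roughly 0.85 for duplicate, 0.10 for non-duplicate, and 0.05 undecided. Two models that each lean moderately toward "duplicate" therefore yield a stronger combined belief than either alone.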