The Open Automation and Control Systems Journal

2014, 6: 1277-1286
Published online 2014 December 31. DOI: 10.2174/1874444301406011277
Publisher ID: TOAUTOCJ-6-1277

TA-DRD: A Three-step Automatic Duplicate Record Detection

Yongquan Dong, Ping Ling, Yali Liu and Qiang Chu
School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China.

ABSTRACT

Duplicate record detection is a key step in Deep Web data integration, but existing approaches are not suited to its large-scale nature. This paper proposes a three-step automatic approach to duplicate record detection in the Deep Web. First, it uses a cluster ensemble to select the initial training instances. Second, it uses tri-training classification to construct a classification model. Finally, it uses evidence theory to combine the results of multiple classification models into a domain-level duplicate record detection model, which can be used for large-scale duplicate record detection within the same domain. Experimental results show that the proposed approach outperforms previous work and that the domain-level duplicate record detection model achieves high performance.
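
As an illustration of the final step, the sketch below shows how evidence theory can fuse the outputs of two classifiers using Dempster's rule of combination, the standard combination rule in Dempster-Shafer theory. The abstract does not state which combination rule or mass assignments the approach uses, so the function name `combine`, the hypothesis labels, and the mass values are all assumptions chosen for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    Illustrative only: not taken from the paper.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


DUP = frozenset({"duplicate"})
NON = frozenset({"non-duplicate"})
EITHER = DUP | NON  # mass left uncommitted by a classifier

# Hypothetical mass assignments from two classifiers for one record pair.
clf_a = {DUP: 0.7, NON: 0.1, EITHER: 0.2}
clf_b = {DUP: 0.6, NON: 0.2, EITHER: 0.2}

fused = combine(clf_a, clf_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 3))
```

With these assumed inputs, both classifiers lean towards "duplicate", so the fused belief in that hypothesis is higher than either classifier's own and the uncommitted mass shrinks.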

Keywords:

Duplicate Record Detection, Deep Web, Data Integration, random measurement, image reconstruction.