The Open Automation and Control Systems Journal

2015, 7 : 2039-2043
Published online 2015 October 27. DOI: 10.2174/1874444301507012039
Publisher ID: TOAUTOCJ-7-2039

Research and Realization of the Extensible Data Cleaning Framework EDCF

Xu Chengyu and Gao Wei
State Grid Shanxi Electric Power Corporation, Shanxi, Taiyuan 030001.

ABSTRACT

This paper proposes an extensible data cleaning framework, EDCF, built on the key technologies of data cleaning. The framework includes an open rules library and an open algorithms library. The paper describes the model principle and working process of the framework. During cleaning, errors in the data source can be cleaned according to the specific business requirements by applying predefined cleaning rules and selecting an appropriate algorithm. The current implementation provides the basic functions of the framework's data cleaning module, and experiments verify the validity of the framework and show that it has good efficiency and operating performance.

Keywords:

Data Cleaning, Clustering, Outlier, Approximately Duplicated Records, The Extensible Data Cleaning Framework.
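
The architecture summarized in the abstract, an open rules library combined with an open algorithms library, might be sketched roughly as follows. This is only an illustrative sketch and not the authors' implementation: the class `CleaningFramework`, its methods, the two toy algorithms, and the sample records are all hypothetical names and data introduced here. The toy algorithms (exact-duplicate removal and a simple range check) are simplified stand-ins for the approximate-duplicate and outlier detection named in the keywords.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
# An algorithm receives the records plus one rule's parameters and returns cleaned records.
Algorithm = Callable[[List[Record], dict], List[Record]]


class CleaningFramework:
    """Hypothetical sketch: an open algorithms library plus an open rules library."""

    def __init__(self) -> None:
        self.algorithms: Dict[str, Algorithm] = {}  # algorithms library
        self.rules: List[dict] = []                 # rules library, applied in order

    def register_algorithm(self, name: str, fn: Algorithm) -> None:
        # New algorithms can be plugged in without changing the framework itself.
        self.algorithms[name] = fn

    def add_rule(self, algorithm: str, **params) -> None:
        # A rule binds a registered algorithm to business-specific parameters.
        if algorithm not in self.algorithms:
            raise KeyError(f"unknown algorithm: {algorithm}")
        self.rules.append({"algorithm": algorithm, "params": params})

    def clean(self, records: List[Record]) -> List[Record]:
        # Apply each rule in turn, passing the output of one to the next.
        for rule in self.rules:
            records = self.algorithms[rule["algorithm"]](records, rule["params"])
        return records


def drop_exact_duplicates(records: List[Record], params: dict) -> List[Record]:
    # Stand-in for approximate-duplicate detection: keeps the first record per key tuple.
    seen = set()
    kept = []
    for record in records:
        signature = tuple(record.get(key) for key in params["keys"])
        if signature not in seen:
            seen.add(signature)
            kept.append(record)
    return kept


def drop_out_of_range(records: List[Record], params: dict) -> List[Record]:
    # Stand-in for outlier detection: keeps records whose field lies in [low, high].
    field, low, high = params["field"], params["low"], params["high"]
    return [r for r in records if low <= r[field] <= high]


# Usage with made-up sample data.
framework = CleaningFramework()
framework.register_algorithm("dedupe", drop_exact_duplicates)
framework.register_algorithm("range_check", drop_out_of_range)
framework.add_rule("dedupe", keys=["id", "value"])
framework.add_rule("range_check", field="value", low=0, high=1000)

data = [
    {"id": 1, "value": 120},
    {"id": 1, "value": 120},  # duplicate of the first record
    {"id": 2, "value": -5},   # outside the allowed range
]
cleaned = framework.clean(data)
```

Keeping rules as data (an algorithm name plus parameters) is one way to let cleaning behavior change per business case without touching the framework code; other designs are equally plausible.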