The Open Cybernetics & Systemics Journal

2015, 9 : 37-43
Published online 2015 February 20. DOI: 10.2174/1874110X01509010037
Publisher ID: TOCSJ-9-37

De-Duplication Scheduling Strategy in Real-Time Data Warehouse

Jie Song, Hui Liu, JinBo Wu and Yu-Bin Bao
Software College, Northeastern University, Shenyang 110819, China.

ABSTRACT

The quality of the data in a data warehouse is crucial to decision-makers, and data duplication is one of the critical factors that degrade it. Data de-duplication is therefore an essential process in data warehousing. For a real-time data warehouse in particular, it is necessary to ensure not only data quality in real time but also the performance of front-end queries and analysis, so the scheduling strategy for de-duplication deserves careful study. In this paper, we first investigate three kinds of data de-duplication scheduling strategies: the De-duplication Prior scheduling Strategy (DPS), the Real-time scheduling Strategy (RS) and the ETL Prior scheduling Strategy (EPS). We then propose a new Time-Triggered scheduling Strategy (TTS), which belongs to the EPS category. Finally, we evaluate the performance of the proposed scheduling strategy through experiments. This work contributes to efficient data cleaning in, and the application of, real-time data warehouses.

Keywords:

Data warehouse, ETL, real-time, de-duplication, scheduling strategy.