The Open Cybernetics & Systemics Journal
2015, 9: 37-43. Published online 2015 February 20. DOI: 10.2174/1874110X01509010037
Publisher ID: TOCSJ-9-37
De-Duplication Scheduling Strategy in Real-Time Data Warehouse
ABSTRACT
The quality of the data in a data warehouse is crucial to decision-makers, and data duplication is considered one of the critical factors that degrade it. Data de-duplication is therefore an essential process in data warehousing. For a real-time data warehouse in particular, it is necessary to ensure not only data quality in real time but also the performance of front-end queries and analysis, so the scheduling strategy for de-duplication deserves careful study. In this paper, we first investigate three kinds of data de-duplication scheduling strategies: the De-duplication Prior scheduling Strategy (DPS), the Real-time scheduling Strategy (RS), and the ETL Prior scheduling Strategy (EPS). We then propose a new Time-Triggered scheduling Strategy (TTS), which belongs to the EPS category, and finally evaluate the performance of the proposed strategy through experiments. This work contributes to efficient data cleaning and to the application of real-time data warehouses.