The Open Cybernetics & Systemics Journal

2015, 9 : 473-477
Published online 2015 May 29. DOI: 10.2174/1874110X01509010473
Publisher ID: TOCSJ-9-473

Mining Non-overlapping Repetitive Sequential Patterns by Improving GSP Algorithm

Yongshun Gong , Xiangjun Dong , Xiqing Han and Ruilian Hou
School of Information, Qilu University of Technology, Shandong, Jinan, 250353, P.R China.

ABSTRACT

Repetitive sequence mining plays very important roles and has been widely studied in DNA or genome, but there are only a few researches in sequence database. Taking sequence <ababababc> for example, traditional sequential pattern mining algorithms only regard <ab> appearing one time when calculating the support of <ab>, regardless of <ab> appearing at least 4 times within the same data sequence. Taking this repetitive property into consideration can help analysts to grasp more useful information. However, most of the existing algorithms on repetitive sequence mining are used for DNA or genome and cannot be used for mining such repetitive patterns due to the different data properties. Therefore, in this paper, (1) we propose a method to clearly determine the number of times a sequence appears in a data sequence; (2) To solve the problem that the support of a sequence will be more than 100% because the sequence may appear more than one times in a data sequence. We propose a method to ensure the support range of repetitive sequence still within [0,100%] so as to let users set up minimum support threshold in a traditional way; and (3) we propose an algorithm, RptGSP to efficiently mine such repetitive patterns in sequence database by improving the classic algorithm GSP. Experimental results show that RptGSP is very efficient. Repetitive sequential patterns (RSP) mining plays very important roles and has been widely studied in DNA or genome, but there are only a few relevant approaches focusing on mining RSP from sequence database. Taking sequence <bcbcbcbca> for example, traditional sequential pattern mining algorithms only consider that <bc> appears at one time when calculating the support of <bc>, regardless of at least 4 times that <bc> appears within this same data sequence. Accordingly, to catch much more interesting sequential patterns, repetitive property needs to be involved during the mining process. However, currently the most relevant RSP methods focus on DNA analysis considering that they cannot be used for recognizing repetitive patterns on events sequences. Therefore, we propose an approach to determine the number of times a sequence repeatedly makes an appearance in a certain data sequence. The support value of a sequence could be more than 100% as this sequence might repeat in one data sequence, therefore we proposed a strategy to ensure the support range of repetitive sequence still within [0,100%]. Finally, we proposed an efficient algorithm, called RptGSP, to discover such repetitive sequential patterns based on improving GSP Algorithm. The experimental results reveal that RptGSP can efficiently discover the repetitive patterns.

Keywords:

GSP, Repetitive pattern, Sequence database.