DNA Sequence Data Mining Technique∗
ZHU Yang-Yong1,2+, XIONG Yun1
1(Department of Computer and Information Technology, Fudan University, Shanghai 200433, China)
2(Shanghai Center for Bioinformation Technology, Shanghai 201203, China)
+ Corresponding author: Phn: +86-21-65642831, Fax: +86-21-65642219, E-mail: yunx@fudan.edu.cn, http://www.dmgroup.org.cn
Zhu YY, Xiong Y. DNA sequence data mining technique. Journal of Software, 2007,18(11):2766−2781. http://www.jos.org.cn/1000-9825/18/2766.htm
Abstract: DNA sequences are among the most basic and important kinds of biological data. Studying DNA sequence data in order to understand the essence of life is a necessary task in the post-genomic era. Data mining, which uncovers information hidden in data, is currently one of the most efficient means of data analysis and has become the main data analysis technique adopted in bioinformatics. Its application to DNA sequence analysis has attracted wide attention, developed rapidly, and produced considerable research results. This paper provides an overview of research progress in the field of DNA sequence data mining. It identifies three research phases: the application of statistics-based data mining methods, the application of general data mining methods, and the design of specialized DNA sequence-oriented data mining methods. It then explains that sequence similarity is the foundation of DNA sequence data mining. Drawing on the biological background, it also analyzes and comments on key techniques in the field, such as DNA sequential pattern, association, clustering, classification, and outlier mining. Finally, it presents future work and open issues, including research on novel storage models and index methods and the design of data mining algorithms based on biological domain knowledge.
Key words: DNA sequence; data mining; bioinformatics; sequential pattern; sequence similarity
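Abstract (Chinese): DNA sequence data are an important class of biological data. Studying DNA sequence data to interpret their meaning is a principal research task of the post-genomic era. Data mining is currently one of the most effective means of data analysis, used to discover the various regularities hidden in large volumes of data, and it is also the main data analysis technique adopted in bioinformatics. Applying data mining techniques to DNA sequence data analysis has received wide attention, developed rapidly, and yielded many research results. This paper surveys the state and progress of research in DNA sequence data mining and proposes three research phases: the application of statistics-based mining methods, the application of general mining methods, and the design of specialized DNA sequence data mining methods. It explains that sequence similarity is the foundation of DNA sequence data mining, reviews the key techniques used in the field, including DNA sequential pattern, association, clustering, classification, and outlier mining, and discusses their biological application background and significance. Finally, it presents open problems for further research in DNA sequence data mining, including DNA
∗ Supported by the National Natural Science Foundation of China under Grant No.60573093; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2006AA02Z329
Received 2007-01-23; Accepted 2007-04-25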