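Abstract: DNA methylation is ubiquitous across the tissues and cells of all kinds of organisms and is a biochemical process that occurs mainly on cytosine. By preventing transcription proteins from binding to genes, it can affect gene transcription and thereby suppress gene expression. DNA methylation therefore plays an important role in both development and disease formation, and it is regarded as an important epigenetic mark. Identifying methylation sites is thus of great significance for basic biological research and drug development. Researchers have previously tried various methods for identifying DNA methylation sites, but these are either time-consuming and laborious or insufficiently accurate. With the progress of the Human Genome Project and the application of a wide variety of high-throughput detection technologies, the volume of DNA sequence data has grown explosively, and a method that identifies DNA methylation sites efficiently and accurately is urgently needed.
To address these problems, this thesis uses a support vector machine (SVM) as the prediction engine on an effective benchmark dataset. Using statistical methods, two feature vectors, position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP), are extracted and combined with the traditional nucleotide composition (NC) feature to encode DNA sequences. Classifier performance is evaluated on the benchmark dataset with the rigorous jackknife test. Compared with the best existing prediction results, the Mcc is improved by 23.1%, which indicates that the overall accuracy of predicting methylation sites in DNA sequences is indeed significantly improved.
Keywords: DNA methylation; position-specific nucleotide propensity; support vector machine; jackknife test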
Abstract: DNA methylation is widespread across organisms, tissues, and cells, and is a biochemical process that occurs predominantly on cytosine. By impeding the binding of transcriptional proteins to the gene, DNA methylation can affect gene transcription and thus inhibit gene expression. DNA methylation therefore plays an important role in development and disease formation through epigenetic gene regulation, and it is regarded as an important epigenetic mark. Identifying methylation sites is consequently of great significance for both basic research and drug development. Although a number of methods have been developed for this purpose, they are either time-consuming or insufficiently accurate. With the progress of the Human Genome Project and the application of a wide range of high-throughput detection technologies, the volume of DNA sequence data has grown explosively, and a method that can accurately identify DNA methylation sites is urgently needed.
To address these problems, this dissertation selects an effective benchmark dataset and uses a support vector machine (SVM) as the prediction engine. Statistical methods are adopted to extract feature vectors from DNA sequences. Two new features, i.e. position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP), are introduced and combined with the traditional nucleotide composition (NC) feature to encode DNA sequences. On the benchmark dataset, the rigorous jackknife test is used to evaluate the performance of the classifier. Compared with the best existing prediction results, the Matthews correlation coefficient (Mcc) is improved by 23.1%, which indicates that the accuracy of predicting methylation sites in DNA sequences is significantly improved.
Keywords: DNA methylation; PSNP; SVM; Jackknife test
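As an illustration of three terms used in the abstract, the following minimal Python sketch computes the nucleotide composition (NC) feature, the Mcc from confusion-matrix counts, and the leave-one-out splits of a jackknife test. It relies only on the standard definitions of these quantities; the function names are hypothetical and do not come from the thesis, and the PSNP/PSDP features and the SVM itself are deliberately left out because their exact formulation is not given here.

```python
import math
from collections import Counter


def nucleotide_composition(seq):
    """Return the frequencies of A, C, G, T in seq (standard NC feature)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [counts[base] / total for base in "ACGT"]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: return 0 when a marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def jackknife_splits(n):
    """Yield (train_indices, test_index), leaving one sample out each time."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```

In a jackknife test, each of the n samples is held out once, the classifier is trained on the remaining n - 1 samples, and the held-out predictions are pooled into the counts passed to `mcc`.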
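Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.3 Main Work of the Thesis
Chapter 2 Overview of Methylation Site Identification
2.1 Methylation Site Identification Workflow
2.2 Benchmark Dataset
2.3 Feature Extraction
2.3.1 Methods Based on Statistical Features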