蛋白质序列特征提取方法之——CKSAAP

Overview

在CKSAAP(Compositon of k-spaced Amino Acid Pairs)方法中，利用在蛋白质序列片断中k个间隔距离的残基对(residue pairs)在该序列中的组成比例，建立数学模型，提取出特征向量，从而达到预测泛素(Ubiquitin)的目的。
残基(residue)和泛素(Ubiquitin)信息详见维基百科：残基和泛素，这里就不赘述了。
本文主要介绍该方法建立数学计算模型的思路及编程方案。

1.建模思路

以长度为48的序列LEEYRKHVAERAAEGIAPKPLDANQMAALVELLKNPPAGEEEFLLDLL为例(如果你不懂这些英文字母什么含义，请点击氨基酸简写符号)，当k=0时，我们需要提取的残基对(residue pairs)为{LE,EE,EY,……,LD,DL,LL}，即每个氨基酸和它相邻的下一个氨基酸组成一对提取出来，也就是说这两个氨基酸中间的间隔距离是k=0个氨基酸。以此类推，当k=1时，我们需要提取的残基对为{LE,EY,ER,YK,……,LD,LL,DL}……本方法中，k最大为5.
细心的你一定会发现k=0时候，结尾会有1个氨基酸没有配对，而提取的残基对数量为47；k=1时，有2个氨基酸没有配对，而提取的残基对数量为46；所以，规律就是，当序列长度为N，间隔为k时，一共可以提取的残基对数量为N-k-1，记为

$N_{Total}=N-k-1$

这里要特别感谢一下Chris同学的提醒和点拨。
由于基本氨基酸数量为20，故而可以形成的残基对数量是20×20=400.我们统计的是这些残基对在这个蛋白质序列当中出现的概率，于是便产生了一个400维的特征向量，即

$(\frac{N_{LE}}{N_{Total}},\frac{N_{EE}}{N_{Total}},\frac{N_{EY}}{N_{Total}},……,\frac{N_{DL}}{N_{Total}},\frac{N_{LL}}{N_{Total}})_{400}$

需要注意的是，本方法中设定的N为窗口值为window=27，即截取的片段长度为27(具体的截取方法将在下面给出)。可以想象，组成这个向量的400个比值当中，绝大部分为0.0，而这些比值总和为1.所以，当k分别取0,1,2,3,4,5等值时，一共会有400×6=2400个向量。

2.编程方案

encode这个文件夹中的算法，涵盖了数种主流的算法，这些算法介绍均会相继给出。

2.1 参数初始化

data.h这个文件中给出了许多初始值：

const int buffer=256;
 
// encode type
char encodeTypeInit[buffer]="cksaap";
char *encodeType=encodeTypeInit;
 
// input file name
char *inputFile;
 
// output file name
char outputFileInit[buffer]="output.txt";
char *outputFile=outputFileInit;
int outputFormat=0;
 
// pssm file directory
char pssmFileDirectoryInit[buffer]="PSSM";
char *pssmFileDirectory=pssmFileDirectoryInit;
 
// disorder file directory
char disorderFileDirectoryInit[buffer]="DisorderVSL2";
char *disorderFileDirectory=disorderFileDirectoryInit;
 
// aggregation file directory
char aggFileDirectoryInit[buffer]="Aggregation";
char *aggFileDirectory=aggFileDirectoryInit;
 
// AAindex parameter file
char aaindexFileInit[buffer]="AAindex.txt";
char *aaindexFile=aaindexFileInit;
char ifFeatureSelectionInit[buffer]="N";
char *ifFeatureSelection=ifFeatureSelectionInit;
char selectedFeatureFileInit[buffer]="SelectedAAindexFeature.txt";
char *selectedFeatureFile=selectedFeatureFileInit;
 
// KNN
char knnTrainFileInit[buffer]="train.txt";
char *knnTrainFile=knnTrainFileInit;
char topKFileInit[buffer]="topKValues.txt";
char *topKFile=topKFileInit;

以及截取蛋白质序列片段的参数(seqLen)和窗口值(window)

int seqLen=51;
int window=27;

这些参数中的大部分可以从程序入口函数修改设定，encode.cc这个c++文件是程序的入口。

2.2 参数输入

以我们的项目来说，

1	`./encode -i <chen’s file format> -o ../t3_1.cksaap -t cksaap`

这句Linux命令，提供了input，output和encode type，而下面的语句则是判断input，output和encode type的作用。

// input
if(strcmp(argv[i], "-i")==0){
    inputFile=argv[i+1];
}
 
// output
if(strcmp(argv[i], "-o")==0){
    outputFile=argv[i+1];
}
 
// encode type
if(strcmp(argv[i], "-t")==0){
    encodeType=argv[i+1];
}

而源代码里面还有判断其他选项的模块，诸如output format之类。当然，seqLen和window的值也可以在命令行中直接修改为自己所需要的值。

// Fragment and encode window
if(strcmp(argv[i], "-L") == 0){
    seqLen=atoi(argv[i+1]);
}
if(strcmp(argv[i], "-W") == 0){
    window=atoi(argv[i+1]);
}

输入文件input加载进内存，以我们的文件T3toChen.txt的第一行内容为例,

ifstream icin(inputFile);
if(!icin){
    printMessage("Can not open the input file!");
}
string tmpStr;
while(!icin.eof()){//如果没有到文件结尾，则执行循环体
    getline(icin, tmpStr);//一行一行读取，相当于Java中的readline()
    if(!tmpStr.empty()){
        string str1, str2, str3;
        int tmpTag=0;//<chen’s file format>每一行均由三个string和一个int组成，这四个值用于存储这四个量
        istringstream iStream(tmpStr);
        iStream>>str1>>str2>>str3>>tmpTag;
        /*
        str1=="LEEYRKHVAERAAEGIAPKPLDANQMAALVELLKNPPAGEEEFLLDLLTNRVPPG
                VDEAAYVKAGFLAAIAKGEAKSPLLTPEKAIELLGTMQGGYNIHP"
        str2=="56479606"
        str3=="any"
        tmpTag==-1
        */
        str3.replace(0, 1, "");//str3=="ny"
        struct prot *p=new prot;
        p->next=NULL;
        p->protSeq=str1;
        p->protName=str2;
        p->position=atoi(str3.c_str());//p->position==0;
        p->tag=tmpTag=-1;
        if(head==NULL && tail==NULL){
            head=tail=p;
        }
        else{
            tail->next=p;
            tail=p;
        }
    }
}
icin.close();
icin.clear();

由于encodeType==cksaap，

if(strcmp(encodeType, "cksaap") == 0){
    CKSAAPEncode_2(head, outputFile, ifFeatureSelection, selectedFeatureFile, outputFormat, seqLen, window);
}

故调用encordFunctions.h文件中的函数

void CKSAAPEncode_2(struct prot *data, char *sbjFile, char *featureSelection, char *selectedFeature, int outFormat, int fragLen, int encodeWindow){
    //函数体(较长),主要作用是统计残基对并求出出现的概率。
}

这几个参数中，

data=head=p;
sbjFile=outputFile＝t3_1.cksaap;
featureSelection=ifFeatureSelection="N";
selectedFeature="SelectedAAindexFeature.txt";
outFormat=0;
fragLen=51;
encodeWindow=27.

2.3 特征计算

char AA[]={'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'};
 
map<char, int> mymap;
for(int i=0; i<20; i++){
    mymap.insert(pair<char, int>(AA[i], i));
}
 
struct prot *head=data;
if(strcmp(featureSelection, "N") == 0){//满足条件
    if(outFormat==0){//满足条件
        while(head != NULL){//满足条件
            int m=1;
            if(head->tag == 1){
                ofssub<<"+1  ";
            }
            else{//满足条件
                ofssub<<"-1  ";
            }
            string seq=head->protSeq.substr((int)((fragLen/2)-int(encodeWindow/2)), encodeWindow);
            //这一句是从何处截取蛋白质序列片段的核心语句。
            /*
            对于本例子来讲,从第12个氨基酸开始，截取27个。即
            seq="AEGIAPKPLDANQMAALVELLKNPPAG".
            */
            for(int gap=0; gap<=5; gap++){
                float Num[400];//400维的数组，存储特征残基对。
                int sum=0;//存储残基对个数。
                for(int i=0; i<400; i++){ Num[i]=0; }
 
                for(int i=0; i<seq.length()-gap-1; i++){
                    if(seq.at(i)=='-'  || seq.at(i+gap+1)=='-') continue;
                    Num[mymap[seq.at(i)]*20 + mymap[seq.at(i+gap+1)]]++;//将20个氨基酸看做20进制的数，然后配对，构思巧妙。
                    sum++;
                }
 
                for(int i=0; i<400; i++){
                    ofssub<<m<<":"<<Num[i]/sum<<"  ";//输出特征向量到输出文件。
                    m++;//分量的维度。
                }
            }
            ofssub<<endl;
            head=head->next;
        }
    }
    else if(outFormat == 1){
        ofssub<<"@relation features\n\n";
        for(int i=1; i<=2400; i++){
            ofssub<<"@attribute f"<<i<<" real\n";
        }
        ofssub<<"@attribute play {yes, no}\n\n";
        ofssub<<"@data\n";
 
        while(head != NULL){
            string seq=head->protSeq.substr((int)((fragLen/2)-int(encodeWindow/2)), encodeWindow);
            for(int gap=0; gap<=5; gap++){
                float Num[400];
                int sum=0;
                for(int i=0; i<400; i++){ Num[i]=0; }
 
                for(int i=0; i<seq.length()-gap-1; i++){
                    if(seq.at(i)=='-' || seq.at(i+gap+1)=='-') continue;
                    Num[mymap[seq.at(i)]*20 + mymap[seq.at(i+gap+1)]]++;
                    sum++;
                }
 
                for(int i=0; i<400; i++){
                    ofssub<<Num[i]/sum<<",";
                }
            }
            if(head->tag == 1){
                ofssub<<"yes\n";
            }
            else{
                ofssub<<"no\n";
            }
            head=head->next;
        }
    }
    else{
        printMessage("Incorrect output file format.");
    }
}

这样，经过6次循环，这个序列就产生了一个特征向量，维度为2400，这个特征向量输出到t3_1.cksaap这个文件当中。
根据各种不同的需要，可以在命令行中或者直接在data.h文件中改变参数的初始值，以达到不同的效果。
这个算法更多细节，请点击cksaap_ubsite。

由于水平和时间有限，难免存在理解的偏差及疏漏，不足之处请联系我们。

已有 4 条评论

luke

March 12th, 2018 at 09:09 pm

Hi博主，请问你知道CKSAAP最初是由谁提出的吗？

回复
1. Chris
  
  March 13th, 2018 at 08:37 am
  
  Hi, 我能追溯到最早的论文是Chenzhen博士在中国农业大学读博士时首先提出使用的，主要用在位点预测，请参考下面两篇论文：
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022930
  https://www.ncbi.nlm.nih.gov/pubmed/23603789
  CKSAAP以及变种也在别的领域被使用，比如III和IV型分泌蛋白预测：
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0056632
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4483310/
  https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbx164/4665693
  比如β-Lactamases预测：
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1972-6
  
  就不一一举例了，如果想了解更多，可以在pubmed中搜索"composition of k-spaced amino acid pairs"，结果链接：
  https://www.ncbi.nlm.nih.gov/pubmed/?term=composition+of+k-spaced+amino+acid+pairs
  
  回复
  1. Chris
    
    March 14th, 2018 at 05:37 pm
    
    更正一下，能追溯到的最早论文是Chen Ke在University of Alberta读博士时提出使用的，可以参考下面的文章：
    https://bmcstructbiol.biomedcentral.com/articles/10.1186/1472-6807-7-25
    
    回复
wendao

September 8th, 2021 at 03:31 pm

赞！

回复