您正在查看: 2016年3月

数据挖掘之数据标准化(Normalization)

Overview

在数据挖掘工作中,往往需要对得到的特征向量进行数据预处理。其中有重要的一步就是标准化(Normalization),也叫“归一化”。其目的就是为了放置得到的多个特征量纲差距过大,从而影响精度,而且也能是模型收敛速度加快。而归一化的方式一般有下面两种。更多详细内容参考维基百科:Normalization (statistics)

1.1 重新缩放法

这个是最简单的标准化方法,而且是线性的。公式如下:

$$ X^{'}=a+\left ( b-a \right )\frac{X-X_{min}}{X_{max}-X_{min}} $$

其中 $a$ 和 $b$ 分别表示缩放的下限和上限,输入数据和输出数据,原矩阵中最小值和最大值分别由 $X,X^{'},X_{min},X_{max}$ 表示。

1.2 标准化方法

标准化方法有一个好处:python里面的numpy包中有求矩阵均值,标准差等统计参数的函数,很方便。

$$ x^{'}=\frac{x-\bar{x}}{\sigma } $$

其中,$x,\bar{x},\sigma ,x^{'}$ 分别表示原矩阵中需要归一化的输入数据,原矩阵数据的均值,原矩阵数据的标准差,归一化后的输出数据。

python计算smoothed PSSM(二)

Overview

上一篇文章python计算smoothed PSSM(一)当中,介绍了以当前氨基酸残基为基点,左右取相同数目的序列,然后叠加计算。Chris介绍,这样的算法有特定的用场:蛋白质后修饰。但是,普通的蛋白质序列提取特征就不太适用了:因为窗口值(smoothed window)只能取奇数,而如果有偶数长度的序列片段包含有特征,这种算法就会漏掉。于是决定写一个新的python脚本,把所有特征全部包含进去。
想法很简单:以当前残基为基点,直接向后连续取w_smth个氨基酸,并叠加计算最后存入新矩阵的当前位置。我做了一个循环,将窗口值不大于w_smth的矩阵全部存入一个相同的新矩阵当中,这样特征就全了。

1 python编码

1.1 t34pssm.py

这部分代码跟前面的代码只有一处不同:不用判断窗口值是否为奇数。这部分内容可参考上一篇内容。

1.2 pssm_smoothed_head.py

这部分代码完成的功能如下:

  1. 将每条序列的pssm矩阵的左半部分截取,存入矩阵PSSM-orig
  2. 对矩阵PSSM_orig进行叠加操作,生成矩阵PSSM_smth_head_full
  3. 根据需要截取PSSM_smth_head_full的前n个序列,并存入PSSM_smth_head_final
  4. PSSM_smth_head_final合并为一行写入文件,每条序列占一行。

代码如下:

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    # vim:fenc=utf-8
    
    """
    Retrieve smoothed_head PSSM features 
    """
    import sys
    import numpy as np
    import math
    import re
    import fileinput 

    def pssm_smth_head_n(fi,output_smth_head,w_smth_head,n):
        # 0-19 represents amino acid 'ARNDCQEGHILKMFPSTWYV'
        w_smth_head=int(w_smth_head)
        n=int(n)
        Amino_vec = "ARNDCQEGHILKMFPSTWYV"

        PSSM = []
        PSSM_orig = []  
        seq_cn = 0

        # 读取pssm文件
        for line, strin in enumerate(fileinput.input(fi)):
        
            if line > 2:
                str_vec = strin.split()[1:22]
                if len(str_vec) == 0:
                    break
                PSSM.append(map(int, str_vec[1:]))
                seq_cn += 1
        print seq_cn            
        fileinput.close()

        #original PSSM
        #将每条序列的`pssm`矩阵的左半部分截取,存入矩阵`PSSM-orig`
        PSSM_orig=np.array(PSSM)
        #print PSSM_orig
    
        PSSM_smth_head_final=np.array([[0.0]*20]*(n*w_smth_head))

        #section for PSSM_smth_head features
        for k in range(1,w_smth_head+1):
            PSSM_smth_head = np.array([[0.0]*20]*seq_cn)
            #print PSSM_smth_head

            #对矩阵`PSSM_orig`进行叠加操作,生成矩阵`PSSM_smth_head_full`。                                                                                                            
            PSSM_smth_head_full=pssm_smth_head(PSSM_orig,PSSM_smth_head,k,seq_cn)    
            #print PSSM_smth_head_full
            #print np.shape(PSSM_smth_head_full)

            #根据需要截取`PSSM_smth_head_full`的前`n`个序列,并存入`PSSM_smth_head_final`。
            for i in range(n):
                PSSM_smth_head_final[i+n*(k-1)]=PSSM_smth_head_full[i]          
        #print PSSM_smth_head_final
        PSSM_smth_head_final_shp=np.shape(PSSM_smth_head_final)     
        file_out_smth_head=file(output_smth_head,'a')

        #将`PSSM_smth_head_final`合并为一行写入文件,每条序列占一行
        np.savetxt(file_out_smth_head, [np.reshape(PSSM_smth_head_final, (PSSM_smth_head_final_shp[0] * PSSM_smth_head_final_shp[1], ))], delimiter=",")

    def pssm_smth_head(PSSM_orig,PSSM_smth_head,w_smth_head,l):
        for i in range(l):
            if i <=l-w_smth_head:
                for j in range(i,i+w_smth_head):
                            
                    PSSM_smth_head[i]+=PSSM_orig[j]             
            else:
                for j in range(i,l):
                    PSSM_smth_head[i]+=PSSM_orig[j]     
        return PSSM_smth_head

1.3 总结

这个程序可以得到比较全的特征。比如取窗口值为10,那么窗口值为1,2,3,4,5,6,7,8,9,10的特征将会全部被包含在内,并合成一行。
linux中进入文件所在的目录,然后终端运行如下命令:

    python t34pssm.py T4undrsmp.txt ./t4 ./t4pssm  w_smth n 

需要保存的结果文件,自己修改。

python计算smoothed PSSM(一)

Overview

最近几天,Chris和我看了很多论文,对PSSM有了更深的认识。但是,鉴于PSSM本身包含单个位置的信息更明显,而几乎没有包含蛋白质序列片段信息,我们两人思考如何将蛋白质序列片段信息编码,终于找到了一种PSSM的处理方式,这种方式叫做smoothed window,特此记录一下。
该算法原理,请参考这篇论文:Predicting RNA-binding sites of proteins using support vector machines and evolutionary information,并在此感谢该论文作者!并感谢Chris对我的鼓励和帮助!

1 python编码

1.1 t34pssm.py

这部分代码是主函数,是之前这篇文章中的代码,可以参考这里蛋白质序列特征提取方法之——PSSM,就不详加解释了。

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    # vim:fenc=utf-8
    """
        python t34pssm.py total_train_60.fasta ./total_train_60 ./total_train_60_pssm  w_smth n 
        param:
            1. 总的fasta格式蛋白质序列
            2. 分开的fasta格式蛋白质序列的文件夹 
            3. 分开的fasta格式蛋白质序列对应的pssm的文件夹  
            4.smooth-window 值,要求是奇数。
            5.截取的序列长度,一般为25,30,50。本例为30.
    """
    import fileinput
    import sys
    from os import listdir
    from os.path import isfile, join
    import re
    from pssm_smoothed import *
    
    smplfasta = sys.argv[1]  
    spfasta = sys.argv[2]   
    check_head = re.compile(r'\>')  
    
    #read from undersample fasta, store 
    smplist = []
    smpcnt = 0
    for line, strin in enumerate(fileinput.input(smplfasta)):
        if check_head.match(strin):
            smplist.append(strin.strip())
            smpcnt += 1
    onlyfiles = [ f for f in listdir(spfasta) if isfile(join(spfasta,f)) ]

    fastaDict = {}

    for fi in onlyfiles:
        cntnt = ''
        for line, strin in enumerate(fileinput.input(spfasta+'/'+fi)):
            if line == 0:
                cntnt += strin.strip()
        if cntnt in fastaDict:
            print strin
        fastaDict[cntnt] = fi

    pssmdir = sys.argv[3]

    w_smth = sys.argv[4]
    #如果窗口值不是奇数,退出程序
    if int(w_smth)%2 ==0:
        print 'Please change your input argument ' + w_smth + ' to an odd smoothing-window number!!!'
        sys.exit(1)
    n=sys.argv[5]

    for smp in smplist:
        finalist.append(pssmdir+'/'+fastaDict[smp].split('.')[0]+'.pssm')

    for fi in finalist: 
        pssm_single(fi,'total_train_60_pssm_smth',w_smth,n)

1.2 pssm_smoothed.py

    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    # vim:fenc=utf-8
    """
    Retrieve smoothed PSSM features 
    """
    import sys
    import numpy as np
    import math
    import re
    import fileinput 
    def pssm_single(fi,output_smth,w_smth,n):
        # 0-19 represents amino acid 'ARNDCQEGHILKMFPSTWYV'
        w_smth=int(w_smth)
        n=int(n)
        Amino_vec = "ARNDCQEGHILKMFPSTWYV"

        PSSM = []
        PSSM_orig = []  
        seq_cn = 0 
        # 读取pssm文件
        for line, strin in enumerate(fileinput.input(fi)):
            if line > 2:
            str_vec = strin.split()[1:22]
            if len(str_vec) == 0:
                break
            PSSM.append(map(int, str_vec[1:]))
            seq_cn += 1
        print seq_cn            
        fileinput.close()
        PSSM_smth = np.array([[0.0]*20]*seq_cn)
        #print PSSM_smth
        PSSM_orig=np.array(PSSM)
        #print PSSM_orig
        #section for PSSM_smth features
        PSSM_smth_full=pssm_smth(PSSM_orig,PSSM_smth,w_smth,seq_cn)
        PSSM_smth_final=[[0.0]*20]*n

        #截取PSSM_smth_full矩阵的前n行,作为输出内容
        for i in range(n):
            PSSM_smth_final[i]=PSSM_smth_full[i]
            #print PSSM_smth_final[i]
        PSSM_smth_final_shp=np.shape(PSSM_smth_final)
        # for i in range(seq_cn):
        #   print PSSM_smth_final[i]
        file_out_smth=file(output_smth,'a')
        np.savetxt(file_out_smth, [np.reshape(PSSM_smth_final, (PSSM_smth_final_shp[0] * PSSM_smth_final_shp[1], ))], delimiter=",")

    #这个函数会求出整条序列的smoothed pssm矩阵
    def pssm_smth(PSSM_orig,PSSM_smth,w_smth,l):
        for i in range(l):
            #smooth窗口超过pssm上边界
            if i <(w_smth-1)/2:
                for j in range(i+(w_smth-1)/2+1):
                    #print i,j              
                    PSSM_smth[i]+=PSSM_orig[j]
                    #print PSSM_smth[i]
            #smooth窗口超过pssm下边界
            elif i>=(l-(w_smth-1)/2):
                for j in range(i-(w_smth-1)/2,l):   
                    #print i,j          
                    PSSM_smth[i]+=PSSM_orig[j]
                    #print PSSM_smth[i]
            else:
                for j in range(i-(w_smth-1)/2,i+(w_smth-1)/2+1):
                    #print i,j
                    PSSM_smth[i]+=PSSM_orig[j]
                    #print PSSM_smth[i]     
        return PSSM_smth

1.3 总结

该算法以行为单位进行运算。以本行为中心,上下扩展,扩展的上下长度为smooth窗口值,将这些行的值加起来,存入新的矩阵的相同行位置。这样新的矩阵就包含了多个连续氨基酸序列片段的信息,会为特征提取提供新的思路。