Overview

上一篇文章python计算smoothed PSSM(一)当中,介绍了以当前氨基酸残基为基点,左右取相同数目的序列,然后叠加计算。Chris介绍,这样的算法有特定的用场:蛋白质后修饰。但是,普通的蛋白质序列提取特征就不太适用了:因为窗口值(smoothed window)只能取奇数,而如果有偶数长度的序列片段包含有特征,这种算法就会漏掉。于是决定写一个新的python脚本,把所有特征全部包含进去。
想法很简单:以当前残基为基点,直接向后连续取w_smth个氨基酸,并叠加计算最后存入新矩阵的当前位置。我做了一个循环,将窗口值不大于w_smth的矩阵全部存入一个相同的新矩阵当中,这样特征就全了。

1 python编码

1.1 t34pssm.py

这部分代码跟前面的代码只有一处不同:不用判断窗口值是否为奇数。这部分内容可参考上一篇内容。

1.2 pssm_smoothed_head.py

这部分代码完成的功能如下:

  1. 将每条序列的pssm矩阵的左半部分截取,存入矩阵PSSM-orig
  2. 对矩阵PSSM_orig进行叠加操作,生成矩阵PSSM_smth_head_full
  3. 根据需要截取PSSM_smth_head_full的前n个序列,并存入PSSM_smth_head_final
  4. PSSM_smth_head_final合并为一行写入文件,每条序列占一行。

代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
 
"""
Retrieve smoothed_head PSSM features
"""
import sys
import numpy as np
import math
import re
import fileinput
 
def pssm_smth_head_n(fi,output_smth_head,w_smth_head,n):
    # 0-19 represents amino acid 'ARNDCQEGHILKMFPSTWYV'
    w_smth_head=int(w_smth_head)
    n=int(n)
    Amino_vec = "ARNDCQEGHILKMFPSTWYV"
 
    PSSM = []
    PSSM_orig = [] 
    seq_cn = 0
 
    # 读取pssm文件
    for line, strin in enumerate(fileinput.input(fi)):
     
        if line > 2:
            str_vec = strin.split()[1:22]
            if len(str_vec) == 0:
                break
            PSSM.append(map(int, str_vec[1:]))
            seq_cn += 1
    print seq_cn           
    fileinput.close()
 
    #original PSSM
    #将每条序列的`pssm`矩阵的左半部分截取,存入矩阵`PSSM-orig`
    PSSM_orig=np.array(PSSM)
    #print PSSM_orig
 
    PSSM_smth_head_final=np.array([[0.0]*20]*(n*w_smth_head))
 
    #section for PSSM_smth_head features
    for k in range(1,w_smth_head+1):
        PSSM_smth_head = np.array([[0.0]*20]*seq_cn)
        #print PSSM_smth_head
 
        #对矩阵`PSSM_orig`进行叠加操作,生成矩阵`PSSM_smth_head_full`。                                                                                                           
        PSSM_smth_head_full=pssm_smth_head(PSSM_orig,PSSM_smth_head,k,seq_cn)   
        #print PSSM_smth_head_full
        #print np.shape(PSSM_smth_head_full)
 
        #根据需要截取`PSSM_smth_head_full`的前`n`个序列,并存入`PSSM_smth_head_final`。
        for i in range(n):
            PSSM_smth_head_final[i+n*(k-1)]=PSSM_smth_head_full[i]         
    #print PSSM_smth_head_final
    PSSM_smth_head_final_shp=np.shape(PSSM_smth_head_final)    
    file_out_smth_head=file(output_smth_head,'a')
 
    #将`PSSM_smth_head_final`合并为一行写入文件,每条序列占一行
    np.savetxt(file_out_smth_head, [np.reshape(PSSM_smth_head_final, (PSSM_smth_head_final_shp[0] * PSSM_smth_head_final_shp[1], ))], delimiter=",")
 
def pssm_smth_head(PSSM_orig,PSSM_smth_head,w_smth_head,l):
    for i in range(l):
        if i <=l-w_smth_head:
            for j in range(i,i+w_smth_head):
                         
                PSSM_smth_head[i]+=PSSM_orig[j]            
        else:
            for j in range(i,l):
                PSSM_smth_head[i]+=PSSM_orig[j]    
    return PSSM_smth_head

1.3 总结

这个程序可以得到比较全的特征。比如取窗口值为10,那么窗口值为1,2,3,4,5,6,7,8,9,10的特征将会全部被包含在内,并合成一行。
linux中进入文件所在的目录,然后终端运行如下命令:

1
python t34pssm.py T4undrsmp.txt ./t4 ./t4pssm  w_smth n

需要保存的结果文件,自己修改。