Overview
上一篇文章python计算smoothed PSSM(一)当中,介绍了以当前氨基酸残基为基点,左右取相同数目的序列,然后叠加计算。Chris
介绍,这样的算法有特定的用场:蛋白质后修饰。但是,普通的蛋白质序列提取特征就不太适用了:因为窗口值(smoothed window)
只能取奇数,而如果有偶数长度的序列片段包含有特征,这种算法就会漏掉。于是决定写一个新的python
脚本,把所有特征全部包含进去。
想法很简单:以当前残基为基点,直接向后连续取w_smth
个氨基酸,并叠加计算最后存入新矩阵的当前位置。我做了一个循环,将窗口值不大于w_smth
的矩阵全部存入一个相同的新矩阵当中,这样特征就全了。
1 python编码
1.1 t34pssm.py
这部分代码跟前面的代码只有一处不同:不用判断窗口值是否为奇数。这部分内容可参考上一篇内容。
1.2 pssm_smoothed_head.py
这部分代码完成的功能如下:
- 将每条序列的
pssm
矩阵的左半部分截取,存入矩阵PSSM-orig
。 - 对矩阵
PSSM_orig
进行叠加操作,生成矩阵PSSM_smth_head_full
。 - 根据需要截取
PSSM_smth_head_full
的前n
个序列,并存入PSSM_smth_head_final
。 - 将
PSSM_smth_head_final
合并为一行写入文件,每条序列占一行。
代码如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 | #! /usr/bin/env python # -*- coding: utf-8 -*- # vim:fenc=utf-8 """ Retrieve smoothed_head PSSM features """ import sys import numpy as np import math import re import fileinput def pssm_smth_head_n(fi,output_smth_head,w_smth_head,n): # 0-19 represents amino acid 'ARNDCQEGHILKMFPSTWYV' w_smth_head = int (w_smth_head) n = int (n) Amino_vec = "ARNDCQEGHILKMFPSTWYV" PSSM = [] PSSM_orig = [] seq_cn = 0 # 读取pssm文件 for line, strin in enumerate (fileinput. input (fi)): if line > 2 : str_vec = strin.split()[ 1 : 22 ] if len (str_vec) = = 0 : break PSSM.append( map ( int , str_vec[ 1 :])) seq_cn + = 1 print seq_cn fileinput.close() #original PSSM #将每条序列的`pssm`矩阵的左半部分截取,存入矩阵`PSSM-orig` PSSM_orig = np.array(PSSM) #print PSSM_orig PSSM_smth_head_final = np.array([[ 0.0 ] * 20 ] * (n * w_smth_head)) #section for PSSM_smth_head features for k in range ( 1 ,w_smth_head + 1 ): PSSM_smth_head = np.array([[ 0.0 ] * 20 ] * seq_cn) #print PSSM_smth_head #对矩阵`PSSM_orig`进行叠加操作,生成矩阵`PSSM_smth_head_full`。 PSSM_smth_head_full = pssm_smth_head(PSSM_orig,PSSM_smth_head,k,seq_cn) #print PSSM_smth_head_full #print np.shape(PSSM_smth_head_full) #根据需要截取`PSSM_smth_head_full`的前`n`个序列,并存入`PSSM_smth_head_final`。 for i in range (n): PSSM_smth_head_final[i + n * (k - 1 )] = PSSM_smth_head_full[i] #print PSSM_smth_head_final PSSM_smth_head_final_shp = np.shape(PSSM_smth_head_final) file_out_smth_head = file (output_smth_head, 'a' ) #将`PSSM_smth_head_final`合并为一行写入文件,每条序列占一行 np.savetxt(file_out_smth_head, [np.reshape(PSSM_smth_head_final, (PSSM_smth_head_final_shp[ 0 ] * PSSM_smth_head_final_shp[ 1 ], ))], delimiter = "," ) def pssm_smth_head(PSSM_orig,PSSM_smth_head,w_smth_head,l): for i in range (l): if i < = l - w_smth_head: for j in range (i,i + w_smth_head): PSSM_smth_head[i] + = PSSM_orig[j] else : for j in range (i,l): PSSM_smth_head[i] + = PSSM_orig[j] return PSSM_smth_head |
1.3 总结
这个程序可以得到比较全的特征。比如取窗口值为10
,那么窗口值为1,2,3,4,5,6,7,8,9,10
的特征将会全部被包含在内,并合成一行。
在linux
中进入文件所在的目录,然后终端运行如下命令:
1 | python t34pssm.py T4undrsmp.txt . / t4 . / t4pssm w_smth n |
需要保存的结果文件,自己修改。