Protein Sequence Feature Vector Computation: Data Processing Step (4)

This is step (4) of the data processing pipeline. It consists of six sub-steps.
The first three sub-steps are:

1. AAC, amino acid composition (AminoAcidC.py)
2. SEQ, sequence (Seq.py)
3. eft3, amino acid combination properties, involving the KMP algorithm (Eft3.py)
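Sub-step 3 relies on the KMP (Knuth-Morris-Pratt) string-matching algorithm. The post does not show how Eft3.py and kmp.py use it, so the following is only a minimal sketch of the standard algorithm, assuming it is used to count occurrences of a short residue pattern in a sequence. The function names `build_failure_table` and `kmp_count` are hypothetical and are not taken from kmp.py.

```python
def build_failure_table(pattern):
    """Return the KMP failure table.

    table[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter matching prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_count(text, pattern):
    """Count occurrences of pattern in text, overlapping matches included."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    count = 0
    k = 0  # number of pattern characters matched so far
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = table[k - 1]  # keep scanning so overlapping matches count
    return count
```

For example, `kmp_count("MKLAMKL", "MKL")` counts the two occurrences of the hypothetical pattern `MKL`. The failure table lets the scan avoid re-reading characters of the sequence, so the running time is linear in the combined length of the sequence and the pattern.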

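These three sub-steps use six script files: feature_calc.sh, AminoAcidC.py, Seq.py, Eft3.py, kmp.py, and eft3_fts.py.
feature_calc.sh is the entry point for the first three sub-steps and calls the five Python scripts automatically.
When running the scripts on T3undrsmp.txt, T4undrsmp.txt, and T6undrsmp.txt, pass the parameter N_term_rsds with each of the values 25, 30, and 50 in turn. This yields 27 files (3 input files × 3 values of N_term_rsds × 3 features) in the same format as T3aac_N25.txt. Sort them into 9 folders, one per combination of input file and N_term_rsds value, for use by the last three sub-steps.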
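To illustrate what the N_term_rsds parameter controls, here is a minimal sketch of an amino acid composition computed over only the first N_term_rsds residues of a sequence. It is not the code in AminoAcidC.py: the function name `aac`, the residue ordering, and the choice to normalise by the number of standard residues in the window are all assumptions made for this example.

```python
# The 20 standard amino acids in alphabetical one-letter order.
# (Assumption: AminoAcidC.py may use a different ordering.)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence, n_term_rsds):
    """Return the 20 amino acid fractions of the first n_term_rsds residues."""
    window = sequence[:n_term_rsds].upper()
    counts = [window.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters such as X are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


if __name__ == "__main__":
    example = "MKLLAVAGSTTWYPQE"  # made-up sequence, for illustration only
    for n in (25, 30, 50):
        print(n, aac(example, n))
```

Sequences shorter than N_term_rsds are simply used in full here, because Python slicing does not raise an error past the end of a string. Whether the original scripts pad, skip, or truncate such sequences is not stated in this post.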

The last three sub-steps are:

4. AAINDEX (support code/chenzheng)
   ./encode -i <chen’s file format> -o ../t3_1.aaindex -t aaindex
5. CKSAAP (support code/chenzheng)
   ./encode -i <chen’s file format> -o ../t3_1.cksaap -t cksaap
   (Chen’s aaindex and cksaap outputs have a comma at the end of each line.)
6. PSSM
   python t34pssm.py T4undrsmp.txt ./t4 ./t4pssm

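The <chen’s file format> input required by sub-steps 4 and 5 is produced by running fasta_chenfmt_std.py on T3undrsmp.txt, T4undrsmp.txt, and T6undrsmp.txt, which gives T3toChen.txt, T4toChen.txt, and T6toChen.txt. These three files are stored in the folder Chen'sFormatFile.
The encode program used in sub-steps 4 and 5 must be rebuilt before use: delete the existing binary and .o files in the original folder, then recompile and run it. For the build procedure, see the Makefile in the encode folder.
The three files produced by these three sub-steps are placed, according to whether they belong to T3, T4, or T6, into the nine folders created in the first three sub-steps. Each folder then contains 6 files.
At this point each sub-step has output a file of feature vectors. Every 6 feature vector files are then assembled into one CSV file that contains all six kinds of features. The 9 resulting CSV feature files are in the results folder.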
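The assembly step can be sketched roughly as follows. The post does not describe the layout of the six feature files, so this example assumes that each file holds one comma-separated line per sequence and that all six files list the sequences in the same order. It also strips the trailing comma that Chen's aaindex and cksaap outputs carry. The function names `read_feature_rows`, `assemble`, and `to_csv_text` are hypothetical.

```python
import csv
import io


def read_feature_rows(lines, delimiter=","):
    """Parse feature lines into lists of fields, dropping trailing delimiters."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        line = line.rstrip(delimiter)  # remove the trailing comma, if any
        rows.append([field.strip() for field in line.split(delimiter)])
    return rows


def assemble(feature_tables):
    """Join several feature tables column-wise, one output row per sequence."""
    n_rows = len(feature_tables[0])
    if any(len(table) != n_rows for table in feature_tables):
        raise ValueError("feature files do not have the same number of rows")
    merged = []
    for i in range(n_rows):
        row = []
        for table in feature_tables:
            row.extend(table[i])
        merged.append(row)
    return merged


def to_csv_text(rows):
    """Serialise the merged rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

In practice, `read_feature_rows` would be called once per feature file in a folder (for example with an open file object), and the six resulting tables passed to `assemble`. The row-count check guards against accidentally combining files that come from different datasets or different N_term_rsds values.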
