列出给出序列的crispr 引物

手动寻找cripsr 引物比较麻烦,而现在有些网站可以完成这一任务,但是,用python 去实现它也很简单。以下是脚本:

 1 #!/usr/bin/python
 2 # list all crispr-target(20 bp + NGG)
 3 
 4 import re
 5 from Bio.Seq import reverse_complement   # use Biopython module
 6 
 7 genome_seq = open("seq.txt")
 8 crispr_list = open("crispr_list.txt", "a")    # \'a\' 指的是 append
 9 
10 sequence = ""
11 for line in genome_seq:
12     sequence = sequence + line.rstrip("\n")  # use str.rstrip() method
13 
14 def GC_number(seq):
15     num_G = seq.count(\'G\')
16     num_C = seq.count(\'C\')
17     GC_number = num_G + num_C
18     return GC_number
19 
20 for i in range(0, len(sequence)):
21     target = sequence[i:i+23]
22     
23     if re.search(r\'[ATCG]{21}GG$\', target):    # use regular expression
24     start = i + 1
25     end   = i + 20
26     target = "+ " + target + " from " + str(start) + " to " + str(end) +  "---" + "GC_number = " +  str(GC_number(target))
27     crispr_list.write(target + "\n")     # add \'\n\'
28     #print(target)
29 
30 RC_sequence = reverse_complement(sequence)   # 调用上面的 reverse_complement() method
31 
32 for i in range(0, len(RC_sequence)):
33     target = RC_sequence[i:i+23]
34     
35     if re.search(r\'[ATCG]{21}GG$\', target):
36     start = i + 1
37     end   = i + 20
38     target = "- " + target + " from " + str(start) + " to " + str(end) +  "---" + "GC_number = " +  str(GC_number(target))
39     crispr_list.write(target + "\n")
40     #print(target)
41 
42 genome_seq.close()
43 crispr_list.close()

理论上,这正则表达式更直接和简单,但是以下代码的结果远远少于上面方法的结果:

1 dna = sequence
2 runs = re.findall(r\'[ATCG]{21}GG\', dna)
3 for match in runs:
4     #print(str(match))

目前还不是很清楚原因。

 

版权声明:本文为OA-maque原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
本文链接:https://www.cnblogs.com/OA-maque/p/4808859.html