Parse GenBank files with BioPython
BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them.
Single-record GenBank files
Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the “Iterate over several records” method below.
# Read a single GenBank record in a file into BioPython
from Bio import GenBank
feature_parser = GenBank.FeatureParser() #create the parser object
gb_file = "AE017199.gbk" #specify a genbank file
# Note the parser needs an open file object.
gb_record = feature_parser.parse(open(gb_file,"r"))
Iterate over several records
If there are several records in the file, then you can iterate over them. Here’s how:
# Iterate over multiple GenBank records in a single file.
from Bio import GenBank
# open the GenBank file
gb_file = "cor6_6.gb"
gb_handle = open(gb_file, "r")
feature_parser = GenBank.FeatureParser()
gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
while True:
rec = gb_iterator.next() #see Note 1
if rec is None:
break
# whatever you want to do to the sequence goes here.
# In this example, the name, number of features,
# and sequence itself are printed.
print "Name: %s, %i features" % (rec.name, len(rec.features))
print rec.seq
Note 1: the next() method grabs the next item in the iterator. If there’s nothing left in the iterator (that is, it’s already returned its last item) then it returns a None. Iterators are very memory efficient but need a little extra code to avoid errors.back to code
Parse a GenBank file into a dictionary
By parsing a GenBank file into a dictionary, you can access records by specifying their accession number, like so:
#Parse a GenBank file into a Python dictionary
from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
Index a GenBank record by protein ID
This useful function is from Peter Cock’s section on BioPython.
def index_genbank_features(gb_record, feature_type, qualifier) :
answer = dict()
for (index, feature) in enumerate(gb_record.features):
if feature.type==feature_type:
if qualifier in feature.qualifiers:
for value in feature.qualifiers[qualifier]:
if value in answer:
print "WARNING - Duplicate key %s \
for %s features %i and %i" % (value,\
feature_type)
else:
answer[value] = index
return answer
It’s used like this:
GBindex = index_genbank_features(gb_record,"CDS","protein_id")
print GBindex['AP0001']
If GBindex['AP0001'] is 19, then gb_record[19] is the corresponding record for that protein id. Tie it all together to get the sequence of the protein:
gb_record[GBindex['AP0001']].seq
Tags: accession, BioPython, dictionary, GenBank, parsing, records