Parse GenBank files with BioPython

BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them.

Single-record GenBank files

Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the “Iterate over several records” method below.


# Read a single GenBank record in a file into BioPython

from Bio import GenBank
feature_parser = GenBank.FeatureParser()  #create the parser object
gb_file = "AE017199.gbk"  #specify a genbank file

# Note the parser needs an open file object.
gb_record = feature_parser.parse(open(gb_file,"r"))

Iterate over several records

If there are several records in the file, then you can iterate over them. Here’s how:


# Iterate over multiple GenBank records in a single file.

from Bio import GenBank

# open the GenBank file
gb_file = "cor6_6.gb"
gb_handle = open(gb_file, "r")

feature_parser = GenBank.FeatureParser()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

while True:
    rec = gb_iterator.next()  #see Note 1
    if rec is None:
        break
    # whatever you want to do to the sequence goes here.
    # In this example, the name, number of features,
    # and sequence itself are printed.
    print "Name: %s, %i features" % (rec.name, len(rec.features))
    print rec.seq

Note 1: the next() method grabs the next item in the iterator. If there’s nothing left in the iterator (that is, it’s already returned its last item) then it returns a None. Iterators are very memory efficient but need a little extra code to avoid errors.back to code

Parse a GenBank file into a dictionary

By parsing a GenBank file into a dictionary, you can access records by specifying their accession number, like so:


#Parse a GenBank file into a Python dictionary

from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()

Index a GenBank record by protein ID 

This useful function is from Peter Cock’s section on BioPython.


def index_genbank_features(gb_record, feature_type, qualifier) :
    answer = dict()
    for (index, feature) in enumerate(gb_record.features):
        if feature.type==feature_type:
            if qualifier in feature.qualifiers:
                for value in feature.qualifiers[qualifier]:
                    if value in answer:
                        print "WARNING - Duplicate key %s \
                                 for %s features %i and %i" % (value,\
                                 feature_type)
                    else:
                        answer[value] = index
    return answer

It’s used like this:


GBindex = index_genbank_features(gb_record,"CDS","protein_id")
print GBindex['AP0001']

If GBindex['AP0001'] is 19, then gb_record[19] is the corresponding record for that protein id. Tie it all together to get the sequence of the protein:


gb_record[GBindex['AP0001']].seq

Tags: , , , , ,

Leave a Reply