Read FASTA files with BioPython

Here are several ways to parse a FASTA file into BioPython Seq objects.

Getting a FASTA file into Python is as simple as importing the necessary functions from BioPython, opening the file, and calling a parser on the file. Then you have sequences in BioPython that can be readily used. There are a couple of different ways to parse the file, depending on your preference. Choices are:

  • iterator (saves memory, but only one sequence in memory at a time so you can’t randomly choose one)
  • list (takes up memory, but you can access each sequence using an index)
  • dictionary (takes up memory and a few more lines of code to make, but you can access each sequence by its accession number rather than its position in a list)

Do something to each record in a FASTA file (iterator method)

Use the following code for when you want to operate on each sequence in a fasta file and then move onto the next one. This brings in one sequence at a time, operates on it, then removes it from Python. Very memory efficient, but the sequences are not retained in Python.


from Bio import SeqIO
f = open(FASTfilename)
for i in SeqIO.parse(f,'fasta')
   print i.id
   print i.seq
   print len(i.seq)
f.close()

Make a list of Seq objects from a FASTA file (list method)

This returns a list of sequence objects, seq_list, from the fasta file. This is not as memory efficient as the code above, but the sequences are available in the list after parsing the file.


from Bio import SeqIO
f=open(FASTAfilename)
seq_list=list(SeqIO.parse(f,"fasta")
f.close()

Import into a flat database (dictionary method)

Sometimes it’s more useful to access a sequence by its accession number rather than by its position in a list. Use the Seq.to_dict() function to do this. However, using this function directly results in the keys of the dictionary being the whole description line of a fasta record (this behavior is different for GenBank files). You can provide your own function (anything that operates on a SeqRecord object). Whatever is returned from this record will be the key for the record in the dictionary. Below is a function to return the accession number of a record.


def get_accession(record):
    """Given a SeqRecord, return the accession number as a string
        e.g. gi|2765613|emb|Z78488.1|PTZ78488 -> Z78488.1    """
    parts=record.id.split("|")
    assert len(parts)==5 and parts[0]=="gi" and parts[2]=="emb"
    return parts[3]

Then, provide the SeqIO.to_dict() function below with the get_accession() function defined above. The result is a dictionary of sequence objects, keyed by accession (or whatever you told your key function to return — species perhaps? Or GI number?)


from Bio import SeqIO
f=open(FASTAfilename)
d=SeqIO.to_dict(SeqIO.parse(f,"fasta"),key_function=get_accession)
f.close()

Now you have a dictionary of FASTA records, keyed by accession number which you can use like so:


d['Z78488.1']  # --> a Seq object

Tags: ,

Leave a Reply