BioPython can send sequences directly to NCBI and will download the results when they’re ready. Once you have the output, you can parse it to, say, just pull out the hits with the top 10 lowest E values. Or generate a list of the species most frequently found in the results . . . or other useful things.
There are two separate parts to consider when you want to get blast results: running the BLAST search, and parsing the results. If you prefer, you can run a BLAST search on NCBI the normal way through NCBI’s web site. Just make sure to save the results as an XML file. Then you can skip to Part 2.
I’m going to assume you have a sequence of interest called sequence. It should be in a string format (if you have a Seq object, use Seq.data to get the string. If you have a SeqRecord object, use SeqRecord.seq.data to get the string). Need a sequence? Here’s one to play around with.
# Homo sapiens hemoglobin, beta subunit
sequence = """MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH"""
# remove newlines, just in case
sequence = sequence.replace("\n","")
Part 1: Running the BLAST search
The sequence is sent to NCBI’s QBLAST server for a blastp search (the protein version of BLAST) using the non-redundant (nr) database. When the search is done, we get a file-like object in return. The great part: Python takes care of the waiting, and downloads it as soon as it’s ready.
from Bio.Blast import NCBIWWW
# this is the actual blast search, so it may take a moment
blast_handle = NCBIWWW.qblast('blastp', 'nr', sequence)
Done! Now you’re ready for parsing the results (see Part 2 below). The rest of this section goes over optional saving of the blast results to a string or file.
Some optional bits
WARNING: blast_handle is a StringIO object, which is a special kind of string that acts like a file. Like files, StringIO objects only read each line once. So if you were to print blast_handle, it would run through its contents. If you were to try printing it again, you would see nothing, because it has already exhausted its contents by the first printing. How do you fix this? Like so:
# rewind blast_handle back to the beginning
blast_handle.seek(0)
Alternatively, you could save it as a string:
# optional: save the result as a string (rewind to the beginning first)
blast_handle.seek(0)
blast_string = blast_handle.read()
Here’s how to save it as a file on disk:
# optionally save blast_handle as a file for later use
blast_handle.seek(0)
blast_file = open('blast-output.xml', 'w')
blast_file.write(blast_handle.read())
blast_file.close()
Part 2: Analyzing the BLAST results
Open a file or use a StringIO object
If you have an XML file that you’re going to parse (either from the NCBI web site or from saving the results from above), first you need to open the file:
# If you're using a file on disk:
filename = 'blast-output.xml'
parse_me = open(filename)
Otherwise, you can use the blast_handle from above. To keep the code consistent, I’ll set up a parse_me variable:
# If you're using the results directly from the BLAST query:
parse_me = blast_handle
parse_me.seek(0)
Parse it
OK, let’s parse that puppy:
from Bio.Blast import NCBIXML
records = NCBIXML.parse(file_handle)
records is an iterator. Each item corresponds to a record in the BLAST results. If you used the NCBI web site, you could have pasted in several FASTA records. In this case, records will have several records in it. If, on the other hand, you sent a single sequence to QBLAST as in Part 1, then records will only contain a single record.
0 Responses to “Run BLAST from BioPython”