<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>scienceoss.com &#187; BioPython</title>
	<atom:link href="http://scienceoss.com/categories/python/biopython/feed/" rel="self" type="application/rss+xml" />
	<link>http://scienceoss.com</link>
	<description>useful tidbits for using open source software in science</description>
	<lastBuildDate>Wed, 26 May 2010 03:34:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Check the type of array in Numpy</title>
		<link>http://scienceoss.com/check-the-type-of-array-in-numpy/</link>
		<comments>http://scienceoss.com/check-the-type-of-array-in-numpy/#comments</comments>
		<pubDate>Mon, 17 Mar 2008 00:55:39 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[array]]></category>
		<category><![CDATA[type]]></category>

		<guid isPermaLink="false">http://scienceoss.com/check-the-type-of-array-in-numpy/</guid>
		<description><![CDATA[from numpy import array a = array([1,2,3]) a.dtype # dtype('int32') a.dtype.kind # 'i', for 'integer' s = array(['a','b','c']) s.dtype # dtype('&#124;S1') s.dtype.kind # 'S' for 'string' f = array([1., 2., 3.]) f.dtype # dtype('float64') f.dtype.kind # 'f' for 'float']]></description>
			<content:encoded><![CDATA[<pre class="prettyprint"><code class="code">from numpy import array
a = array([1,2,3])
a.dtype  # dtype('int32') here; dtype('int64') on many 64-bit platforms
a.dtype.kind  # 'i', for 'integer'

s = array(['a','b','c'])
s.dtype  # dtype('|S1')
s.dtype.kind  # 'S' for 'string'

f = array([1., 2., 3.])
f.dtype  # dtype('float64')
f.dtype.kind  # 'f' for 'float'</code></pre>
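<p>The one-letter <span class="c">kind</span> code is handy for branching on the broad category of an array without caring about its exact width. Here is a minimal sketch that assumes only NumPy; the helper name <span class="c">describe_kind</span> and its label table are made up for illustration:</p>
<pre class="prettyprint"><code class="code">
import numpy as np

# Hypothetical lookup table: maps a dtype.kind code to a readable label.
KIND_LABELS = {
    'b': 'boolean',
    'i': 'signed integer',
    'u': 'unsigned integer',
    'f': 'float',
    'c': 'complex',
    'S': 'byte string',
    'U': 'unicode string',
    'O': 'object',
}

def describe_kind(arr):
    """Return a readable label for the kind of a NumPy array."""
    return KIND_LABELS.get(arr.dtype.kind, 'other')

describe_kind(np.array([1, 2, 3]))     # 'signed integer'
describe_kind(np.array([1., 2., 3.]))  # 'float'
</code></pre>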
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/check-the-type-of-array-in-numpy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Return a list of codons from a sequence</title>
		<link>http://scienceoss.com/return-a-list-of-codons-from-a-sequence/</link>
		<comments>http://scienceoss.com/return-a-list-of-codons-from-a-sequence/#comments</comments>
		<pubDate>Thu, 10 Jan 2008 15:44:57 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[codons]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=58</guid>
		<description><![CDATA[There are no built-in functions (that I know of) for returning a list of codons from a sequence, but making your own is quite simple. This function is from Solution A.5 from the Python Course in Bioinformatics def codons(s, frame=0): """Return a list of codons from a string, s, giving an optional frameshift.""" codons=[] end=len(s[frame:]) [...]]]></description>
			<content:encoded><![CDATA[<p>There are no built-in functions (that I know of) for returning a list of codons from a sequence, but making your own is quite simple. This function is from Solution A.5 of the Python Course in Bioinformatics.</p>
<pre class = "prettyprint"><code class = "code">
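def codons(s, frame=0):
    """Return a list of codons from a string, s, giving an optional frameshift."""

    codon_list = []

    # Index just past the last complete codon, measured in s itself
    # (not in the frame-shifted slice), so it lines up with range() below.
    end = len(s) - (len(s) - frame) % 3

    for i in range(frame, end, 3):
        codon_list.append(s[i:i+3])

    return codon_list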
</code></pre>
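<p>The same slicing can also be written as a list comprehension. This is only a sketch: the name <span class="c">codons_compact</span> is made up here and is not part of the course solution. Stopping the range two letters before the end of the string means a trailing partial codon is never emitted.</p>
<pre class="prettyprint"><code class="code">
def codons_compact(s, frame=0):
    """Return the complete codons of s, starting at offset frame."""
    # A codon starting at i needs letters i, i+1 and i+2,
    # so the last valid start is len(s) - 3.
    return [s[i:i+3] for i in range(frame, len(s) - 2, 3)]

codons_compact('ATGCCCGGG')      # ['ATG', 'CCC', 'GGG']
codons_compact('ATGCCCGGG', 2)   # ['GCC', 'CGG']
</code></pre>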
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/return-a-list-of-codons-from-a-sequence/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A quick codon table</title>
		<link>http://scienceoss.com/a-quick-codon-table/</link>
		<comments>http://scienceoss.com/a-quick-codon-table/#comments</comments>
		<pubDate>Thu, 06 Dec 2007 02:40:25 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[codons]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=50</guid>
		<description><![CDATA[Sometimes it&#8217;s nice to have a codon table handy. Rather than typing one out by hand, the Translate module contains the codon table of your choice in dictionary form: Note 1: See the NCBI Translation Tables for the correct id number to use for your sequence. And here&#8217;s how to easily reverse it, where [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes it&#8217;s nice to have a codon table handy. Rather than typing one out by hand, the Translate module contains the codon table of your choice in dictionary form:<span id="more-50"></span></p>
<pre class="brush: python; title: ; notranslate">
from Bio import Translate
translater=Translate.unambiguous_dna_by_id[11]
bacterial_table=translater.table.forward_table
</pre>
<p><a name="note1" title="note1" id="note1"></a><br />
<strong>Note 1:</strong> See the <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi">NCBI Translation Tables</a> for the correct id number to use for your sequence.</p>
<pre class="brush: python; title: ; notranslate">
&gt;&gt;&gt; bacterial_table['TAT']
'Y'
</pre>
<p>And here&#8217;s how to easily reverse it, where the keys are amino acids and the values are lists of codons:</p>
<pre class="brush: python; title: ; notranslate">
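back_table = {}
# items() works on both Python 2 and 3 (iteritems() is Python 2 only)
for codon, amino_acid in bacterial_table.items():
    # start a new list the first time an amino acid is seen
    back_table.setdefault(amino_acid, []).append(codon)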
</pre>
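<p>To see what the reversed table looks like without installing anything, here is a small sketch that uses a hand-typed excerpt of a codon table in place of <span class="c">forward_table</span>. The six entries are an assumption chosen for illustration, not the full table.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

# A tiny hand-typed excerpt of a codon table, standing in for forward_table.
forward_table = {'TAT': 'Y', 'TAC': 'Y', 'ATG': 'M', 'TGG': 'W',
                 'CAT': 'H', 'CAC': 'H'}

# defaultdict(list) creates the empty list the first time a key is used.
back_table = defaultdict(list)
for codon in sorted(forward_table):
    back_table[forward_table[codon]].append(codon)

back_table['Y']        # ['TAC', 'TAT']
len(back_table['M'])   # 1, methionine has a single codon
</pre>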
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/a-quick-codon-table/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Read FASTA files with BioPython</title>
		<link>http://scienceoss.com/read-fasta-files-with-biopython/</link>
		<comments>http://scienceoss.com/read-fasta-files-with-biopython/#comments</comments>
		<pubDate>Thu, 06 Dec 2007 02:30:10 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=49</guid>
		<description><![CDATA[Here are several ways to parse a FASTA file into BioPython Seq objects. Getting a FASTA file into Python is as simple as importing the necessary functions from BioPython, opening the file, and calling a parser on the file. Then you have sequences in BioPython that can be readily used. There are a couple of [...]]]></description>
			<content:encoded><![CDATA[<p>Here are several ways to parse a FASTA file into BioPython <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16">Seq</a> objects.</p>
<p>Getting a FASTA file into Python is as simple as importing the necessary functions from BioPython, opening the file, and calling a parser on the file. Then you have sequences in BioPython that can be readily used. There are a couple of different ways to parse the file, depending on your preference. Choices are: <span id="more-49"></span></p>
<ul>
<li><strong>iterator</strong> (saves memory, but only one sequence in memory at a time so you can&#8217;t randomly choose one)</li>
<li><strong>list</strong> (takes up memory, but you can access each sequence using an index)</li>
<li><strong>dictionary</strong> (takes up memory and a few more lines of code to make, but you can access each sequence by its accession number rather than its position in a list)</li>
</ul>
<h2>Do something to each record in a FASTA file (iterator method)</h2>
<p>Use the following code when you want to operate on each sequence in a FASTA file and then move on to the next one. This brings in one sequence at a time, operates on it, then discards it. It is very memory efficient, but the sequences are not retained in Python.</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
f = open(FASTAfilename)
# each item yielded by the parser is a SeqRecord
for record in SeqIO.parse(f, 'fasta'):
    print(record.id)
    print(record.seq)
    print(len(record.seq))
f.close()
</code></pre>
<h2>Make a list of <span class="c">Seq</span> objects from a FASTA file (list method)</h2>
<p>This returns a list of sequence objects, <span class="c">seq_list</span>, from the fasta file. This is not as memory efficient as the code above, but the sequences are available in the list after parsing the file.</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
f=open(FASTAfilename)
seq_list=list(SeqIO.parse(f,"fasta"))
f.close()
</code></pre>
<h2>Import into a flat database (dictionary method)</h2>
<p>Sometimes it&#8217;s more useful to access a sequence by its accession number rather than by its position in a list. Use the <span class="c">SeqIO.to_dict()</span> function to do this. However, using this function directly results in the keys of the dictionary being the whole description line of a FASTA record (this behavior is different for GenBank files). To change that, you can provide your own key function (anything that operates on a <span class="c">SeqRecord</span> object). Whatever this function returns will be the key for the record in the dictionary. Below is a function to return the accession number of a record.</p>
<pre class="prettyprint"><code class="code">
def get_accession(record):
    """Given a SeqRecord, return the accession number as a string
        e.g. gi|2765613|emb|Z78488.1|PTZ78488 -> Z78488.1    """
    parts=record.id.split("|")
    assert len(parts)==5 and parts[0]=="gi" and parts[2]=="emb"
    return parts[3]
</code></pre>
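<p>A key function like this fails with an <span class="c">AssertionError</span> as soon as one record has a differently shaped id. A more forgiving variant could fall back to the full id instead. The sketch below is an assumption, not part of BioPython: <span class="c">accession_or_id</span> and the <span class="c">FakeRecord</span> stand-in (which only mimics the <span class="c">id</span> attribute of a <span class="c">SeqRecord</span>) are hypothetical names.</p>
<pre class="prettyprint"><code class="code">
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord: only the id attribute is needed.
FakeRecord = namedtuple('FakeRecord', ['id'])

def accession_or_id(record):
    """Return the accession from a gi|...|db|accession|name id,
    or the whole id when it does not have that shape."""
    parts = record.id.split('|')
    if len(parts) == 5 and parts[0] == 'gi':
        return parts[3]
    return record.id

accession_or_id(FakeRecord('gi|2765613|emb|Z78488.1|PTZ78488'))  # 'Z78488.1'
accession_or_id(FakeRecord('my_sequence'))                       # 'my_sequence'
</code></pre>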
<p>Then, provide the <span class="c">SeqIO.to_dict()</span> function below with the <span class="c">get_accession()</span> function defined above. The result is a dictionary of sequence objects, keyed by accession (or whatever you told your key function to return &#8212; species perhaps? Or GI number?)</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
f=open(FASTAfilename)
d=SeqIO.to_dict(SeqIO.parse(f,"fasta"),key_function=get_accession)
f.close()
</code></pre>
<p>Now you have a dictionary of FASTA records, keyed by accession number, which you can use like so:</p>
<pre class="prettyprint"><code class="code">
d['Z78488.1']  # --> a SeqRecord object (its sequence is d['Z78488.1'].seq)
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/read-fasta-files-with-biopython/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working with BioPython Sequences</title>
		<link>http://scienceoss.com/working-with-biopython-sequences/</link>
		<comments>http://scienceoss.com/working-with-biopython-sequences/#comments</comments>
		<pubDate>Mon, 03 Dec 2007 02:26:36 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=45</guid>
		<description><![CDATA[Much of BioPython uses Seq objects for dealing with sequences of all kinds. Here&#8217;s how to create, get the complement, transcribe, and translate sequences, either from scratch or from a FASTA or GenBank file. Create a sequence This is the most basic way of getting a sequence. Usually sequences will come in as GenBank or [...]]]></description>
			<content:encoded><![CDATA[<p>Much of BioPython uses <span class="c">Seq</span> objects for dealing with sequences of all kinds. Here&#8217;s how to create, get the complement, transcribe, and translate sequences, either from scratch or from a FASTA or GenBank file.<span id="more-45"></span></p>
<h2>Create a sequence</h2>
<p>This is the most basic way of getting a sequence. Usually sequences will come in as GenBank or Fasta files, but it&#8217;s useful to know how to create a sequence from scratch.</p>
<p>Sequences need data (the sequence) and an alphabet. The alphabet is what distinguishes a DNA sequence from an amino acid sequence or RNA sequence (see <a href="#seqalpha">sequence alphabets</a> below).</p>
<h2>Create a nucleotide sequence from scratch</h2>
<p>Note that once a sequence is created, it cannot be changed. This ensures that code that manipulates sequences will not alter them by accident. If you do want to change a sequence (say, change the first A to a T), then you must create a <span class="c">MutableSeq</span>.</p>
<pre class="prettyprint"><code class="code">
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

s = Seq('ACGTACTGGCATGTGCA', IUPAC.unambiguous_dna)
</code></pre>
<p>What makes this sequence a DNA sequence is the <span class="c">unambiguous_dna</span> alphabet.  BioPython now knows ACGT stands for the nucleotides (adenine, cytosine, guanine, and thymine), not the amino acids (alanine, cysteine, glycine, and threonine).</p>
<h2>Create a protein sequence from scratch</h2>
<p>As you might guess, this works the same way as for a nucleotide sequence, but with a protein alphabet instead.</p>
<pre class="prettyprint"><code class="code">
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

s = Seq('ACGTACTGGCATGTGCA',IUPAC.protein)
</code></pre>
<p>To create another kind of sequence, use another alphabet (see <a href="#seqalpha">sequence alphabets</a> below for other alphabets).</p>
<h2>Get the complement of a sequence</h2>
<p>After creating the sequence, s, from above, use the <span class="c">complement()</span> method of the <span class="c">Seq</span> object.</p>
<pre class="prettyprint"><code class="code">
s.complement()
</code></pre>
<pre class="output">
<strong>Output:</strong>
Seq('TGCATGACCGTACACGT', IUPACUnambiguousDNA())
</pre>
<p>Note that the resulting sequence has the same alphabet as the original sequence.</p>
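<p>For unambiguous DNA, what <span class="c">complement()</span> does can be sketched in plain Python with a lookup table. This is an illustration of the idea, not BioPython&#8217;s implementation; the names <span class="c">PAIRS</span>, <span class="c">complement</span> and <span class="c">reverse_complement</span> below are local to the example.</p>
<pre class="prettyprint"><code class="code">
# Watson-Crick pairing for the four unambiguous DNA letters.
PAIRS = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def complement(dna):
    """Return the complement of an unambiguous DNA string."""
    return ''.join(PAIRS[base] for base in dna)

def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]

complement('ACGTACTGGCATGTGCA')  # 'TGCATGACCGTACACGT'
</code></pre>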
<h2>Transcribe a sequence</h2>
<p>The BioPython <span class="c">Transcribe</span> package can transcribe sequences.</p>
<pre class="prettyprint"><code class="code">
from Bio import Transcribe
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

trans=Transcribe.unambiguous_transcriber

s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)

rna = trans.transcribe(s)
</code></pre>
<p>The output, <span class="c">rna</span>, is a <span class="c">Seq</span> object:</p>
<pre class="output">
<strong>Output:</strong>
Seq('ACGUACUGGCAUGUGCA', IUPACUnambiguousRNA())
</pre>
<p>There is also an <span class="c">ambiguous_dna</span> transcriber.</p>
<h2>Translate a sequence</h2>
<p>Here&#8217;s how to translate a DNA sequence into a peptide, using NCBI translation table 11 (the bacterial table):</p>
<pre class="prettyprint"><code class="code">
from Bio import Translate
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

trans=Translate.unambiguous_dna_by_id[11]
s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)
pep=trans.translate(s)
</code></pre>
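<p>To make the translation step concrete, here is a plain-Python sketch that builds a codon table from the 64-letter amino acid string NCBI uses for the standard code (table 11 assigns the same amino acids and differs only in its start codons). The names <span class="c">CODON_TABLE</span> and <span class="c">translate</span> are local to this example, and trailing bases that do not fill a codon are simply dropped, which is an assumption of this sketch.</p>
<pre class="prettyprint"><code class="code">
from itertools import product

BASES = 'TCAG'
# Amino acids for the 64 codons in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')

# product() yields TTT, TTC, TTA, TTG, TCT, ... in the same order.
CODON_TABLE = dict(zip((''.join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))

def translate(dna):
    """Translate complete codons of dna; leftover bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return ''.join(CODON_TABLE[dna[i:i+3]] for i in range(0, usable, 3))

translate('ACGTACTGGCATGTGCA')  # 'TYWHV'
</code></pre>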
<h2>Sequence Alphabets<a name="seqalpha" title="seqalpha" id="seqalpha"></a></h2>
<p>When working with BioPython sequences, sometimes you have to specify the alphabet of the sequence. The alphabets for creating sequences (or setting the alphabet of an imported FASTA sequence) can be found in the <span class="c">Bio.Alphabet.IUPAC</span> module. IUPAC stands for the International Union of Pure and Applied Chemistry, an organization which defined standard nomenclature for things like nucleic acids and amino acids (references <a href="http://www.iupac.org/reports/1980/5209lide/index.html">here</a>).</p>
<pre class="prettyprint"><code class="code">
from Bio.Alphabet import IUPAC
my_alphabet = IUPAC.unambiguous_dna
</code></pre>
<ul>
<li class="c">unambiguous_dna</li>
<li style="list-style: none">
<ul>
<li class="c">GATC</li>
</ul>
</li>
<li class="c">ambiguous_dna</li>
<li style="list-style: none">
<ul>
<li class="c">GATCRYWSMKHBVDN</li>
</ul>
</li>
<li class="c">protein</li>
<li style="list-style: none">
<ul>
<li class="c">ACDEFGHIKLMNPQRSTVWY</li>
</ul>
</li>
<li class="c">extended_protein</li>
<li style="list-style: none">
<ul>
<li class="c">ACDEFGHIKLMNPQRSTVWYBXZ</li>
</ul>
</li>
<li class="c">unambiguous_rna</li>
<li style="list-style: none">
<ul>
<li class="c">GAUC</li>
</ul>
</li>
<li class="c">ambiguous_rna</li>
<li style="list-style: none">
<ul>
<li class="c">GAUCRYWSMKHBVDN</li>
</ul>
</li>
<li class="c">extended_dna</li>
<li style="list-style: none">
<ul>
<li class="c">GATCBDSW</li>
</ul>
</li>
</ul>
<p>See the <a href="http://www.pasteur.fr/recherche/unites/sis/formation/python/apas06.html#f_alphabet_class">class diagram of BioPython alphabets</a> for more details.</p>
<p>There is also support for a reduced alphabet (<span class="c">from Bio.Alphabet import Reduce</span>). Check the source code for useful documentation on this alphabet, which combines physicochemically similar amino acids into a single letter.</p>
<h2>Examples of using Seq objects</h2>
<pre class="prettyprint"><code class="code">
s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)

# length of the sequence
len(s)

# number of A's
s.count('A')

# GC content
GC = s.count('G')+s.count('C')
GC_percent = float(GC) / len(s) * 100

# a subsequence
s[3:6]
</code></pre>
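<p>The GC calculation is easy to wrap in a reusable function. The sketch below works on plain strings (so it needs no BioPython); the name <span class="c">gc_percent</span> and the choice to return 0.0 for an empty sequence are assumptions made for this example.</p>
<pre class="prettyprint"><code class="code">
def gc_percent(seq):
    """Percentage of G and C letters in a sequence string (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count('G') + seq.count('C')
    return 100.0 * gc / len(seq)

gc_percent('ACGTACTGGCATGTGCA')  # 9 of 17 letters, roughly 52.9
</code></pre>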
<p>For more advanced operations, the <span class="c">Seq</span> object needs to be converted to a string.  Luckily, the <span class="c">tostring()</span> method of a <span class="c">Seq</span> object does just that.</p>
<pre class = "prettyprint"><code class = "code">
# find all occurrences where EITHER G or C is found before T

import re
matches = re.findall('[GC]T',s.tostring())
</code></pre>
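<p>If the positions of the matches matter as well, <span class="c">re.finditer()</span> returns match objects that carry them. This is a small sketch on a plain string; the variable names are just for illustration.</p>
<pre class="prettyprint"><code class="code">
import re

dna = 'ACGTACTGGCATGTGCA'

# Each match object knows where in the string it was found.
hits = [(m.start(), m.group()) for m in re.finditer('[GC]T', dna)]
# hits is [(2, 'GT'), (5, 'CT'), (12, 'GT')]
</code></pre>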
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/working-with-biopython-sequences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Run BLAST from BioPython</title>
		<link>http://scienceoss.com/run-blast-from-biopython/</link>
		<comments>http://scienceoss.com/run-blast-from-biopython/#comments</comments>
		<pubDate>Sun, 02 Dec 2007 23:35:21 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=44</guid>
		<description><![CDATA[BioPython can send sequences directly to NCBI and will download the results when they&#8217;re ready. Once you have the output, you can parse it to, say, just pull out the hits with the top 10 lowest E values. Or generate a list of the species most frequently found in the results . . . or [...]]]></description>
			<content:encoded><![CDATA[<p>BioPython can send sequences directly to NCBI and will download the results when they&#8217;re ready.  Once you have the output, you can parse it to, say, just pull out the hits with the top 10 lowest E values.  Or generate a list of the species most frequently found in the results . . . or other useful things.<span id="more-44"></span></p>
<p>There are two separate parts to consider when you want to get blast results: running the BLAST search, and parsing the results. If you prefer, you can run a BLAST search on NCBI the normal way through NCBI&#8217;s web site. Just make sure to save the results as an XML file. Then you can skip to Part 2.</p>
<p>I&#8217;m going to assume you have a sequence of interest called <span class="c">sequence</span>. It should be in a string format (if you have a <span class="c">Seq</span> object, use <span class="c">Seq.data</span> to get the string. If you have a <span class="c">SeqRecord</span> object, use <span class="c">SeqRecord.seq.data</span> to get the string). Need a sequence? Here&#8217;s one to play around with.</p>
<pre class="prettyprint"><code class="code">
# Homo sapiens hemoglobin, beta subunit
sequence = """MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH"""

# remove newlines, just in case
sequence = sequence.replace("\n","")
</code></pre>
<h2>Part 1: Running the BLAST search</h2>
<p>The sequence is sent to NCBI&#8217;s QBLAST server for a blastp search (the protein version of BLAST) using the non-redundant (nr) database. When the search is done, we get a file-like object in return. The great part: Python takes care of the waiting, and downloads it as soon as it&#8217;s ready.</p>
<pre class="prettyprint"><code class="code">
from Bio.Blast import NCBIWWW

# this is the actual blast search, so it may take a moment
blast_handle = NCBIWWW.qblast('blastp', 'nr', sequence)
</code></pre>
<p><strong>Done!</strong> Now you&#8217;re ready for parsing the results (see Part 2 below). The rest of this section goes over optional saving of the blast results to a string or file.</p>
<h3>Some optional bits</h3>
<p><strong>WARNING:</strong> <span class="c">blast_handle</span> is a <span class="c">StringIO</span> object, an in-memory buffer that acts like a file. Like files, <span class="c">StringIO</span> objects keep track of a read position, so their contents can only be read through once. If you were to read <span class="c">blast_handle</span> (for example with <span class="c">blast_handle.read()</span>), it would run through its contents. If you tried reading it again, you would get nothing back, because the first read already left the position at the end. How do you fix this? Like so:</p>
<pre class="prettyprint"><code class="code">
# rewind blast_handle back to the beginning
blast_handle.seek(0)
</code></pre>
<p>Alternatively, you could save it as a string:</p>
<pre class="prettyprint"><code class="code">
# optional: save the result as a string (rewind to the beginning first)
blast_handle.seek(0)
blast_string = blast_handle.read()
</code></pre>
<p>Here&#8217;s how to save it as a file on disk:</p>
<pre class="prettyprint"><code class="code">
# optionally save blast_handle as a file for later use
blast_handle.seek(0)
blast_file = open('blast-output.xml', 'w')
blast_file.write(blast_handle.read())
blast_file.close()
</code></pre>
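<p>The read-once behaviour is easy to try out without running a BLAST search. The sketch below uses a small <span class="c">StringIO</span> object from the standard library as a stand-in for <span class="c">blast_handle</span>; the text it holds is made up.</p>
<pre class="prettyprint"><code class="code">
from io import StringIO

# Stand-in for blast_handle, holding two lines of made-up text.
handle = StringIO(u'first line\nsecond line\n')

first_read = handle.read()    # the whole contents
second_read = handle.read()   # empty string: the position is at the end

handle.seek(0)                # rewind to the beginning
third_read = handle.read()    # the whole contents again
</code></pre>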
<h2>Part 2: Analyzing the BLAST results</h2>
<h3>Open a file or use a StringIO object</h3>
<p>If you have an XML file that you&#8217;re going to parse (either from the NCBI web site or from saving the results from above), first you need to open the file:</p>
<pre class="prettyprint"><code class="code">
# If you're using a file on disk:
filename = 'blast-output.xml'
parse_me = open(filename)
</code></pre>
<p>Otherwise, you can use the blast_handle from above. To keep the code consistent, I&#8217;ll set up a parse_me variable:</p>
<pre class="prettyprint"><code class="code">
# If you're using the results directly from the BLAST query:
parse_me = blast_handle
parse_me.seek(0)
</code></pre>
<h3>Parse it</h3>
<p>OK, let&#8217;s parse that puppy:</p>
<pre class="prettyprint"><code class="code">
from Bio.Blast import NCBIXML
records = NCBIXML.parse(parse_me)
</code></pre>
<p><span class="c">records</span> is an iterator. Each item corresponds to a record in the BLAST results. If you used the NCBI web site, you could have pasted in several FASTA records. In this case, <span class="c">records</span> will have several records in it. If, on the other hand, you sent a single sequence to QBLAST as in Part 1, then <span class="c">records</span> will only contain a single record.</p>
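<p>From here, pulling out the hits with the lowest E values is a matter of walking the records. In BioPython&#8217;s BLAST record objects, each record has a list of <span class="c">alignments</span>, each alignment has a <span class="c">title</span> and a list of <span class="c">hsps</span>, and each HSP has an <span class="c">expect</span> value. The helper below is a sketch built on those attributes; the name <span class="c">lowest_e_values</span> is made up for this example. Note that it consumes the <span class="c">records</span> iterator, so it can only be called once per parse.</p>
<pre class="prettyprint"><code class="code">
def lowest_e_values(records, n=10):
    """Return the n (E value, alignment title) pairs with the lowest E values."""
    hits = []
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                hits.append((hsp.expect, alignment.title))
    hits.sort()       # tuples sort by E value first, then by title
    return hits[:n]

# Usage, assuming records comes from NCBIXML.parse() as above:
# for e_value, title in lowest_e_values(records):
#     print("%g  %s" % (e_value, title))
</code></pre>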
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/run-blast-from-biopython/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parse GenBank files with BioPython</title>
		<link>http://scienceoss.com/parse-genbank-files-with-biopython/</link>
		<comments>http://scienceoss.com/parse-genbank-files-with-biopython/#comments</comments>
		<pubDate>Sun, 02 Dec 2007 22:35:32 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[accession]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[GenBank]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[records]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=43</guid>
		<description><![CDATA[BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them. Single-record GenBank files Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the &#8220;Iterate over several records&#8221; method below. # Read a single [...]]]></description>
			<content:encoded><![CDATA[<p>BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them.<span id="more-43"></span></p>
<h3>Single-record GenBank files</h3>
<p>Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the &#8220;Iterate over several records&#8221; method below.</p>
<pre class = "prettyprint"><code class = "code">
# Read a single GenBank record in a file into BioPython

from Bio import GenBank
feature_parser = GenBank.FeatureParser()  #create the parser object
gb_file = "AE017199.gbk"  #specify a genbank file

# Note the parser needs an open file object.
gb_record = feature_parser.parse(open(gb_file,"r"))</code></pre>
<h3>Iterate over several records</h3>
<p>If there are several records in the file, then you can iterate over them. Here&#8217;s how:</p>
<pre class = "prettyprint"><code class = "code">
# Iterate over multiple GenBank records in a single file.

from Bio import GenBank

# open the GenBank file
gb_file = "cor6_6.gb"
gb_handle = open(gb_file, "r")

feature_parser = GenBank.FeatureParser()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
<a name="code1" title="code1" id="code1"></a>
while True:
    rec = gb_iterator.next()  <a href="#note1">#see Note 1</a>
    if rec is None:
        break
    # whatever you want to do to the sequence goes here.
    # In this example, the name, number of features,
    # and sequence itself are printed.
    print "Name: %s, %i features" % (rec.name, len(rec.features))
    print rec.seq
</code></pre>
<p><a name="note1" title="note1" id="note1"></a></p>
<p class="codeNote">Note 1: the next() method grabs the next item in the iterator. If there&#8217;s nothing left in the iterator (that is, it has already returned its last item), then it returns None. Iterators are very memory efficient but need a little extra code to avoid errors. <a href="#code1">back to code</a></p>
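<p>The fetch-until-None loop from Note 1 has a close relative in plain Python: the built-in <span class="c">next()</span> accepts a default value to return once an iterator is exhausted. The sketch below uses a made-up generator, <span class="c">fake_records</span>, as a stand-in for the GenBank iterator, and shows that a <span class="c">for</span> loop does the same bookkeeping implicitly.</p>
<pre class="prettyprint"><code class="code">
def fake_records():
    """Stand-in for a record iterator; yields three made-up record names."""
    for name in ['rec1', 'rec2', 'rec3']:
        yield name

it = fake_records()
seen = []
while True:
    rec = next(it, None)   # None once the iterator is exhausted
    if rec is None:
        break
    seen.append(rec)

# A for loop stops by itself when the iterator runs out.
seen_again = [rec for rec in fake_records()]
</code></pre>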
<h3>Parse a GenBank file into a dictionary</h3>
<p>By parsing a GenBank file into a dictionary, you can access records by specifying their accession number, like so:</p>
<pre class = "prettyprint"><code class = "code">
#Parse a GenBank file into a Python dictionary

from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
</code></pre>
<h3>Index a GenBank record by protein ID&nbsp;</h3>
<p>This useful function is from <a href="http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/">Peter Cock&#8217;s</a> section on BioPython.</p>
<pre class = "prettyprint"><code class = "code">
def index_genbank_features(gb_record, feature_type, qualifier) :
    answer = dict()
    for (index, feature) in enumerate(gb_record.features):
        if feature.type==feature_type:
            if qualifier in feature.qualifiers:
                for value in feature.qualifiers[qualifier]:
                    if value in answer:
                        # report both feature indices that share this key
                        print("WARNING - Duplicate key %s for %s features %i and %i"
                              % (value, feature_type, answer[value], index))
                    else:
                        answer[value] = index
    return answer</code></pre>
<p>It&#8217;s used like this:</p>
<pre class = "prettyprint"><code class = "code">
GBindex = index_genbank_features(gb_record,"CDS","protein_id")
print GBindex['AP0001']
</code></pre>
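<p>If <span class="c">GBindex['AP0001']</span> is 19, then <span class="c">gb_record.features[19]</span> is the corresponding CDS feature for that protein id. Tie it all together to get the sequence of the protein, which GenBank files usually store in the feature&#8217;s <span class="c">translation</span> qualifier:</p>
<pre class = "prettyprint"><code class = "code">
gb_record.features[GBindex['AP0001']].qualifiers['translation'][0]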
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/parse-genbank-files-with-biopython/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

