<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>scienceoss.com &#187; parsing</title>
	<atom:link href="http://scienceoss.com/tags/parsing/feed/" rel="self" type="application/rss+xml" />
	<link>http://scienceoss.com</link>
	<description>useful tidbits for using open source software in science</description>
	<lastBuildDate>Wed, 26 May 2010 03:34:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Read FASTA files with BioPython</title>
		<link>http://scienceoss.com/read-fasta-files-with-biopython/</link>
		<comments>http://scienceoss.com/read-fasta-files-with-biopython/#comments</comments>
		<pubDate>Thu, 06 Dec 2007 02:30:10 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=49</guid>
		<description><![CDATA[Here are several ways to parse a FASTA file into BioPython Seq objects. Getting a FASTA file into Python is as simple as importing the necessary functions from BioPython, opening the file, and calling a parser on the file. Then you have sequences in BioPython that can be readily used. There are a couple of [...]]]></description>
			<content:encoded><![CDATA[<p>Here are several ways to parse a FASTA file into BioPython <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc16">Seq</a> objects.</p>
<p>Getting a FASTA file into Python is as simple as importing the necessary functions from BioPython, opening the file, and calling a parser on the file. Then you have sequences in BioPython that can be readily used. There are a couple of different ways to parse the file, depending on your preference. Choices are: <span id="more-49"></span></p>
<ul>
<li><strong>iterator</strong> (saves memory, but only one sequence is in memory at a time, so you can&#8217;t jump to an arbitrary one)</li>
<li><strong>list</strong> (takes up memory, but you can access each sequence using an index)</li>
<li><strong>dictionary</strong> (takes up memory and a few more lines of code to make, but you can access each sequence by its accession number rather than its position in a list)</li>
</ul>
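<p>The trade-offs above can be sketched without BioPython. In this minimal sketch, <span class="c">read_fasta</span> is a hypothetical pure-Python stand-in for a FASTA parser that yields plain (id, sequence) tuples rather than BioPython objects, and the FASTA text is made up for illustration.</p>

```python
import io

def read_fasta(handle):
    """Hypothetical stand-in parser: yield (id, sequence) tuples one at a time."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The id is taken to be the first word after ">".
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Made-up example data.
FASTA_TEXT = ">seq1 first example\nACGT\nAC\n>seq2 second example\nGGCC\n"

# Iterator: one record at a time; only the lengths are kept.
lengths = [len(seq) for _, seq in read_fasta(io.StringIO(FASTA_TEXT))]

# List: every record stays in memory and is reached by position.
records = list(read_fasta(io.StringIO(FASTA_TEXT)))

# Dictionary: every record stays in memory and is reached by id.
by_id = dict(read_fasta(io.StringIO(FASTA_TEXT)))
```

<p>The three forms hold the same data. They differ only in how much is kept in memory and how a record is looked up afterwards.</p>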
<h2>Do something to each record in a FASTA file (iterator method)</h2>
<p>Use the following code when you want to operate on each sequence in a FASTA file and then move on to the next one. This brings in one sequence at a time, operates on it, then discards it. It is very memory efficient, but the sequences are not retained in Python.</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
# FASTAfilename is the path to your FASTA file
f = open(FASTAfilename)
for i in SeqIO.parse(f, 'fasta'):
    print(i.id)
    print(i.seq)
    print(len(i.seq))
f.close()
</code></pre>
<h2>Make a list of <span class="c">SeqRecord</span> objects from a FASTA file (list method)</h2>
<p>This returns a list of <span class="c">SeqRecord</span> objects, <span class="c">seq_list</span>, from the FASTA file. This is not as memory efficient as the code above, but the sequences are available in the list after parsing the file.</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
f=open(FASTAfilename)
seq_list=list(SeqIO.parse(f,"fasta"))
f.close()
</code></pre>
<h2>Import into a flat database (dictionary method)</h2>
<p>Sometimes it&#8217;s more useful to access a sequence by its accession number rather than by its position in a list. Use the <span class="c">SeqIO.to_dict()</span> function to do this. However, using this function directly makes each dictionary key the record&#8217;s full identifier, which for an NCBI-style FASTA record is the whole pipe-delimited string such as <span class="c">gi|2765613|emb|Z78488.1|PTZ78488</span> (this behavior is different for GenBank files). To change that, you can provide your own key function (anything that operates on a <span class="c">SeqRecord</span> object). Whatever this function returns will be the key for the record in the dictionary. Below is a function to return the accession number of a record.</p>
<pre class="prettyprint"><code class="code">
def get_accession(record):
    """Given a SeqRecord, return the accession number as a string
        e.g. gi|2765613|emb|Z78488.1|PTZ78488 -> Z78488.1    """
    parts=record.id.split("|")
    assert len(parts)==5 and parts[0]=="gi" and parts[2]=="emb"
    return parts[3]
</code></pre>
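<p>To see what the key function returns without loading a file, it can be tried on a stand-in object. In this sketch, <span class="c">FakeRecord</span> is a hypothetical placeholder that only mimics the <span class="c">.id</span> attribute of a <span class="c">SeqRecord</span>. The function body is repeated so the snippet runs on its own.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord: only the .id attribute is needed here.
FakeRecord = namedtuple("FakeRecord", ["id"])

def get_accession(record):
    """Given a record with an NCBI-style id, return the accession as a string."""
    parts = record.id.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]

rec = FakeRecord(id="gi|2765613|emb|Z78488.1|PTZ78488")
accession = get_accession(rec)

# An id in any other layout trips the assert instead of silently
# producing a wrong dictionary key.
try:
    get_accession(FakeRecord(id="Z78488.1"))
    rejected = False
except AssertionError:
    rejected = True
```

<p>The assert makes the function strict on purpose: it only accepts five-part ids of the form shown in the docstring, so records from other databases would need their own key function.</p>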
<p>Then, provide the <span class="c">SeqIO.to_dict()</span> function below with the <span class="c">get_accession()</span> function defined above. The result is a dictionary of sequence objects, keyed by accession (or whatever you told your key function to return &#8212; species perhaps? Or GI number?)</p>
<pre class="prettyprint"><code class="code">
from Bio import SeqIO
f=open(FASTAfilename)
d=SeqIO.to_dict(SeqIO.parse(f,"fasta"),key_function=get_accession)
f.close()
</code></pre>
<p>Now you have a dictionary of FASTA records, keyed by accession number, which you can use like so:</p>
<pre class="prettyprint"><code class="code">
d['Z78488.1']  # --> a SeqRecord object; use d['Z78488.1'].seq for the sequence
</code></pre>
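<p>Conceptually, building a dictionary with a key function amounts to the loop below. This is a rough pure-Python approximation for illustration, not BioPython&#8217;s actual implementation. <span class="c">to_dict_sketch</span>, <span class="c">FakeRecord</span> and the example ids are all hypothetical.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord with just an id and a sequence.
FakeRecord = namedtuple("FakeRecord", ["id", "seq"])

def to_dict_sketch(records, key_function=lambda rec: rec.id):
    """Build {key: record}, refusing duplicate keys rather than overwriting."""
    result = {}
    for rec in records:
        key = key_function(rec)
        if key in result:
            raise ValueError("Duplicate key %r" % key)
        result[key] = rec
    return result

# Made-up records with NCBI-style ids.
records = [
    FakeRecord("gi|1|emb|A1.1|X", "ACGT"),
    FakeRecord("gi|2|emb|B2.1|Y", "GGCC"),
]

# Key each record by the fourth pipe-delimited field (the accession).
d = to_dict_sketch(records, key_function=lambda rec: rec.id.split("|")[3])
```

<p>Because every record is stored in the dictionary, this approach suits files that fit comfortably in memory.</p>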
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/read-fasta-files-with-biopython/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parse GenBank files with BioPython</title>
		<link>http://scienceoss.com/parse-genbank-files-with-biopython/</link>
		<comments>http://scienceoss.com/parse-genbank-files-with-biopython/#comments</comments>
		<pubDate>Sun, 02 Dec 2007 22:35:32 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[BioPython]]></category>
		<category><![CDATA[accession]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[GenBank]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[records]]></category>

		<guid isPermaLink="false">http://scienceoss.com/?p=43</guid>
		<description><![CDATA[BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them. Single-record GenBank files Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the &#8220;Iterate over several records&#8221; method below. # Read a single [...]]]></description>
			<content:encoded><![CDATA[<p>BioPython handles GenBank files nicely. Here are a couple of ways of getting them into Python and working with them.<span id="more-43"></span></p>
<h3>Single-record GenBank files</h3>
<p>Use this method if there is only a single record in the GenBank file. If there are multiple records, then use the &#8220;Iterate over several records&#8221; method below.</p>
<pre class = "prettyprint"><code class = "code">
# Read a single GenBank record in a file into BioPython

from Bio import GenBank
feature_parser = GenBank.FeatureParser()  #create the parser object
gb_file = "AE017199.gbk"  #specify a genbank file

# Note the parser needs an open file object.
gb_record = feature_parser.parse(open(gb_file,"r"))</code></pre>
<h3>Iterate over several records</h3>
<p>If there are several records in the file, then you can iterate over them. Here&#8217;s how:</p>
<pre class = "prettyprint"><code class = "code">
# Iterate over multiple GenBank records in a single file.

from Bio import GenBank

# open the GenBank file
gb_file = "cor6_6.gb"
gb_handle = open(gb_file, "r")

feature_parser = GenBank.FeatureParser()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
<a name="code1" title="code1" id="code1"></a>
while True:
    rec = gb_iterator.next()  <a href="#note1">#see Note 1</a>
    if rec is None:
        break
    # whatever you want to do to the sequence goes here.
    # In this example, the name, number of features,
    # and sequence itself are printed.
    print("Name: %s, %i features" % (rec.name, len(rec.features)))
    print(rec.seq)
</code></pre>
<p><a name="note1" title="note1" id="note1"></a></p>
<p class="codeNote">Note 1: this iterator&#8217;s next() method grabs the next item. If there&#8217;s nothing left in the iterator (that is, it has already returned its last item) then it returns None. Iterators are very memory efficient but need a little extra code to detect the end. <a href="#code1">back to code</a></p>
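<p>The loop pattern in Note 1 can be tried on its own. In this sketch, <span class="c">NoneTerminatedReader</span> is a hypothetical stand-in for any object whose <span class="c">next()</span> method returns None when it runs out. It is not part of BioPython.</p>

```python
class NoneTerminatedReader:
    """Hypothetical reader whose next() returns None once the items run out."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item

# Explicit loop: stop as soon as the sentinel value None appears.
reader = NoneTerminatedReader(["rec1", "rec2", "rec3"])
seen = []
while True:
    rec = reader.next()
    if rec is None:
        break
    seen.append(rec)

# The built-in iter(callable, sentinel) expresses the same loop more compactly.
reader = NoneTerminatedReader(["rec1", "rec2", "rec3"])
seen_again = [rec for rec in iter(reader.next, None)]
```

<p>Both forms stop at the first None, so they only work when None can never be a legitimate item.</p>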
<h3>Parse a GenBank file into a dictionary</h3>
<p>By parsing a GenBank file into a dictionary, you can access records by specifying their accession number, like so:</p>
<pre class = "prettyprint"><code class = "code">
#Parse a GenBank file into a Python dictionary

from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
</code></pre>
<h3>Index a GenBank record by protein ID</h3>
<p>This useful function is from <a href="http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/">Peter Cock&#8217;s</a> section on BioPython.</p>
<pre class = "prettyprint"><code class = "code">
def index_genbank_features(gb_record, feature_type, qualifier) :
    answer = dict()
    for (index, feature) in enumerate(gb_record.features):
        if feature.type==feature_type:
            if qualifier in feature.qualifiers:
                for value in feature.qualifiers[qualifier]:
                    if value in answer:
                        # Report both feature positions that share this key.
                        print("WARNING - Duplicate key %s for %s features %i and %i"
                              % (value, feature_type, answer[value], index))
                    else:
                        answer[value] = index
    return answer</code></pre>
<p>It&#8217;s used like this:</p>
<pre class = "prettyprint"><code class = "code">
GBindex = index_genbank_features(gb_record,"CDS","protein_id")
print(GBindex['AP0001'])
</code></pre>
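<p>If <span class="c">GBindex['AP0001']</span> is 19, then <span class="c">gb_record.features[19]</span> is the corresponding CDS feature for that protein ID. Tie it all together to get the sequence of the protein, assuming the feature carries a <span class="c">translation</span> qualifier, as CDS features in GenBank files usually do:</p>
<pre class = "prettyprint"><code class = "code">
# Qualifier values are lists, so take the first (and normally only) entry.
gb_record.features[GBindex['AP0001']].qualifiers['translation'][0]
</code></pre>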
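<p>The indexing idea can be checked on made-up data. In this sketch, <span class="c">FakeFeature</span> and <span class="c">FakeRecord</span> are hypothetical stand-ins that mimic only the <span class="c">type</span>, <span class="c">qualifiers</span> and <span class="c">features</span> attributes. The protein IDs and translations are invented, and <span class="c">index_features</span> is a condensed variant of the function above.</p>

```python
from collections import namedtuple

# Hypothetical stand-ins for a feature and a record.
FakeFeature = namedtuple("FakeFeature", ["type", "qualifiers"])
FakeRecord = namedtuple("FakeRecord", ["features"])

def index_features(record, feature_type, qualifier):
    """Map each qualifier value to the position of its feature in record.features."""
    answer = {}
    for index, feature in enumerate(record.features):
        if feature.type == feature_type:
            for value in feature.qualifiers.get(qualifier, []):
                if value in answer:
                    print("WARNING - Duplicate key %s for %s features %i and %i"
                          % (value, feature_type, answer[value], index))
                else:
                    answer[value] = index
    return answer

# Invented features; qualifier values are lists, as in GenBank records.
record = FakeRecord(features=[
    FakeFeature("source", {}),
    FakeFeature("CDS", {"protein_id": ["AP0001"], "translation": ["MKV"]}),
    FakeFeature("gene", {"gene": ["abc"]}),
    FakeFeature("CDS", {"protein_id": ["AP0002"], "translation": ["MAL"]}),
])

index = index_features(record, "CDS", "protein_id")
feature = record.features[index["AP0002"]]
protein = feature.qualifiers["translation"][0]
```

<p>The index stores positions in the feature list, not positions in the sequence, which is why the lookup goes through <span class="c">features</span>.</p>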
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/parse-genbank-files-with-biopython/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

