Working with BioPython Sequences
Much of BioPython uses Seq objects for dealing with sequences of all kinds. Here’s how to create, get the complement, transcribe, and translate sequences, either from scratch or from a FASTA or GenBank file.
Create a sequence
This is the most basic way of getting a sequence. Usually sequences will come in as GenBank or Fasta files, but it’s useful to know how to create a sequence from scratch.
Sequences need data (the sequence) and an alphabet. The alphabet is what distinguishes a DNA sequence from an amino acid sequence or RNA sequence (see sequence alphabets below).
Create a nucleotide sequence from scratch
Note that once a sequence is created, it cannot be changed. This is to ensure that anything the code that manipulates sequences will not change them. If you do want to change a sequence (say, change the first A to a T), then you must create a Mutable Seq
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
s = Seq('ACGTACTGGCATGTGCA', IUPAC.unambiguous_dna)
What makes this sequence a DNA sequence is the unambiguous_dna alphabet. BioPython now knows ACGT stands for the nucleotides (adenine, cytosine, guanine, and thymine), not the amino acids (alanine, cysteine, glycine, and threonine).
Create a protein sequence from scratch
As you might guess, same as the nucleotide sequence, but use a protein alphabet instead.
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
s = Seq('ACGTACTGGCATGTGCA',IUPAC.protein)
To create another kind of sequence, use another alphabet (see sequence alphabets below for other alphabets).
Get the complement of a sequence
After creating the sequence, s, from above, use the complement() method of the Seq object.
s.complement()
Output:
Seq('TGCATGACCGTACACGT', IUPACUnambiguousDNA())
Note that the resulting sequence has the same alphabet as the original sequence.
Transcribe a sequence
The BioPython Transcribe package can transcribe sequences.
from Bio import Transcribe
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
trans=Transcribe.unambiguous_transcriber
s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)
rna = trans.transcribe(s)
The output, rna, is a Seq object:
Output:
Seq('ACGUACUGGCAUGUGCA', IUPACUnambiguousRNA())
There is also an ambiguous_dna transcriber.
Translate a sequence
Here’s how to translate a
from Bio import Translate
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
trans=Translate.unambiguous_dna_by_id[11]
s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)
pep=trans.translate(s)
Sequence Alphabets
When working with BioPython sequences, sometimes you have to specify the alphabet of the sequence. The alphabets for creating sequences (or setting the alphabet of an imported FASTA sequence) can be found in the Bio.Alphabet.IUPAC module. IUPAC stands for the International Union of Applied and Pure Chemistry, an organization which defined standard nomenclature for things like nucleic acids and amino acids (references here).
from Bio.Alphabet import IUPAC
my_alphabet = IUPAC.unambiguous_dna
- unambiguous_dna
-
- GATC
- ambiguous_dna
-
- GATCRYWSMKHBVDN
- protein
-
- ACDEFGHIKLMNPQRSTVWY
- extended_protein
-
- ACDEFGHIKLMNPQRSTVWYBXZ
- unambiguous_rna
-
- GAUC
- ambiguous_rna
-
- GAUCRYWSMKHBVDN
- extended_dna
-
- GATCBDSW
see class diagram of BioPython alphabets for more details.
There is also support for a reduced alphabet (from Bio.Alphabet import Reduce). Check the source code for useful documentation on this alphabet, which combines physiochemically similar amino acids into a single letter.
Examples of using Seq objects
s = Seq('ACGTACTGGCATGTGCA',IUPAC.unambiguous_dna)
# length of the sequence
len(s)
# number of A's
s.count('A')
# GC content
GC = s.count('G')+s.count('C')
GC_percent = float(GC) / len(s) * 100
# a subsequence
s[3:6]
For more advanced operations, the Seq object needs to be converted to a string. Luckily, the tostring() method of a Seq object does just that.
# find all occurences where EITHER G or C is found before T
import re
matches = re.findall('[GC]T',s.tostring())
Tags: BioPython