Monthly Archive for March, 2008

MySQLdb – accessing MySQL databases from Python

MySQL is a popular open-source database engine, and Python interfaces quite nicely with MySQL with the MySQLdb package. For more on why you would want to use a database for your data, check out this post. Here I’ll show you how to connect to your existing MySQL database with Python. Continue reading ‘MySQLdb – accessing MySQL databases from Python’

Restructure or reformat dataframes in R with melt

The basic idea is to take an R dataframe like this one containing abundance of three species at each site, and elevation at each site

site  sp1 sp2 sp3 elev
a      3   4   9   100
b      1   8   10  210
c      4   8   15  165

and reorganize into something like this (perhaps so we can do an ANOVA using species as a factor):

site  elev  sp  abundance
a     100  sp1  3
a     100  sp2  4
a     100  sp3  9
b     210  sp1  1
b     210  sp2  8
b     210  sp3  10
c     165  sp1  4
c     165  sp2  8
c     165  sp3  15

Assuming the first dataframe above is called d, the second dataframe can be obtained using the following code:

> library(ggplot2)
> m = melt(d, id=c('site','elev'))

melt works like this: You specify the ID variables, which are those variables that will REMAIN as dataframe variables. Any others will be considered measured variables. If it’s easier for your data, you can do it the other way: specify the measured variables and the others will be considered ID variables.

Melting results in two new variables, variable and value. variable contains the names of the original columns of the dataframe as factors, and value contains the corresponding values.

Another example

Here’s another example using the built-in dataset, airquality. First, unmelted:

# make all the variable names lowercase
names(airquality) <- tolower(names(airquality))
head(airquality)
  ozone solar.r wind temp month day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

and melted:

> head(melt(airquality,id=c('month','day')))
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
5     5   5    ozone    NA
6     5   6    ozone    28

Reorder factors for ggplot

A somewhat contrived example . . . first, illustrate the problem:

library(ggplot2)
ggplot(iris)+aes(x=Sepal.Width)+geom_histogram()+facet_grid(Species~.)

Unsorted histogram
How do we get these histograms to be better sorted? The following will reorder the factor variable, Species, by the mean of Sepal.Width:

iris$Species = reorder(iris$Species, iris$Sepal.Width, mean)
ggplot(iris)+aes(x=Sepal.Width)+geom_histogram()+facet_grid(Species~.)

Sorted histogram
Now the histograms are sorted by the mean sepal width.

Why should I use a database for my data?

Quick answer: when you have large amounts of inter-related data.

Long answer: Continue reading ‘Why should I use a database for my data?’

Convert None to NaN for use in NumPy arrays

I originally had the function below to convert values of None into values of NaN, but realized it is much faster and easier to coerce the array type:

x.astype(float)

Here’s the slow way to do it:

from numpy import array, nan
def None2NaN(x):
    """Converts None objects to nan, for use in
       NumPy arrays. Returns an array."""
    newlist = []
    for i in x:
        if i is not None:
            newlist.append(i)
        else:
            newlist.append(nan)
    return array(newlist)

“Vectorization” in NumPy

How do you get Matlab-like vectorization when using NumPy? The key is using parentheses when using logical operators. Here’s an example:

from numpy import *

a = array([1,2,3,4,5])
b = array([5,4,3,2,1])
c = array([1,2,2,2,1])

# ===The right way===

(a<5) & (b>3) & (c==2)
# array([ False,  True, False, False, False], dtype=bool)

# ===The wrong way===

a<5 & b>3 & c==2
# ValueError:
# The truth value of an array with more than
# one element is ambiguous. Use a.any() or a.all()

Alternatively, use NumPy’s logical_and and logical_or functions . . . but these functions only take two arguments at a time.

“unexpected indent” errors in Python

If you’ve used Python, you know that whitespace is used to delimit blocks of code.

But whitespace comes in two different flavors: tabs and spaces, and you have to pay attention to which one you are using: Continue reading ‘“unexpected indent” errors in Python’

Run a Matlab script from the shell or from another program

Yeah, I know Matlab is not open source, but this is useful to know anyway:

You can run Matlab in batch mode and pass a script to it from the system shell. If the m-file xyz.m is in the current directory, then run it from the system shell with:

matlab -nodesktop -nosplash -nojvm -r "xyz"

Note that there is no .m extension on xyz in this command . . . if you do put the extension you will get an error, “??? Undefined variable “xyz”" or class “xyz.m”.

Check the type of array in Numpy

from numpy import array
a = array([1,2,3])
a.dtype  # dtype('int32')
a.dtype.kind  # 'i', for 'integer'

s = array(['a','b','c'])
s.dtype  # dtype('|S1')
s.dtype.kind  # 'S' for 'string'

f = array([1., 2., 3.])
f.dtype  # dtype('float64')
f.dtype.kind  # 'f' for 'float'

Add or delete columns in a table in MySQL

Add a column to a table in the database:


ALTER TABLE table_name
ADD column_name datatype

Delete a column from a table in the database:


ALTER TABLE table_name
DROP COLUMN column_name