Posts Tagged ‘data management’

MySQLdb - accessing MySQL databases from Python

Monday, March 24th, 2008

MySQL is a popular open-source database engine, and Python interfaces quite nicely with MySQL with the MySQLdb package. For more on why you would want to use a database for your data, check out this post. Here I’ll show you how to connect to your existing MySQL database with Python. (more…)

Restructure or reformat dataframes in R with melt

Sunday, March 23rd, 2008

The basic idea is to take an R dataframe like this one containing abundance of three species at each site, and elevation at each site

site  sp1 sp2 sp3 elev
a      3   4   9   100
b      1   8   10  210
c      4   8   15  165

and reorganize into something like this (perhaps so we can do an ANOVA using species as a factor):

site  elev  sp  abundance
a     100  sp1  3
a     100  sp2  4
a     100  sp3  9
b     210  sp1  1
b     210  sp2  8
b     210  sp3  10
c     165  sp1  4
c     165  sp2  8
c     165  sp3  15

Assuming the first dataframe above is called d, the second dataframe can be obtained using the following code:

> library(ggplot2)
> m = melt(d, id=c('site','elev'))

melt works like this: You specify the ID variables, which are those variables that will REMAIN as dataframe variables. Any others will be considered measured variables. If it’s easier for your data, you can do it the other way: specify the measured variables and the others will be considered ID variables.

Melting results in two new variables, variable and value. variable contains the names of the original columns of the dataframe as factors, and value contains the corresponding values.

Another example

Here’s another example using the built-in dataset, airquality. First, unmelted:

# make all the variable names lowercase
names(airquality) <- tolower(names(airquality))
head(airquality)
  ozone solar.r wind temp month day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

and melted:

> head(melt(airquality,id=c('month','day')))
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
5     5   5    ozone    NA
6     5   6    ozone    28

Why should I use a database for my data?

Saturday, March 22nd, 2008

Quick answer: when you have large amounts of inter-related data.

Long answer: (more…)