<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>scienceoss.com &#187; data management</title>
	<atom:link href="http://scienceoss.com/tags/data-management/feed/" rel="self" type="application/rss+xml" />
	<link>http://scienceoss.com</link>
	<description>useful tidbits for using open source software in science</description>
	<lastBuildDate>Wed, 26 May 2010 03:34:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>MySQLdb &#8211; accessing MySQL databases from Python</title>
		<link>http://scienceoss.com/mysqldb-accessing-mysql-databases-from-python/</link>
		<comments>http://scienceoss.com/mysqldb-accessing-mysql-databases-from-python/#comments</comments>
		<pubDate>Mon, 24 Mar 2008 00:31:04 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[MySQLdb]]></category>
		<category><![CDATA[syntax]]></category>

		<guid isPermaLink="false">http://scienceoss.com/mysqldb-accessing-mysql-databases-from-python/</guid>
		<description><![CDATA[MySQL is a popular open-source database engine, and Python interfaces quite nicely with MySQL with the MySQLdb package. For more on why you would want to use a database for your data, check out this post. Here I&#8217;ll show you how to connect to your existing MySQL database with Python. Assumptions I&#8217;m assuming you have [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.mysql.com/">MySQL</a> is a popular open-source database engine, and Python interfaces quite nicely with MySQL with the <a href="http://sourceforge.net/projects/mysql-python">MySQLdb</a> package.  For more on why you would want to use a database for your data, check out <a href="http://scienceoss.com/why-should-i-use-a-database-for-my-data/">this post</a>.  Here I&#8217;ll show you how to connect to your existing MySQL database with Python.<span id="more-5"></span></p>
<h3>Assumptions</h3>
<ul>
<li>You have a MySQL database running (<a href="http://scienceoss.com/why-should-i-use-a-database-for-my-data/">more info here</a>).</li>
<li>You have the <a href="http://sourceforge.net/projects/mysql-python">MySQLdb</a> package installed for Python.</li>
<li>The database is running on <span class="c">localhost</span>, the user is <span class="c">root</span>, and the password is <span class="c">p@55w0rd</span>.</li>
<li>You <a href="http://www.w3schools.com/sql/default.asp">know some SQL</a> (at least enough to appreciate some of these examples)</li>
</ul>
<h3>Caveats</h3>
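<p>While the code below is specific to MySQLdb, other Python database APIs follow the same interface (outlined in <a href="http://www.python.org/dev/peps/pep-0249/">PEP 249</a>), so you should be able to use much the same syntax with them.  The main thing that can differ between drivers is the placeholder style used in queries.</p>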
<p>For more details, see the <a href="http://mysql-python.sourceforge.net/MySQLdb.html">official documentation for MySQLdb</a>.  Here I&#8217;m just trying to explain things slightly differently.</p>
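<p>To give a feel for how portable the PEP 249 pattern is, here is a minimal sketch that uses Python&#8217;s built-in <span class="c">sqlite3</span> module instead of MySQLdb.  That substitution is mine, chosen only so the example runs without a server; the table name and values are made up, and note that <span class="c">sqlite3</span> uses <span class="c">?</span> placeholders where MySQLdb uses <span class="c">%s</span>.</p>
<pre class="brush: python; title: ; notranslate">import sqlite3  # standard-library PEP 249 module, standing in for MySQLdb

# An in-memory database replaces the running server (assumption for this sketch).
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

cursor.execute('CREATE TABLE demo (name TEXT, zipcode INTEGER)')
# sqlite3 uses ? placeholders; MySQLdb uses %s for the same purpose.
cursor.execute('INSERT INTO demo (name, zipcode) VALUES (?, ?)', ('Bob', 123))
cursor.execute('SELECT name, zipcode FROM demo')
rows = cursor.fetchall()
print(rows)
connection.close()</pre>
<p>The connect / cursor / execute / fetchall sequence is the same one used with MySQLdb in the rest of this post.</p>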
<h2>Example usage</h2>
<h3>Import MySQLdb, and connect to the database</h3>
<pre class="brush: python; title: ; notranslate">import MySQLdb
my_connection = MySQLdb.connect(host='localhost', user='root', passwd='p@55w0rd')
cursor = my_connection.cursor()</pre>
<p>That&#8217;s it!  You&#8217;re ready to start sending SQL statements to your MySQL database!</p>
<h3>The cursor is everything!</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Cursor_(databases)">cursor</a> now contains all the information it needs to send information to and get information from the running MySQL server.  It&#8217;s your key to the database.</p>
<p>The two most often-used methods of the MySQLdb cursor are</p>
<ol>
<li><span class="c"><strong>cursor.execute()</strong></span>, which executes a query (but doesn&#8217;t return the data)</li>
<li><span class="c"><strong>cursor.fetchall()</strong></span>, which fetches the data from the most recently executed query.</li>
</ol>
<p>You send commands to MySQL by passing strings of SQL statements to <span class="c">cursor.execute()</span>.  When doing so, you can take advantage of Python&#8217;s multi-line strings (delimited by triple quotes, <span class="c">'''</span> or <span class="c">&quot;&quot;&quot;</span>) and the fact that SQL syntax doesn&#8217;t care that there are newlines in the query.  Furthermore, statements sent through MySQLdb don&#8217;t need a trailing semicolon; that terminator is only required by the interactive <span class="c">mysql</span> command-line client.</p>
<h3>Interacting with the database</h3>
<h4>Create the database and a table</h4>
<p>Make a new database by sending the standard SQL query, <span class="c">'CREATE DATABASE testdb'</span>, to the server you connected to.  Note that no trailing semicolon is needed when a statement is sent through MySQLdb.</p>
<pre class="brush: python; title: ; notranslate">cursor.execute('CREATE DATABASE testdb')</pre>
<p>If you do this in an interactive session, you will notice that this method returned a long integer (<span class="c">1L</span>).  This is the number of rows affected by the statement.  Don&#8217;t worry about it quite yet.</p>
<p>Now make that new database the active one:</p>
<pre class="brush: python; title: ; notranslate">cursor.execute('USE testdb')</pre>
<p>Now create a table in the <span class="c">testdb</span> database to hold some addresses:</p>
<pre class="brush: python; title: ; notranslate">cursor.execute('''CREATE TABLE addresses (
                    name VARCHAR(20),
                    street VARCHAR(20),
                    zipcode INT,
                    city VARCHAR(20),
                    state CHAR(2)
                    )
                    ''')</pre>
<p>Note the use of triple quotes so that you can visually organize the SQL query for clarity.</p>
<h4>Import data from Python into MySQL</h4>
<p>The general syntax for passing Python data to an SQL query through the cursor is:</p>
<pre class="brush: python; title: ; notranslate">cursor.execute(SQL, tuple)</pre>
<p>where <span class="c">SQL</span> is a valid SQL statement.  If <span class="c">SQL</span> has N placeholders of the form <span class="c">%s</span>, then <span class="c">tuple</span> must have length N.  Hopefully an example will make more sense.</p>
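<p>One way to keep the placeholder count and the tuple length in step is to generate the placeholders from the list of columns.  This is only a sketch: <span class="c">build_insert</span> is a hypothetical helper of mine, not part of MySQLdb, and the table and column names it interpolates must come from your own code, never from user input.</p>
<pre class="brush: python; title: ; notranslate">def build_insert(table, columns):
    # One %s per column, so the tuple given to cursor.execute()
    # must contain exactly len(columns) values.
    placeholders = ', '.join(['%s'] * len(columns))
    return 'INSERT INTO {0} ({1}) VALUES ({2})'.format(
        table, ', '.join(columns), placeholders)

columns = ['name', 'street', 'zipcode', 'city', 'state']
sql = build_insert('addresses', columns)
row = ('Bob', '123 Elm Street', 123, 'Newark', 'NJ')

# The number of placeholders and the length of the tuple have to agree.
assert sql.count('%s') == len(row)
print(sql)</pre>
<p>The resulting string and tuple could then be handed to <span class="c">cursor.execute(sql, row)</span>.</p>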
<p>Let&#8217;s create some Python lists that we&#8217;ll import into this table.  The beauty of it is that these data could have been parsed from a text file with hundreds or thousands of names, and we can import them into the database automatically.  For now we&#8217;ll just enter three records though.</p>
<p>Here&#8217;s the data that will go into the database:</p>
<pre class="brush: python; title: ; notranslate">names = ['Bob', 'Alfred', 'Jen']
streets = ['123 Elm Street', '55 Ninth Ave', '1 Paved Rd']
zips = [123, 34565, 30094]  # not 00123: Python 2 reads a leading zero as octal, Python 3 rejects it
cities = ['Newark', 'Salinas', 'Los Angeles']
states = ['NJ', 'CA', 'CA']</pre>
<p>And here&#8217;s how to get that data into the <span class="c">addresses</span> table:</p>
<pre class="brush: python; title: ; notranslate">cursor.executemany('''INSERT INTO addresses
                     (name, street, zipcode, city, state)
                     VALUES
                     (%s, %s, %s, %s, %s)''',
                     zip(names, streets, zips, cities, states))</pre>
<p>A couple of things to note here:</p>
<ul>
<li>This time we used <span class="c">cursor.executemany()</span>, which accepts a sequence of rows as input (here, the tuples produced by <span class="c">zip()</span>), instead of <span class="c">cursor.execute()</span>.</li>
<li>There were 5 fields into which we inserted data (name, street, zipcode, city, and state).</li>
<li>There were 5 <span class="c">%s</span> placeholders in the SQL query.</li>
<li>Even though zipcode is an INT field and not a string, we used %s.  This will always be the case:<em> use %s as a placeholder no matter what the datatype</em>.</li>
<li>There were 5 lists that were zipped together.  They need to be zipped so that the result is one tuple per record, each of length 5.</li>
<li>The order in which these lists were zipped corresponded to the fields into which they were to be inserted.</li>
</ul>
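<p>If it isn&#8217;t obvious what zipping does to the five lists, this small sketch prints the rows that end up being passed to <span class="c">executemany()</span>.  It is plain Python and needs no database; the <span class="c">list()</span> call is only there because <span class="c">zip()</span> returns an iterator on Python 3.</p>
<pre class="brush: python; title: ; notranslate">names = ['Bob', 'Alfred', 'Jen']
streets = ['123 Elm Street', '55 Ninth Ave', '1 Paved Rd']
# 123 rather than 00123: a leading zero is not a valid integer literal on Python 3.
zips = [123, 34565, 30094]
cities = ['Newark', 'Salinas', 'Los Angeles']
states = ['NJ', 'CA', 'CA']

# One tuple per record, in the same order as the column list in the INSERT.
rows = list(zip(names, streets, zips, cities, states))
for row in rows:
    print(row)</pre>
<p>Each printed row has five items, one for each <span class="c">%s</span> placeholder.</p>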
<h3>Retrieving data from the database</h3>
<p>There are two steps to retrieving data: executing the query, then fetching the results.</p>
<p>To select all addresses in California, first execute this query (it&#8217;s a one-liner so triple quoting isn&#8217;t really needed)</p>
<pre class="brush: python; title: ; notranslate">cursor.execute(&quot;SELECT * FROM addresses WHERE state = 'CA' &quot;)</pre>
<p>Alternatively, you will often want to feed Python variables into the query.  Say the state abbreviation 'CA' is saved in a Python variable called <span class="c">my_state</span>.  Then this query will do the same thing as the one above:</p>
<pre class="brush: python; title: ; notranslate">cursor.execute('''SELECT * FROM addresses WHERE state = %s''', (my_state,))</pre>
<p>By the way, note the trailing comma: <span class="c">(my_state,)</span> is a one-element tuple, matching the single <span class="c">%s</span> placeholder in the query.  Some versions of MySQLdb also accept a bare string here, but a string is itself a sequence of characters, so passing a tuple is the safer habit.</p>
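<p>PEP 249 describes the query parameters as a sequence or mapping, so a one-element tuple is the form most likely to work across drivers.  The trailing comma is what makes it a tuple, as this plain-Python sketch shows (the variable names are just for illustration):</p>
<pre class="brush: python; title: ; notranslate">my_state = 'CA'

as_tuple = (my_state,)    # trailing comma: a tuple of length 1
not_a_tuple = (my_state)  # parentheses alone: still just the string 'CA'

print(type(as_tuple).__name__, len(as_tuple))
print(type(not_a_tuple).__name__, len(not_a_tuple))</pre>
<p>The second value has length 2 because it is a two-character string, which is exactly what a driver would see if it treated the argument as a sequence of parameters.</p>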
<p>Now to retrieve the results:</p>
<pre class="brush: python; title: ; notranslate">results = cursor.fetchall()</pre>
<p>Note that a cursor object is similar to a file object or an iterator: <em>once you fetch everything, there is nothing left in the cursor to retrieve</em>.  So calling <span class="c">cursor.fetchall()</span> a second time would return an empty result until the query is executed again.</p>
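<p>Here is a small sketch of that behaviour.  It uses the built-in <span class="c">sqlite3</span> module rather than MySQLdb (my substitution, so that it runs without a server), with a cut-down two-column version of the addresses table; <span class="c">sqlite3</span> returns lists and uses <span class="c">?</span> placeholders, but the fetch-once pattern is the same.</p>
<pre class="brush: python; title: ; notranslate">import sqlite3  # stand-in for MySQLdb in this sketch

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE addresses (name TEXT, state TEXT)')
cursor.executemany('INSERT INTO addresses VALUES (?, ?)',
                   [('Alfred', 'CA'), ('Jen', 'CA'), ('Bob', 'NJ')])

cursor.execute("SELECT name FROM addresses WHERE state = 'CA' ORDER BY name")
first = cursor.fetchall()   # all matching rows
second = cursor.fetchall()  # nothing left to fetch
print(first)
print(second)
connection.close()</pre>
<p>If you need the rows more than once, keep them in a variable instead of fetching again.</p>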
<p><span class="c">results</span> is a tuple of tuples and looks like this:</p>
<pre class="brush: python; title: ; notranslate">(('Alfred', '55 Ninth Ave', 34565L, 'Salinas', 'CA'),
 ('Jen', '1 Paved Rd', 30094L, 'Los Angeles', 'CA'))</pre>
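<p>Since each row is a tuple whose items follow the column order of the table, you can unpack the fields directly in a loop.  A minimal sketch, with the results copied from above (minus the Python 2 <span class="c">L</span> suffix) and an address format I made up for illustration:</p>
<pre class="brush: python; title: ; notranslate">results = (('Alfred', '55 Ninth Ave', 34565, 'Salinas', 'CA'),
           ('Jen', '1 Paved Rd', 30094, 'Los Angeles', 'CA'))

labels = []
# Unpack in the same order as the columns: name, street, zipcode, city, state.
for name, street, zipcode, city, state in results:
    labels.append('{0}, {1}, {2}, {3} {4}'.format(name, street, city, state, zipcode))

for label in labels:
    print(label)</pre>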
<p>That&#8217;s all there is to it!  Armed with this knowledge, now you can execute queries from Python to import, retrieve, and plot data from your database.  This was a simple demo of what MySQL and Python can do, but you can construct ever-larger databases and ever-more-sophisticated queries to manipulate data in ways that would be impossible without these tools.</p>
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/mysqldb-accessing-mysql-databases-from-python/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Restructure or reformat dataframes in R with melt</title>
		<link>http://scienceoss.com/restructure-or-reformat-dataframes-in-r-with-melt/</link>
		<comments>http://scienceoss.com/restructure-or-reformat-dataframes-in-r-with-melt/#comments</comments>
		<pubDate>Sun, 23 Mar 2008 22:06:17 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[dataframe]]></category>
		<category><![CDATA[melt]]></category>

		<guid isPermaLink="false">http://scienceoss.com/restructure-or-reformat-dataframes-in-r-with-melt/</guid>
		<description><![CDATA[The basic idea is to take an R dataframe like this one containing abundance of three species at each site, and elevation at each site site sp1 sp2 sp3 elev a 3 4 9 100 b 1 8 10 210 c 4 8 15 165 and reorganize into something like this (perhaps so we can [...]]]></description>
			<content:encoded><![CDATA[<p>The basic idea is to take an R dataframe like this one containing abundance of three species at each site, and elevation at each site</p>
<pre class="prettyprint"><code class="code">site  sp1 sp2 sp3 elev
a      3   4   9   100
b      1   8   10  210
c      4   8   15  165
</code></pre>
<p>and reorganize into something like this (perhaps so we can do an ANOVA using species as a factor):</p>
<pre class="prettyprint"><code class="code">site  elev  sp  abundance
a     100  sp1  3
a     100  sp2  4
a     100  sp3  9
b     210  sp1  1
b     210  sp2  8
b     210  sp3  10
c     165  sp1  4
c     165  sp2  8
c     165  sp3  15</code></pre>
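<p>Assuming the first dataframe above is called <span class="c">d</span>, the second dataframe can be obtained using the following code.  (<span class="c">melt</span> comes from the reshape package, which ggplot2 loaded for you when this was written; with current versions you would likely need to load reshape2 yourself.)</p>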
<pre class="prettyprint"><code class="code">> library(ggplot2)
> m = melt(d, id=c('site','elev'))
</code></pre>
<p><span class="c">melt</span> works like this: You specify the ID variables, which are those variables that will REMAIN as dataframe variables.  Any others will be considered measured variables.  If it&#8217;s easier for your data, you can do it the other way: specify the measured variables and the others will be considered ID variables.</p>
<p>Melting results in two new variables, <span class="c">variable</span> and <span class="c">value</span>.  <span class="c">variable</span> contains the names of the original columns of the dataframe as factors, and <span class="c">value</span> contains the corresponding values.</p>
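<p>If it helps to see the reshaping spelled out step by step, here is a rough sketch of the same idea in plain Python.  This is not how <span class="c">melt</span> is implemented in R; <span class="c">melt_rows</span> is a hypothetical helper of mine that treats the dataframe as a list of dictionaries, one per row.</p>
<pre class="brush: python; title: ; notranslate">def melt_rows(rows, id_vars):
    # rows: list of dicts, one per row of the wide table (hypothetical layout).
    # Every column that is not an ID variable is treated as a measured variable.
    measured = [key for key in rows[0] if key not in id_vars]
    melted = []
    for variable in measured:
        for row in rows:
            record = {key: row[key] for key in id_vars}
            record['variable'] = variable     # name of the original column
            record['value'] = row[variable]   # the value that column held
            melted.append(record)
    return melted

d = [
    {'site': 'a', 'sp1': 3, 'sp2': 4, 'sp3': 9, 'elev': 100},
    {'site': 'b', 'sp1': 1, 'sp2': 8, 'sp3': 10, 'elev': 210},
    {'site': 'c', 'sp1': 4, 'sp2': 8, 'sp3': 15, 'elev': 165},
]
m = melt_rows(d, id_vars=['site', 'elev'])
for record in m:
    print(record)</pre>
<p>Three rows with three measured columns become nine rows, each keeping its <span class="c">site</span> and <span class="c">elev</span>.</p>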
<h3>Another example</h3>
<p>Here&#8217;s another example using the built-in dataset, airquality.  First, unmelted:</p>
<pre class="prettyprint"><code class="code"># make all the variable names lowercase
names(airquality) <- tolower(names(airquality))
head(airquality)
  ozone solar.r wind temp month day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
</code></pre>
<p>and melted:</p>
<pre class="prettyprint"><code class="code">> head(melt(airquality,id=c('month','day')))
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
5     5   5    ozone    NA
6     5   6    ozone    28</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/restructure-or-reformat-dataframes-in-r-with-melt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why should I use a database for my data?</title>
		<link>http://scienceoss.com/why-should-i-use-a-database-for-my-data/</link>
		<comments>http://scienceoss.com/why-should-i-use-a-database-for-my-data/#comments</comments>
		<pubDate>Sat, 22 Mar 2008 00:23:06 +0000</pubDate>
		<dc:creator>ryan</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[databases]]></category>

		<guid isPermaLink="false">http://scienceoss.com/why-should-i-use-a-database-for-my-data/</guid>
		<description><![CDATA[Quick answer: when you have large amounts of inter-related data. Long answer: To use a database, you&#8217;ll have to learn another language, so you&#8217;ll have to decide for yourself if that &#8220;intellectual investment&#8221; is worth it. You also have to decide if you have enough data to make it worth the investment. As far as [...]]]></description>
			<content:encoded><![CDATA[<p>Quick answer: when you have large amounts of inter-related data.</p>
<p>Long answer: <span id="more-99"></span></p>
<p>To use a database, you&#8217;ll have to learn another language, so you&#8217;ll have to decide for yourself if that &#8220;intellectual investment&#8221; is worth it.  You also have to decide if you have enough data to make it worth the investment.</p>
<p>As far as learning a new language, SQL is one of the easier languages to learn (you can probably figure out what the following SQL query does: SELECT title FROM books WHERE author = &#8216;John Tukey&#8217;).</p>
<p>As far as how much data is enough to justify a database, well that&#8217;s pretty subjective.  But let me give you a specific example that might help guide your decision.</p>
<h3>An example from personal experience</h3>
<p>I was working on a project where we put out temperature loggers at many different sites.  Every two weeks or so the temperature loggers would be moved to a different site, and we would download the data and keep them as text files.  </p>
<p>Now, if there were just one or two temperature loggers, I probably wouldn&#8217;t make a database.  If I wanted to compare sites, it would be easy enough to load the one or two text files from each site and compare them.</p>
<p>If I had, say, 10 temperature loggers but they stayed in the same place all the time, then I probably would write a Python script that read the data in from the text files each time I wanted to analyze the data.  Still not enough to justify a database in my opinion.</p>
<p>But I had 48 temperature loggers all at different locations.  Keeping track of text files would quickly get out of hand.  Plus, there was other information that went along with the different sites.</p>
<p>Having extra data that went along with the sites is important.  That information was NOT contained within the temperature logs; it was in my field notebook (and ultimately in a spreadsheet).  Without a relational database, it would be difficult to pull out the temperatures where the sites were mud as opposed to sand.  In fact, it would be awkward to do that if I only had 10 temperature loggers, let alone 48.</p>
<p>When you have this sort of data &#8212; where some parts can be related to other parts &#8212; it&#8217;s a good sign you should be using a database.</p>
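<p>To make the mud-versus-sand example concrete, here is a minimal sketch of the kind of query a relational database makes easy.  Everything in it is invented for illustration: the table and column names, the sites, and the temperatures.  It uses Python&#8217;s built-in <span class="c">sqlite3</span> module so that it runs without installing a database server.</p>
<pre class="brush: python; title: ; notranslate">import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# One table for the field-notebook information, one for the logger readings.
cursor.execute('CREATE TABLE sites (site TEXT, substrate TEXT)')
cursor.execute('CREATE TABLE temperatures (site TEXT, temp_c REAL)')
cursor.executemany('INSERT INTO sites VALUES (?, ?)',
                   [('A', 'mud'), ('B', 'sand'), ('C', 'mud')])
cursor.executemany('INSERT INTO temperatures VALUES (?, ?)',
                   [('A', 14.5), ('A', 15.0), ('B', 18.5), ('C', 13.0)])

# The JOIN relates each reading to its site, so we can filter on substrate.
cursor.execute('''SELECT t.site, t.temp_c
                  FROM temperatures AS t
                  JOIN sites AS s ON s.site = t.site
                  WHERE s.substrate = ?
                  ORDER BY t.site, t.temp_c''', ('mud',))
mud_temps = cursor.fetchall()
print(mud_temps)
connection.close()</pre>
<p>Doing the same thing with a pile of text files and a spreadsheet would mean writing the matching logic by hand.</p>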
<p>This is only an example of one case where you might want to use a database.  You could use a database for anything from organizing recipes to clinical studies to meta-analysis of literature.</p>
<h3>Next steps</h3>
<p>Once you&#8217;ve decided to use a database, you&#8217;ll have to decide which one to use.  The two most popular open source ones are <a href="http://www.postgresql.org/">PostgreSQL</a> (here&#8217;s how to <a href="http://www.postgresql.org/files/postgresql.mp3">pronounce it</a>) and <a href="http://www.mysql.com/">MySQL</a>.</p>
<p>To get started, check out <a href="http://www.apachefriends.org/en/xampp-windows.html">XAMPP</a> for Windows, <a href="http://www.mamp.info/en/index.php">MAMP</a> for Mac, or <a href="http://www.linuxhelp.net/guides/lamp/">LAMP</a> for Linux.  Note you only need Apache for using phpMyAdmin, the user-friendly, browser-based database management program.  You might try <a href="http://www.mysql.com/products/tools/query-browser/">MySQL Query Browser</a> instead of running Apache and phpMyAdmin.</p>
<p>Next up is <a href="http://www.tomjewett.com/dbdesign/dbdesign.php?page=intro.html">designing your database</a>.  Databases are just a bunch of separate tables, and they are not truly connected until you perform a query.</p>
<p>Once everything is ready, you&#8217;ll want to import your data into it.  My personal preference is to use Python to parse data files, and then insert the data into the database using something like MySQLdb (a way for Python to talk to a MySQL database).  If you have simple files and don&#8217;t need the flexibility of Python, you can use MySQL&#8217;s <a href="http://dev.mysql.com/doc/refman/5.0/en/load-data.html">LOAD DATA</a> command to do it for you.</p>
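<p>As a rough idea of the parsing step, here is a sketch that turns comma-separated logger text into a list of tuples, which is the shape a cursor&#8217;s <span class="c">executemany()</span> expects.  The file layout, column names, and readings are assumptions of mine; a real logger file will need its own parsing.</p>
<pre class="brush: python; title: ; notranslate">import csv
import io

# Invented sample standing in for the contents of a downloaded logger file.
raw_text = ('site,timestamp,temp_c\n'
            'A,2008-03-01 00:00,14.5\n'
            'A,2008-03-01 01:00,14.0\n')

reader = csv.DictReader(io.StringIO(raw_text))
# One tuple per reading, with the temperature converted from text to a number.
rows = [(rec['site'], rec['timestamp'], float(rec['temp_c'])) for rec in reader]
print(rows)</pre>
<p>With a real file you would pass an open file object to <span class="c">csv.DictReader</span> instead of the <span class="c">io.StringIO</span> wrapper.</p>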
<h3>Final notes</h3>
<p>Keep in mind that taking the plunge and porting your data into a database can be a large project.  But the flexibility you gain goes a long way: a database lets you easily call up combinations of data that would be laborious to assemble without it.</p>
]]></content:encoded>
			<wfw:commentRss>http://scienceoss.com/why-should-i-use-a-database-for-my-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
<enclosure url="http://www.postgresql.org/files/postgresql.mp3" length="5747" type="audio/mpeg" />
		</item>
	</channel>
</rss>

