Why should I use a database for my data?
Quick answer: when you have large amounts of inter-related data.
Long answer:
To use a database, you’ll have to learn another language, so you’ll have to decide for yourself if that “intellectual investment” is worth it. You also have to decide if you have enough data to make it worth the investment.
As far as learning a new language, SQL is one of the easier languages to learn (you can probably figure out what the following SLQ query does: SELECT title FROM books WHERE author = ‘John Tukey’).
As far as how much data is enough to justify a database, well that’s pretty subjective. But let me give you a specific example that might help guide your decision.
An example from personal experience
I was working on a project where we put out temperature loggers at many different sites. Every two weeks or so the temperature loggers would be moved to a different site, and we would download the data and keep them as text files.
Now, if there were just one or two temperature loggers, I probably wouldn’t make a database. If I wanted to compare sites, it would be easy enough to load the one or two text files from each site and compare them.
If I had, say, 10 temperature loggers but they stayed in the same place all the time, then I probably would write a Python script that read the data in from the text files each time I wanted to analyze the data. Still not enough to justify a database in my opinion.
But I had 48 temperature loggers all at different locations. Keeping track of text files would quickly get out of hand. Plus, there was other information that went along with the different sites.
Having extra data that went along with the sites is important. That is information that was NOT contained within the temperature logs, it was in my field notebook (and ultimately in a spreadsheet). Without a relational database, it would be difficult to pull out the temperatures where the sites were mud as opposed to sand. In fact, it would be awkward to do that if I only had 10 temperature loggers, let alone 48.
When you have this sort of data — where some parts can be related to other parts — it’s a good sign you should be using a database.
This is only an example of one case where you might want to use a database. You could use a database for anything from organizing recipes to clinical studies to meta-analysis of literature.
Next steps
Once you’ve decided to use a database, you’ll have to decide which one to use. The two most popular open source ones are PostgreSQL (here’s how to pronounce it) and MySQL.
To get started, check out XAMP for Windows, MAMP for Mac, or LAMP for Linux. Note you only need Apache for using phpMyAdmin, the user friendly, browser-based database management program. You might try MySQL Query Browser
instead of running Apache and phpMyAdmin.
Next up is designing your database. Databases are just a bunch of separate tables, and they are not truly connected until you perform a query.
Once everything is ready, you’ll want to import your data into it. My personal preference is to use Python to parse data files, and then insert the data into the database using something like MySQLdb (a way for Python to talk to a MySQL database). If you have simple files and don’t need the flexibility of Python, you can use a command like MySQL’s LOAD DATA command to do it for you.
Final notes
Keep in mind that taking the plunge and porting your data into a database can be a large project. But the flexibility you gain by using a database allows you to easily call up combinations of data that would be laborious without it goes a long way.
Tags: data management, databases