I have to analyse some data from word association tests I've done with a number of subjects, and I'd like to learn how to use a database to help me organize the data cleanly. There may be a way to do this nicely using a simple spreadsheet as well, which I would not be opposed to, since I actually have some experience working with spreadsheets, and database work is completely new to me.
The only software I have right now is Open Office on my Macbook. I am not fully opposed to purchasing something if it's necessary, but I'm nearly certain I should be able to do what I want to do with what Open Office offers. I just can't seem to figure it out. This has to be possible, right?
I have some data for each subject of my study, and then their responses to a series of prompt words. So a simple table might look like
Subject Name - Age - Location - English Level - Response to Prompt 1 - Response to Prompt 2 - Response to Prompt 3
The issue is that for each response I'd also like to add an additional field wherein I classify the response according to various types. These might look like:
Response to Prompt 1 (black)
white (opposite)
horse (set phrase)
hack (similar sound)
I will then go and analyze the data based on a number of different criteria. For example, who gave the most "opposite" responses? Did younger subjects give more "similar sound" responses?
The number of subjects in this study is small enough that I could go through it all by hand and do this with a few different spreadsheets. But I suspect learning how to use database software will go a long way toward making this faster, as well as reducing the likelihood of errors.
Any advice on how to go about storing this data would be appreciated!
Posts
Work yourself through this tutorial and maybe have a look/listen at a youtube video, though my cursory search for one brought up a lot of useless amateur videos. (
Basically that pdf tutorial I have linked will walk you through setting up a database, and you could probably just swap the info you want into the examples and just go from there. If you run into specific questions, though, that will be where we (or just me?) can help.
I see 2 tables. Table Subject with fields 1: auto number (primary key), name (first) name (last) Age Location English Level
Table 2 Response 1 (drop down list) Response 2 (drop down list) Response 3 (drop down list)
You could do this within a spread sheet as table 1 = worksheet 1, and table 2 = worksheet 2.
Then, you can use filters to get the various responses (queries) to questions like age = 23 and response 1 = horse and english level = proficient.
MySQL is free, and fairly easy to use to build from spreadsheets. OracleSQL is also free, but runs on java, and Finally Microsoft SQL is also free.
I would build your spreadsheets (row 1 is header) and then import into any of these database programs.
Then, for really fancy, you can use FlySpeed SQL Query to throw visual queries of the related tables and fields and it will give you both results and the SQL built to run those results (SQL = structured query language, or how do i ask database)
As for UncleChet's advice, I'm not sure that I follow. I understand table 1 just fine, but I really don't understand table 2. Would I need to create a new table 2 for each prompt word?
Subject table with the following fields. These records would be unique per subject.
SubjectID
SubjectName
Age
EnglishLevel
The second table would have the responses. There will be 3 records for every subject, per response.
SubjectID
PromptNumber
Response
ResponseType
You would then link the tables in whatever queries you would use through the SubjectID. This way, you can aggregate total number of response types over the whole set, or to whatever level you like (response number, subject, etc.).
Depending on what tool you might be using, I might be able to provide extra guidance.
Basic databases are a series of tables of information that are linked somehow. You probably already know that. There's two main reasons why you'd want to do this.
1. You have a lot of data.
Imagine you're running a Gym of 20 members. For each member, you have 3 sets of data which are used for different purposes:
A) Their gym routines (pullups, weights, cardio, etc.) - helps you decide what gear to buy, or whether enough people would be interested in group classes for a particular exercise etc.
Their billing information - helps you decide if you should offer per session payments, monthly autopayments, annual memberships etc.
C) Their demographics (age, gender, etc.) - helps you target your advertising and marketing to who seems to like your gym.
You've only got 20 members, so you say "Screw it, I'm just going to jam it all in one table". One day you're approached by National Gym Franchise - "Your system is amazing! You always know what classes to schedule, where to advertise, and what billing packages to offer! We want you to take your system and put our 800,000 national members in your database".
Now whenever you run your Excel-equivalent on your 1 Table to get your reports and analytics, you can go and have lunch in the time it takes for it to go through the Demographics AND the Billing AND the Routines - even though each individual report will ignore 2/3 of the information when it runs. "Sheesh", you think. "It would have been so much easier to break it into 3 chunks right back at the start so it'd run in a fraction of the time".
In your case, you seem to have two main divisions of data - the demographics (1) and the test results (2). Since you mention you can enter this all by hand, it doesn't sound like you have enough data where multiple tables would give a performance bump, but people who regularly work with databases would probably do it anyway because (a) It's good practice, and (b) You never know when you might get bought out by Facebook.
2. Your data is collected and laid out in different formats.
Now, instead of a Gym imagine you're running a Bank of 20 members. When each member signed up, they were assigned a Bank Account number, you took all their contact information, assigned them an account type, and gave them a Bank of Garroad_Ran creditcard.
Every day, these members go about their daily routines... with their creditcards. Some may use it hardly ever, some may use it multiple times per day. At the end of each day, your system spits out a list of all the transactions which your members have undertaken during the day that you have to settle. So you have 2 tables - 1 of all your members and their personal details, 1 of all the transactions (in which members may be listed multiple times or not at all) - laid out completely differently but still linked because both contain the Bank Account numbers.
In your case, all the information is collected at the same time in the same format so it's not particularly necessary to do the extra work to separate them out.
Excel and Excel-Equivalents (like Calc) can act as multitable databases with their Sheets. Sheet 1 would just contain all the Demographics, while Sheet 2 would contain all the Test Scores (with each sheet having their SubjectID on the left column so when you manipulate the data you can still work out which test scores go with which demographic data). If you're going to use Pivot Tables (or equivalent), it may be easier just to keep it all in one sheet.
Hope that's useful!
Unfortunately, due to travel and lack of time it's going to take me a few weeks to really make this happen. I'll post again if and when I get it all figured out!
Reasons I enjoy Google Spreadsheet:
The most basic con is that it's Google and that there are some limitations in regards to how much data you can call upon in Spreadsheets. In my own work, I am experiencing some tendency for my code to 'hang' due to server capacity limitations on Google's end (but then again, it's free so I can't really complain).
There are a few other reasons to use a database, for example:
- if there may be multiple sources that are all trying to modify data and/or read data at the same time
- databases can generally guarantee that you won't lose data because your machine crashes
- if you need to run queries on your data very quickly and want it to handle a lot of the optimization for you
The thing that strikes me about your post is it's a little unclear what you're hoping to get help on other than organizing your data, which is why some of the responses went directly into the table schema (organization).
It seems like there might be several things you might be asking about:
- how to set up a database to use (e.g., what software to use and how to get it configured)
- how to organize your data in the database (e.g., the schema)
- how to "use" a database (e.g., the commands to actually get the data into the database and how to access your data)
Is it just one of those things or all of them or some subset? We might be able to get you connected to better resources for any of those steps if you can provide a bit more detail on where you're at