Organizing data: help me be a respectable scientist.
So, I am currently trying to get a PhD (operative word: TRYING) in neuroscience. So far, so good, but I am running into trouble when organizing all my data. My boss always used Excel, but I loathe that program and getting data in there generates infinite workbooks where you have to scroll forever until the sweet release of death embraces you. It's also very inefficient, since you need a cell for each variable and not all types of data have all the variables (those different types of data still need to stick together, though).
So, since there seems to be a good number of knowledgeable people here about technology and such, I was wondering whether any of you knows of some tool that could make this easier. What I need is some kind of hierarchical organizer. Each particular data point must have a number of "sub points" (each with their particular variables) stored within it and must have some metadata attached to it.
Is there anything out there that I can use that does not require l33t programming skills? I'm literate in matlab, but I don't know if I have the time to learn a programming language from scratch.
That said, there are definitely other programs; likely much better. Sorry I can't suggest anything specific; I spent my PhD time juggling matlab and excel :P
I love my job security and everything but it pains me to see that scientists aren't trained to learn even rudimentary programming skills. Don't be like that.
EDIT: I missed the MATLAB part here. It should suffice for at least CSV format.
It's not about lack of will, you see, it's about time. I have to dedicate my time to my actual primary purpose, which is research. I already design my own Matlab analysis tools and if I had to learn a new language for every tool I want to use I would never get any actual work done.
So I'm sorry for your pain, but if I can find an easier way to do this than learning Python, I totally will use it.
This is a classic "teach a man to fish" type problem. It's a false economy to think that finding a "quick-fix" solution in the form of some GUI tool is going to solve your problem. How long is it going to take you to learn this new tool? Deal with its bugs? What if your requirements vary slightly from the specifications, and now you're scouring some dedicated help forums on another website for hours or days waiting for some hacky patch or plugin?
Your stated primary purpose, research, is increasingly becoming computational. I'm not saying you have to be a machine learning maven, but you're seriously crippling yourself by not knowing at least one general programming language; you don't need to "learn a new language for every tool you want to use." Not trying to start a D&D thread, but I see fellow scientists struggling with this all the time, and this is my honest advice for them. The alternative is to become a professor and use a high-level language (i.e., your grad students) to do it for you :P.
He already said he knows MATLAB, which is a general programming language and widely used in Neuroscience (and other) research. Based on what I've seen in the neuroimaging labs I've worked in, he's already ahead of the curve if he knows MATLAB.
Australopitenico, you could try using a database to organize the data (MySQL/PostgreSQL are both common open source options, but even an Access database could work fine). For most of our questionnaire/demographic/etc data, we have it entered in a MySQL database based on subject ID, visit date or number, etc. Though honestly, for the majority of work that actually gets done with the data, it ends up in either Excel, SPSS, or just text CSV files (for analysis in MATLAB or R) eventually.
What you need to understand about biologists and biology-based scientists is that unless the term "engineer" comes after the name, programming is not usually part of training or education or even something you'll need day-to-day in the workplace pretty much ever, so saying it's part of basic scientific literacy is just mind-blowingly off. You should probably can the reproachful tone since it sounds like you don't actually have an idea that doesn't include "learn programming". I'll admit that my statement comes off sounding a lot more like a suggestion than it actually is, but I assure you it is not one.
This is H&A, right? Through undergrad and into graduate school I worked with several "non-engineers" on various multidisciplinary projects including with "non-scientists" in humanities such as linguistics. I gave my honest advice about what this guy should do given what I've seen in these projects. Any further discussion on this will be a debate. I'll be happy to defend my position any day, but it is obvious that because people don't like this opinion then it will be construed as a form of trolling.
This is an extremely violent response to what basically boils down to me encouraging someone to learn a new skill. If I'm working on a new project involving linguistics, I don't get all huffy if someone suggests that I take a basic course about grammars and lexicons. I'm guessing I would get a similar response if I wanted to work on a neuroscience-related topic, but people wouldn't be nearly as vehement about touting the impracticality of me, the engineer, learning something about the structure of the brain. In the same vein, I don't think it's ludicrous of me to suggest he take a course, or mini-course on some programming language, regardless of what they do in the "workplace."
So, agree to disagree?
The question of whether everyone should know how to program, and to what extent, is something for another thread. I'll just say that if there exists a tool that already does what I want I am not going to build a new one from scratch, no matter how many engineers I piss off, the same way I send my defective hardware to the electronics workshop when it breaks down instead of learning electronics from scratch, as useful a skill as it may be.
Oh man, I wasn't suggesting that you write your own database program, at all. That would be stupid even if you were a "l33t programmer." If I wasted time writing my own database, not only would I not get anything done, I would probably get fired. Maybe this is why we have a misunderstanding? It was more along the lines of writing a small script (not even an application) to store/retrieve from whatever data format you choose (be it SQL, CSV, etc.). While I dislike MATLAB with a passion, it should be sufficient for this purpose. As suggested, it would be optimal for you to use a relational database of some sort, but I think the learning curve for that is quite a bit higher than for, let's say, a language like Python IMO.
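To make "a small script to store/retrieve" concrete, here is a minimal sketch in Python using only the standard csv module. The record contents and field names (site_id, firing_rate_hz, latency_ms) are made up for illustration, and an in-memory buffer stands in for a real file. It shows one way to cope with records that do not all share the same variables: missing cells are simply left empty.

```python
import csv
import io

# Hypothetical records: not every record has every variable.
records = [
    {"site_id": "S1", "firing_rate_hz": 12.5},
    {"site_id": "S2", "latency_ms": 8.0},
]

# Build the column list as the union of all keys, in first-seen order.
fieldnames = []
for rec in records:
    for key in rec:
        if key not in fieldnames:
            fieldnames.append(key)

# io.StringIO stands in for open("data.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)  # missing variables are written as empty cells

# Read the data back; note that csv returns every value as a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
```

The same file opens directly in Excel, SPSS, R or MATLAB, so nothing is locked into the script.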
If you're sick of Excel, improving your programming skills is the next logical step. To take your analogy of defective hardware...Excel does not have any actual bugs that are stopping you, it's just that you want more control. It would be like if you came to me and said that you're sick of all the extra crap that comes with buying a pre-built computer. Then, naturally I would suggest that you might want to look into building your own. Yet, no one in this thread would decry me for being an elitist douchebag for suggesting such a thing, because for some reason that is relatively common.
A language like Python would be ideal, but MATLAB is totally fine for what you're describing. I'm not a MATLAB expert, so I can't help you without going through the documentation myself; but, if you'd like a sample of a simple Python + DB solution you can PM me (you can output from the DB to whatever standard format you want).
EDIT: I guess what I'm trying to say is that if you hate Excel, then learning a high-level language, or improving your skills in MATLAB is the next step. It's the only "game in town" AFAIK. Either that, or as others have mentioned, bite the bullet and use Excel. If anyone has another solution, I, personally, would be very interested in hearing it, as such a tool would be very useful for my research as well.
It's basically a more powerful Excel.
It has the same general cell layout as Excel but with quite a number of handy functions stapled onto it.
The thing I've discovered about scientists is that they are as lazy about computer software as anybody else. Some don't like doing a lot more math than they can get away with either.
Engineers and mathematicians also have all manner of specialized commercial software for more rigorous data crunching.
Maple and Mathcad are examples where calculus can get involved. Basically they can spit out symbolic evaluations of whatever thing it is you're trying to do (at varying degrees of effectiveness). Maple seems like it can handle pretty complicated graphing tasks, though I haven't played with that program yet. Those might be what you're looking for.
I'm pretty sure those two programs are tip-of-the-iceberg though.
1) Is the issue the VOLUME of data? If so, you can set up a template and try to hide parameters that your instrument gives you that you do not need.
2) Is the issue the TYPE of data? Is this numerical data? If so, it probably belongs in excel. If it is not numerical data, it might just belong in your lab notebook until you do a manuscript.
Can you give an example of what you are generating and why you need to make a hierarchy out of it? Are you sure you even need to do that? I feel like (lol neuro people do this all the time :P) you might be unnecessarily reinventing the wheel.
I wish in my heart of hearts that we had an H/A subforum for scientists. It would be like all those molbioforums except everyone would be intelligible. :P
It's a neat idea, but I don't think it would see enough use. I toyed for about 6 months with the idea of starting a math help thread, but ultimately decided that it didn't come up often enough that there would be much traffic.
I did once start a science-y thread in SE but it only went for like 4 pages or something. I figure if it couldn't last there it probably wouldn't last here.
I would give them two options:
1. find a better system that is easier, faster, and more convenient
2. Suck it up
Sigmaplot is great but it's really more of a graphing program. it's a good one, but i wouldn't want to store anything in there since it has some weird rules to it
This makes conversion into a CSV and input into some other analysis program (like R, which I use) super easy. Perhaps this isn't what you're looking for, but I think you may be overcomplexifying things. And always make sure you know how you're analyzing your data before you start to collect it in any particular format, because converting it later is a huge pain in the arse.
Another speculative idea is to use VAMPP to run a SQL database on a local machine, and pull a Matlab file from FileExchange for SQL-Matlab communications. But make sure you lock VAMPP down to localhost access if you do that.
Just don't use Excel for graphs in publications. Please.
I'd support this as well - I basically avoid SE like the plague, so I definitely wouldn't have seen it there.
As to the original question - can you give an example of what kind of data structure you're talking about? It does (unfortunately) sound like some sort of programming language might be the easiest answer (since Matlab is ungodly terrible at dealing with text structures), but honestly learning enough Perl to do basic parsing if you know Matlab shouldn't be very difficult at all. This will depend a lot on how many layers & subgroups of data features you're talking about, though.
If it's just storage, it may be that you want something like http://www.filemaker.com/products/filemaker-pro/ , which I know at least a few labs use to maintain strain / plasmid / etc databases, and doesn't require much back-end knowledge to get up & running
Oh I'm sure we can provisionally include math help into this thread?
Is that not a thing?
I could use a service like that.
If you have an idea how it might work, what you'd like to see, or just want to express a general interest in the idea, feel free to PM me about it and we'll see what happens if I get enough interest. For now let's let this thread get back on topic.
I meant "forum."
Blah.
Anyway, OP, it seems like you already have a pretty good idea of how you want your data to be formatted, but are having trouble finding or building a tool that will convert data from one format to another. If I were you I'd start with the "bribe an undergrad CS major with pizza and beer" approach. If it turns out that the job is actually bigger than that, tell your advisor that it would help you be more efficient and do better research if he would cough up the $10/hour to pay a decent CS undergrad to do it right. If every 2 hours of undergrad work saves you 1 hour of noodling around, that's money well spent. However this would only work if you are very certain about what you want and need, otherwise it would just be another time sink.
@k-maps, it was definitely a misunderstanding, your suggestion is very sound and it would also make a good practice exercise (I had actually already started to learn Python a bit).
I will also check MySQL, Libre Office and some of the other alternatives that have been proposed here.
@mts, as you can see, I AM looking for an alternative that's easier, faster and more convenient, that is exactly the point of this post. The fact that my boss uses Excel does not mean anything, my boss was using some really unnecessarily complex data analysis procedures that I am already optimizing so that any clueless undergrad can click a button and get what we need. I don't think it's wrong to streamline the whole data processing and storage systems if I feel they are making our life harder for no reason. If I don't find any good alternatives of course I will "suck it up".
As for the data examples some guys asked me for. You see, on the one hand there is a handful of detailed data for each animal. Then for each animal you have a number of recording sites, a.k.a. the main data points, which have their own plethora of associated variables. The kicker is that those variables are each gathered from specific recordings (one file type gives you X, another gives you Y). The particular characteristics and identifier of these files must also be known and be easily associated with their particular data point (the summary of the data point is the "meat"; it is what you use for figures, statistical analysis etc.). Last but not least, the raw data from each recording are kept in a MATLAB file in a separate folder, and sometimes, depending on the analysis, they have their own small associated Excel spreadsheet.
Of course it can be done with Excel, but I just find the current method of having the animals on one spreadsheet, the associated files in another, and the SPSS-ready datapoints somewhere else impractical. I still might have to do it, but I just wanted to check how other people did it.
Yeah, you're describing an object-oriented data structure. This means that your instincts are naturally leading you to a more representationally(?)/computationally complex problem. This is a great thing! I think a lot of new generation scientists tend to think this way (probably from growing up on complex strategy/rpg video games).
I second Pure Din that you should bribe a cs undergrad. I would have happily done this for you as an undergrad...that sort of thing is a great resume builder for someone with little experience, and it is a great small project for any decent sophomore+.
But, if you're already teaching yourself Python, great! Just know that Python's object-oriented features suck, but learning Java or Scala instead would be a PITA if you have no experience. At any rate you can probably get by with just using lists and dictionaries, maybe stored as JSON. Although if you're learning SQL, even better. I would recommend SQLite over MySQL for now, as it requires zero work setting up.
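As a rough sketch of the "lists and dictionaries, maybe stored as JSON" idea, using only Python's standard json module: the animal, site and file names and all values below are invented for illustration, loosely following the animal / recording site / file layout described earlier in the thread.

```python
import json

# Hypothetical data: one animal, its recording sites, and the files each
# site's summary came from. All names and numbers are made up.
animals = [
    {
        "animal_id": "A01",
        "metadata": {"sex": "F", "weight_g": 310},
        "sites": [
            {
                "site_id": "A01-S1",
                "summary": {"firing_rate_hz": 12.5},
                "files": [{"name": "A01_S1_spikes.mat", "type": "spikes"}],
            },
            {
                "site_id": "A01-S2",
                # Sites do not have to share the same variables.
                "summary": {"latency_ms": 8.0},
                "files": [],
            },
        ],
    }
]

# Round-trip through JSON text, as you would when saving to and loading
# from a file on disk.
text = json.dumps(animals, indent=2)
restored = json.loads(text)

# Flatten to one row per recording site, e.g. for export to CSV or SPSS.
rows = [
    {"animal_id": a["animal_id"], "site_id": s["site_id"], **s["summary"]}
    for a in restored
    for s in a["sites"]
]
```

The nesting keeps each site's variables and files together with its animal, and the flattening step at the end produces the kind of table a stats package expects.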
Good luck with your endeavors, and PM me if you run into any problems.
yea, i totally see where you are coming from. finding the perfect management tool is like finding a gryffin. maybe see if you guys have a license for SPSS or whatever they are calling it now. it can do a shit ton of variables and the benefit is you can do all your stats in it. requires a bit of tinkering setting everything up but once you get it going you can just copy and paste directly from excel
most universities will have an enterprise license for it, though if that fails you can get a student license i think.
that may actually be your best bet. plus with some basic programming you can do some complex shit with it, though i may be thinking of mplus
I'd offer a few suggestions:
1) Rather than finding a CS major, see if there is anyone in your department or nearby that would be willing to handle your data management issues in exchange for an author listing or an acknowledgement. This almost always worked for me before I learned for myself.
2) SAS is very easy to learn and very powerful. Your uni likely has a deal where your lab can get it for ~$50, and The Little SAS Book will teach you everything you need to know. It can handle its own language or SQL, and will be a valuable tool in the future. It may be overkill for your current dilemma but IMO it is the easiest program to learn, and it has a nice GUI that will do a lot of your work for you.
3) I disagree respectfully but strongly about biologists not needing to be able to handle large datasets. Granted, my background is genetics, but in my opinion Big Data will become more and more important in the near future. Knowing basic SQL, Python or Perl will make you infinitely more marketable. I know it's not pertinent to the current problem, but it's my two cents.
IMO, there's the kicker. I work with those types of data structures pretty often. Excel would not be the right tool for this.
I think that a database with different tables is what you need.
Here is an example of a file system table that can contain metadata for each file stored. Files, in this case, are exactly like the files and folders stored on your hard drive. One file can reside in another file (which is a folder), for example. Folder item A has File item B under it, therefore B has a parent of A. Sort of a pyramid architecture, know what I mean?
Here is my SQL table definition (scaled down):
Now, for your meta-data, you could extend and carry over the ItemID into another table (disregard the data types):
You then need an engine and a front-end to manage this data. You could use SQL scripts to do it all. You could also use any language that can support SQL Server data access.
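For illustration only, here is a minimal sketch of the parent/child idea above, using Python's built-in sqlite3 module as the "engine". Only ItemID comes from the post; the table names Items and ItemMeta, the ParentID, Name, MetaKey and MetaValue columns, and the sample rows are all assumptions, not the poster's actual definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: an item may have a parent item (folder A holds
# file B), and a second table carries metadata keyed by ItemID.
conn.executescript("""
CREATE TABLE Items (
    ItemID   INTEGER PRIMARY KEY,
    ParentID INTEGER REFERENCES Items(ItemID),  -- NULL for a top-level item
    Name     TEXT NOT NULL
);
CREATE TABLE ItemMeta (
    ItemID    INTEGER NOT NULL REFERENCES Items(ItemID),
    MetaKey   TEXT NOT NULL,
    MetaValue TEXT,
    PRIMARY KEY (ItemID, MetaKey)
);
""")

# Folder item A has file item B under it, so B has a parent of A.
conn.execute("INSERT INTO Items (ItemID, ParentID, Name) VALUES (1, NULL, 'A')")
conn.execute("INSERT INTO Items (ItemID, ParentID, Name) VALUES (2, 1, 'B')")
conn.execute("INSERT INTO ItemMeta VALUES (2, 'file_type', 'mat')")

# List each child together with its parent.
children = conn.execute(
    "SELECT c.Name, p.Name FROM Items AS c "
    "JOIN Items AS p ON c.ParentID = p.ItemID"
).fetchall()

# Collect the metadata attached to item B as a dictionary.
meta = dict(
    conn.execute(
        "SELECT MetaKey, MetaValue FROM ItemMeta WHERE ItemID = ?", (2,)
    ).fetchall()
)
```

A key/value metadata table is one way to let different items carry different variables without adding a column for each; a fixed-column table would work just as well if every item has the same fields.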
Many thanks @k-maps and @YoSoyTheWalrus for the software recommendations, and to @Bendit for the snippets of code. And you are right, Walrus, in my past as a field biologist data structures were pretty simple, but neuroscience is a totally different animal.
And about a science- or science careers-related D&D thread, I think it would be an awesome idea.
i don't know at this point, the best science careers advice right now is probably not to get into it. at least until things change funding wise
Without knowing the full details of your study, I'm going to add another vote to the SQL camp. For what you are working with, it should be relatively straightforward to build a simple database. At most a couple dozen tables, probably far fewer depending on the specifics of your study...MySQL or PostgreSQL are good free options...plus you've always got the standard MS & Oracle offerings - getting them isn't usually an issue if you are public and doing research.
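To give a feel for how small such a database can be, here is a hedged sketch using SQLite through Python's built-in sqlite3 module (swapped in here because it needs no server; the same SQL idea carries over to MySQL or PostgreSQL). The four tables, their columns and the sample rows are guesses based on the animal / recording site / file description earlier in the thread, not a finished design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: animals own sites, sites own recordings, and each
# summary measurement remembers which recording file it came from.
conn.executescript("""
CREATE TABLE animal (
    animal_id TEXT PRIMARY KEY,
    notes     TEXT
);
CREATE TABLE site (
    site_id   TEXT PRIMARY KEY,
    animal_id TEXT NOT NULL REFERENCES animal(animal_id)
);
CREATE TABLE recording (
    file_name TEXT PRIMARY KEY,
    site_id   TEXT NOT NULL REFERENCES site(site_id),
    file_type TEXT NOT NULL
);
CREATE TABLE measurement (
    site_id   TEXT NOT NULL REFERENCES site(site_id),
    variable  TEXT NOT NULL,
    value     REAL,
    file_name TEXT REFERENCES recording(file_name),
    PRIMARY KEY (site_id, variable)
);
""")

# Made-up example rows.
conn.execute("INSERT INTO animal VALUES ('A01', 'example animal')")
conn.execute("INSERT INTO site VALUES ('A01-S1', 'A01')")
conn.execute("INSERT INTO recording VALUES ('A01_S1_x.mat', 'A01-S1', 'type_x')")
conn.execute("INSERT INTO measurement VALUES ('A01-S1', 'X', 1.5, 'A01_S1_x.mat')")

# One row per measurement, with its animal and source file attached.
rows = conn.execute("""
    SELECT a.animal_id, s.site_id, m.variable, m.value, m.file_name
    FROM measurement AS m
    JOIN site AS s ON s.site_id = m.site_id
    JOIN animal AS a ON a.animal_id = s.animal_id
    ORDER BY s.site_id, m.variable
""").fetchall()
```

The final query is the "SPSS-ready" view: it can be regenerated at any time instead of being maintained by hand in a separate spreadsheet.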
If I were working with you, SQL would be a given, and all my questions would be based on your detailed requirements for data capture / entry and reporting / analysis. It's really just a matter of choosing the right tool for your application - there are countless ones out there, and they all work with the standard MySQL / PostgreSQL / Oracle database types.
A piece of advice...data management is a big part of any research, and it's only getting more important. In most labs, having a passing knowledge of your data tools is enough to make you the guru, and learning how they actually work is a major skill that offers a lot of benefits down the line. If you can develop your data management skills to a moderate level, you'll be able to use those skills and your degree to leverage a number of extremely lucrative positions. People who can do research are a dime a dozen. People who can do research and understand how to fully utilize their tools make A LOT of money and are always in high demand. At this point in history...don't discount having a fallback outside of research either. If you go IT with an unrelated PhD? Talk about $$$.
Another thing that I find a bit painful to suggest...depending on the size of your data set, and your scalability needs, you may want to look at Access. It's easy to use, ubiquitous, and may be solid 'enough'. If this is just for you though, you're probably better off using MATLAB since you are already familiar with it.