So I want to do some statistical analysis on some data I have and see if there's a correlation. Specifically, I'm looking at frequency of infant sleep-related deaths during various times of the year, but let's pretend it's something happy, like how often angels get their wings, or whatever.
Statistical analysis is not normally a part of my job. Moreover, I'm not doing peer-reviewed stuff, so I don't need to be super-rigorous. It's more a matter of helping folks determine how to devote resources, so finding clear trends is useful.
Now, I read over this site. It all makes sense and I get where everything is coming from. I want to try to find an alpha level for the numbers I have, but I'm having a few troubles. But first of all:
A) I know that the accepted way of doing this sort of thing is to develop a hypothesis and then collect and crunch your data. I have looked at a bunch of data and formed a hypothesis based on that. Since I'm not publishing this stuff in a journal, I expect I can still glean some useful information from these numbers, even though I'm kinda doing it backwards. Is this reasonable?
Now, the data I have consists of a bunch of angels and... okay, fuck it, I'm not using the euphemisms. I have a bunch of deaths and the dates on which they occurred. I have broken the calendar year up into chunks, so I basically have a single row of numbers indicating how many deaths occurred during each period. The example on the website, though, involves two rows and two columns, and requires you to calculate a DoF based on the dimensions of your data table. The DoF is 0 if you have only one row, which seems problematic. What I tried doing, then, is structuring my data as follows:
Did Child Die?   Period1   Period2   Period3   ...
Yes              (data)    (data)    (data)    ...
No               (data)    (data)    (data)    ...
For the "No" row, I use the total infant population. This allows me to calculate a chi-square value and also a DoF value. There are probably various ways to represent "No", but because the survivor counts are so large, their contributions to the chi-square statistic are going to be tiny compared to the "Yes" row any way you slice it, so that bit seems a little less important in terms of just getting an idea of how much statistical significance I'm seeing here. So:
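For what it's worth, that two-row layout is exactly what a contingency-table test expects, and SciPy will do the whole calculation. This is just a sketch with invented counts (the real data isn't shown in the thread), assuming four periods and a population of 5,000 infants per period:

```python
# Hypothetical counts -- NOT the real data from the thread.
from scipy.stats import chi2_contingency

deaths   = [12, 18, 9, 25]           # "Yes" row: deaths in each period
survived = [4988, 4982, 4991, 4975]  # "No" row: rest of the infant population

# chi2_contingency computes expected counts from the row/column totals,
# the chi-square statistic, and the DoF = (rows-1) * (cols-1) = 1 * 3 = 3.
chi2, p, dof, expected = chi2_contingency([deaths, survived])
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
```

Running this on the made-up counts also shows the point about the "No" row: almost the entire chi-square statistic comes from the "Yes" cells, because the survivor counts barely deviate from their expected values in relative terms.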
Is this at all a viable means of doing this? Is there a better way to generate another row of data? Should I be calculating something other than a p-value for this stuff?
(Doing all this, btw, I wind up with a p<0.01, which is a good number. I would like to make sure it is a number that is actually representative of something.)
So I guess the question is: "Is there a correlation between mortality rate and time of year?"
The way of doing it is just to find a 95% confidence interval on those rates (12 of them if binned by month). The Analysis ToolPak in Excel will do that under Descriptive Statistics. So if you had 21, 21, 22, 23, 30 for numbers, the confidence interval would be 23.4 +/- 4.7.
There is a more rigorous method to calculate the confidence interval in the case of discrete count data, but I'm giving you the quick answer.
edit: Jam Warrior is correct if you don't have the entire population and are comparing a hospital to a baseline for the entire country.
http://graphpad.com/quickcalcs/chisquared1.cfm
Time periods in 'category', actual numbers of deaths in 'observed', overall average number of deaths per time period in 'expected' (i.e. the same in every box). Bob's your uncle.
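That calculator is running a chi-square goodness-of-fit test, which you can also do locally. A sketch with invented counts (again, not the real data), assuming SciPy and a uniform expectation across periods:

```python
# Same test the GraphPad calculator performs: observed deaths per period
# against a flat expectation (overall average in every box).
# Counts are made up for illustration.
from scipy.stats import chisquare

observed = [12, 18, 9, 25]                      # deaths per period
expected = [sum(observed) / len(observed)] * 4  # 16 in every box
chi2, p = chisquare(observed, f_exp=expected)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```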
I also feel happy that the results it gave me were the same as the answers I came up with using my making-shit-up method.
Thanks!
I see the actual math has been covered. Jam Warrior handled that pretty well.
I just wanted to point out that looking at existing data and then forming a hypothesis is fine - this is what is usually called a 'natural experiment' in research methods classes. Entire fields of study are based around it: sociology and epidemiology basically wouldn't exist without natural experiments.
At worst a natural experiment leaves you more open to confounding factors and alternative causal explanations than a laboratory experiment. Some people (usually either academics with a lot of disciplinary chauvinism or laypeople with barely enough knowledge to be dangerous) will reject natural experiments (and the entire body of research in fields where natural experiments are standard practice) with a glib "correlation isn't causation," but that's taking a genuine reason for skepticism and exaggerating it into willful ignorance.
No single study (whether it's a lab experiment or a natural experiment or a case study) proves anything on its own, but a well-designed natural experiment can add a lot to a body of knowledge.
I'm basically in complete agreement with you Feral, but I think it should be emphasised that when performing a natural experiment you have to be especially careful with how you treat the data before running it through the statistical test of choice. So in ElJeffe's case, how he buckets the data into time periods is important: he needs to make sure he hasn't chosen the buckets to match up to the spikes in the data, and that the buckets chosen are independent of the data.
(Of course, the other side of this is that if you choose, say, monthly buckets and there's an event that spans the end of one month and the beginning of the next, then you could easily miss its significance because your boundaries have split it across two buckets.)
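That boundary effect is easy to demonstrate on toy data. In this sketch (dates are invented), a ten-day cluster straddling the Jan/Feb boundary gets split evenly by calendar-month buckets, but lands entirely in one bucket if the edges are shifted to mid-month:

```python
# Toy illustration of the bucket-boundary problem. All dates invented.
from datetime import date, timedelta
from collections import Counter

# A 10-day cluster of events: Jan 27 through Feb 5.
cluster = [date(2013, 1, 27) + timedelta(days=i) for i in range(10)]

# Calendar-month buckets split the cluster 5/5, diluting the spike.
by_month = Counter(d.month for d in cluster)
print(dict(by_month))     # {1: 5, 2: 5}

# Shifting bucket edges to the 15th puts the whole cluster in one bucket.
by_shifted = Counter((d - date(2013, 1, 15)).days // 30 for d in cluster)
print(dict(by_shifted))   # {0: 10}
```

The point isn't that shifted buckets are better, just that the apparent size of a spike can depend on where the edges fall, which is why the bucketing scheme should be fixed independently of the data.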
I ended up finding some interesting trends in other areas that I completely didn't expect, so I'm reasonably confident that I wasn't effectively designing an experiment to demonstrate a particular phenomenon.
It's a fairly informal process, but the point I'm trying to communicate is more, "Hey, it looks like something's going on here, maybe we should consider that." Not so much: "I HAZ THEORY!"
As an aside, I work in HR consulting with sensitive data all the time; just want to make sure that you're covered from a legal/professional standpoint to be crunching numbers on this type of information. Things can be very strict and you may be opening yourself (and your employer) to significant liability if you are looking at personal health information or not handling it properly. There are even legal ramifications on simple technical details such as if you store it on a local hard drive vs. a remote access server.
If you can get something in writing that says you are approved to look at and manipulate the data in question, that should help you CYA in case someone isn't doing their due diligence properly at your company. At the very least, they can't pin everything on you (i.e., they'd have to fire your boss and you...) if you're in violation of some laws, etc.
But with those, we would state that X and Y were highly correlated and the other factors were removed due to statistical insignificance. Then give some general details about the data set.
I remember one data set that was a big city in the summer. Temperature, food costs, and a number of other factors were thrown out. The number of murders in the city was highly correlated with the number of dentists.