
Help Me Do Statistics!

ElJeffe Registered User, ClubPA regular
edited October 2012 in Help / Advice Forum
So I want to do some statistical analysis on some data I have and see if there's a correlation. Specifically, I'm looking at frequency of infant sleep-related deaths during various times of the year, but let's pretend it's something happy, like how often angels get their wings, or whatever.

Statistical analysis is not normally a part of my job. Moreover, I'm not doing peer-reviewed stuff, so I don't need to be super-rigorous. It's more a matter of helping folks determine how to devote resources, so finding clear trends is useful.

Now, I read over this site. It all makes sense and I get where everything is coming from. I want to try to find an alpha level for the numbers I have, but I'm having a few troubles. But first of all:

A) I know that the accepted way of doing this sort of thing is to develop a hypothesis and then collect and crunch your data. I have looked at a bunch of data and formed a hypothesis based on that. Since I'm not publishing this stuff in a journal, I expect I can still glean some useful information from these numbers, even though I'm kinda doing it backwards. Is this reasonable?

Now, the data I have consists of a bunch of angels and... okay, fuck it, I'm not using the euphemisms. I have a bunch of deaths and the dates on which they occurred. I have broken the calendar year up into chunks. So I basically have a single row of numbers indicating how many deaths occurred during each period. The example on the website, though, involves two rows and two columns, and requires you to calculate a DoF based on the dimensions of your data table. The DoF is 0 if you have only one row. This seems problematic. What I tried doing, then, is structuring my data as follows:
Did Child Die?    Period1    Period2    Period3    ...
Yes               (data)     (data)     (data)     ...
No                (data)     (data)     (data)     ...

For the "No" row, I use the total infant population. This allows me to calculate a ChiSquare value and also a DoF value. There are probably various ways to represent "No", but the corresponding values in the chi-square table are going to be tiny compared to values corresponding with "Yes" any way you slice it, so it seems that bit is a little less important in terms of just getting an idea of how much statistical significance I'm seeing here. So:

B) Is this at all a viable means of doing this? Is there a better way to generate another row of data? Should I be calculating something other than a p-value for this stuff?

(Doing all this, btw, I wind up with a p<0.01, which is a good number. I would like to make sure it is a number that is actually representative of something.)
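
To see why the two-row table lands so close to a plain single row of counts, here is a minimal Python sketch. The death counts and the per-period population are made up for illustration (they are not the data described in this post), and the sketch assumes the same infant population in every period.

```python
# Made-up numbers for illustration only; not the data discussed in this thread.
deaths = [10, 20, 30]            # hypothetical "Yes" row: deaths per period
population = 1_000_000           # hypothetical infants at risk in each period (assumed equal)


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# One-row version: every period is expected to hold the average count.
mean_deaths = sum(deaths) / len(deaths)
one_row = chi_square(deaths, [mean_deaths] * len(deaths))

# Two-row version: add a "No" row of survivors. With equal populations the
# expected "Yes" count is still the average, and the expected "No" count is
# the population minus that average.
survivors = [population - d for d in deaths]
two_row = one_row + chi_square(survivors, [population - mean_deaths] * len(deaths))

print(f"one-row statistic: {one_row:.4f}")
print(f"two-row statistic: {two_row:.4f}")
```

The "No" row contributes almost nothing because its expected counts are enormous, and both layouts end up with one fewer degree of freedom than there are periods, so the two statistics are compared against the same distribution.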

I submitted an entry to Lego Ideas, and if 10,000 people support me, it'll be turned into an actual Lego set! If you'd like to see and support my submission, follow this link.

Posts

  • Fuzzy Cumulonimbus Cloud Registered User regular
    What are you trying to correlate? Or, in other words, what question do you want to ask? Do you want to know if there is a relationship between mortality rate and day or something like that? You can't calculate Chi-square or any stats without a good data set.

  • ElJeffe Registered User, ClubPA regular
    Looking at the data, it looks like there are spikes at certain times of the year. I would like to find a way to determine how likely it is that there is actually something going on there and that it's not just random noise resulting from a small data set. (FWIW, the number of data points is a few hundred-ish.)

    So I guess the question is: "Is there a correlation between mortality rate and time of year."

  • Jam Warrior Registered User regular
    edited October 2012
    If the total infant population is a constant, then don't add that extra row, no. You want to be looking up a 'One Sample Chi Square test', also called 'goodness of fit', where you test your observed data against the expectation of an equal number of deaths in each period.

  • Kipling Registered User regular
    edited October 2012
    You seem to be asking if time period one is equivalent to time period 2. The equivalent for a death rate is defects per million opportunities (DPMO). You can define your opportunity (1 opportunity to die over a time period), and the defect = death. The health care version is usually deaths/1000 births, so you should be well into the range where the below method is decent but not perfect.

    The way to do it is just to find a 95% confidence interval on those rates (12 if binned by month). The Analysis ToolPak in Excel will do that in Descriptive Statistics. So if you had 21, 21, 22, 23, 30 for numbers, the confidence interval would be 23.4 +- 4.7.
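
For anyone without Excel handy, the interval quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the interval is the usual t-based one (mean plus or minus the critical t value times the standard error); the critical value is hard-coded from a standard t table rather than computed.

```python
import math
import statistics

counts = [21, 21, 22, 23, 30]    # the example numbers from the post above

mean = statistics.mean(counts)
std_dev = statistics.stdev(counts)           # sample standard deviation (n - 1)
std_err = std_dev / math.sqrt(len(counts))

# Two-sided 95% critical value of Student's t with n - 1 = 4 degrees of
# freedom, copied from a standard t table (hard-coded assumption).
T_CRIT_95_DF4 = 2.776

half_width = T_CRIT_95_DF4 * std_err
print(f"{mean:.1f} +- {half_width:.1f}")     # prints 23.4 +- 4.7
```

With a different number of bins the critical value changes, so it would need to be looked up again for the matching degrees of freedom.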

    There is a more rigorous method to calculate the confidence interval in the case of discrete defects, but I'm giving you the quick answer.

    edit: Jam Warrior is correct if you don't have the entire population and are comparing a hospital to a baseline for the entire country.

    3DS Friends: 1693-1781-7023
  • Jam Warrior Registered User regular
    edited October 2012
    Also a 1xN table is an exception from the old (rows-1)x(columns-1) equation and still has N-1 degrees of freedom.
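
Written out for a single row of N observed counts O_i tested against equal expected counts E_i, the statistic and its degrees of freedom look like this (standard textbook notation, not taken from the thread):

```latex
\chi^2 = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = \frac{1}{N} \sum_{j=1}^{N} O_j,
\qquad \mathrm{df} = N - 1
```

The one lost degree of freedom comes from forcing the expected counts to add up to the observed total.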

  • Jam Warrior Registered User regular
    edited October 2012
    Bang your numbers in this webpage and it should do it all for you!

    http://graphpad.com/quickcalcs/chisquared1.cfm

    Time periods in 'category', actual numbers of deaths in 'observed', overall average number of deaths per time period in 'expected' (i.e. the same in every box). Bob's your uncle.
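
The same calculation can be sketched offline in Python using only the standard library. This is a rough illustration, not the calculator's actual code: the counts are made up, and the p-value helper is a simple series approximation that a statistics library would do better.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square distribution.

    Uses the power series for the regularized lower incomplete gamma
    function; fine for a sketch, but it loses precision for very small
    p-values, so use a statistics library for real work.
    """
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def goodness_of_fit(observed):
    """Test observed counts against an equal expected count per category."""
    expected = sum(observed) / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1
    return statistic, df, chi_square_sf(statistic, df)


# Made-up counts for three periods; not the data discussed in this thread.
statistic, df, p_value = goodness_of_fit([10, 20, 30])
print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here only says the counts are unlikely under "every period is the same"; it does not say which period is responsible.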

  • ElJeffe Registered User, ClubPA regular
    Oh man, that web page is the best thing ever.

    I also feel happy that the results it gave me were the same as the answers I came up with using my making-shit-up method.

    Thanks!

  • Feral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉ Registered User regular
    edited October 2012
    ElJeffe wrote: »
    A) I know that the accepted way of doing this sort of thing is to develop a hypothesis and then collect and crunch your data. I have looked at a bunch of data and formed a hypothesis based on that. Since I'm not publishing this stuff in a journal, I expect I can still glean some useful information from these numbers, even though I'm kinda doing it backwards. Is this reasonable?

    I see the actual math has been covered. Jam Warrior handled that pretty well.

    I just wanted to point out that looking at existing data and then forming a hypothesis is fine - this is what is usually called a 'natural experiment' in research methods classes. Entire fields of study are based around it: sociology and epidemiology basically wouldn't exist without natural experiments.

    At worst a natural experiment leaves you more open to confounding factors and alternative causal explanations than a laboratory experiment. Some people (usually either academics with a lot of disciplinary chauvinism or laypeople with barely enough knowledge to be dangerous) will reject natural experiments (and the entire body of research in fields where natural experiments are standard practice) with a glib "correlation isn't causation," but that's taking a genuine reason for skepticism and exaggerating it into willful ignorance.

    No single study (whether it's a lab experiment or a natural experiment or a case study) proves anything on its own, but a well-designed natural experiment can add a lot to a body of knowledge.

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • Alistair Hutton Dr Edinburgh Registered User regular
    Feral wrote: »
    At worst a natural experiment leaves you more open to confounding factors and alternative causal explanations than a laboratory experiment. Some people (usually either academics with a lot of disciplinary chauvinism or laypeople with barely enough knowledge to be dangerous) will reject natural experiments (and the entire body of research in fields where natural experiments are standard practice) with a glib "correlation isn't causation," but that's taking a genuine reason for skepticism and exaggerating it into willful ignorance.

    I'm basically in complete agreement with you, Feral, but I think it should be emphasised that when performing a natural experiment you have to be especially careful with how you treat the data before running it through the statistical test of choice. So in ElJeffe's case, how he buckets the data into time periods is important: he needs to make sure he hasn't chosen the buckets to match up to the spikes in the data and that the buckets chosen are independent of the data.

    (Of course, the other side of this is that if you choose, say, monthly buckets and there's an event that spans the end of one month and the beginning of the next, then you could easily miss its significance because your boundaries have split it across two buckets.)

    I have a thoughtful and infrequently updated blog about games http://whatithinkaboutwhenithinkaboutgames.wordpress.com/

    I made a game, it has penguins in it. It's pay what you like on Gumroad.

    Currently Ebaying Nothing at all but I might do in the future.
  • ElJeffe Registered User, ClubPA regular
    My "buckets" consisted of half-month intervals, with the numbers normalized for the varying numbers of days. (Since the second half of January, for example, has more days than the second half of February.) I settled on the periods after noticing something interesting around a particular time of year, but I basically just came up with half-months because it looked like there might be something going on that full-months wouldn't show on account of insufficiently fine resolution.
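
One alternative to rescaling the counts is to leave them raw and scale the expected values by bucket length instead, since the chi-square test is built for raw counts. A minimal Python sketch, assuming each month is split after the 15th (the post does not say where the split falls) and using a hypothetical total of 300:

```python
import calendar


def half_month_days(year):
    """Days in each half-month bucket, splitting every month after the 15th.

    The split point is an assumption made for this sketch.
    """
    days = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        days.extend([15, days_in_month - 15])
    return days


def expected_counts(total, bucket_days):
    """Spread a total across buckets in proportion to their length in days."""
    all_days = sum(bucket_days)
    return [total * d / all_days for d in bucket_days]


bucket_days = half_month_days(2011)           # any non-leap year works here
expected = expected_counts(300, bucket_days)  # 300 is a hypothetical total
print(bucket_days[:4])                        # January and February halves
print([round(e, 2) for e in expected[:4]])
```

These per-bucket expected values can stand in for a single flat average when the buckets are not all the same length.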

    I ended up finding some interesting trends in other areas that I completely didn't expect, so I'm reasonably confident that I wasn't effectively designing an experiment to demonstrate a particular phenomenon.

    It's a fairly informal process, but the point I'm trying to communicate is more, "Hey, it looks like something's going on here, maybe we should consider that." Not so much: "I HAZ THEORY!"

  • Inquisitor77 2 x Penny Arcade Fight Club Champion A fixed point in space and time Registered User regular
    It sounds like you're fine. At least now you have some empirical measurements to promote further investigation, which is always a good thing. As long as people aren't intending to misuse or misinterpret the data, it's all good.

    As an aside, I work in HR consulting with sensitive data all the time; just want to make sure that you're covered from a legal/professional standpoint to be crunching numbers on this type of information. Things can be very strict and you may be opening yourself (and your employer) to significant liability if you are looking at personal health information or not handling it properly. There are even legal ramifications on simple technical details such as if you store it on a local hard drive vs. a remote access server.

    If you can get something in writing that says you are approved to look at and manipulate the data in question, that should help you CYA in case someone isn't doing their due diligence properly at your company. At the very least, they can't pin everything on you (i.e., they'd have to fire your boss and you...) if you're in violation of some laws, etc.

  • ElJeffe Registered User, ClubPA regular
    There are laws and contracts to abide by, and I am doing so. A lot of the numbers I use are publicly available, though (including everything I'm working with for this side project). Any random shlub could order the stuff I'm looking at and do the same analysis. It just wouldn't be free.

  • Inquisitor77 2 x Penny Arcade Fight Club Champion A fixed point in space and time Registered User regular
    OK, sounds like you're covered. Just FYI, even publicly available data can have some complicated privacy laws attached to it - for instance, if you're aggregating information from different sources into one "master" database, that could run afoul of some laws in some countries. If you're in the U.S., you should be fine, though. Only Europe and other backwards countries protect people's privacy. O_o

  • grouch993 Both a man and a number Registered User regular
    We used Minitab and SPSS for statistical regression. Haven't kept up with the field in a while, over 30 years, so not sure what the latest and most useful is these days.

    But with those, we would state that X and Y were highly correlated and the other factors were removed due to statistical insignificance. Then give some general details about the data set.

    I remember one data set from a big city in the summer. Temperature, food costs, and a number of other factors were thrown out. The number of murders in the city was highly correlated with the number of dentists.

    Steam Profile Origin grouchiy