What test do I use? (statistical analysis)

Rotting Meat · February 2010

Looking for advice on what statistical test I can use to analyze a data set. I'm looking at infection status (presence/absence) of individuals over two generations, and I have 6 treatments. I'd like to know which treatments differ from one another within each generation. I'd also like to know which treatments have changed from the first generation to the second. I do not need to compare the first generation from one treatment to the second generation from another.

So what's the best way to analyze this? I'm assuming I don't just Chi-square the crap out of it by testing each pair of data points.

If it matters, I'm using the R stats package.

Thanks hugely.

Tinuz · February 2010

Answering this is nigh impossible, as we don't know how your data is divided, how many data points you have, etc.

I'd go for an ANOVA using a logistic model. Both are covered on Wikipedia, and I suggest you read up on them before doing anything. Statistical testing is by no means easy (I am a statistician, although I focus mainly on estimating probability distributions), and making the wrong assumptions usually gives you BS results.

Bliss 101 · February 2010

Are the samples grouped according to treatment, i.e. no mixed treatments? If the samples can be grouped into independent groups based on treatment, and you want to minimize false positives, I'd go with one-way ANOVA. If you do have mixed treatments, I'd do a logistic regression with infection status as the response (outcome) and the treatments as predictors.

If the data sets are independent from one another, and since you're working with binary variables (treatment/no treatment, infection/no infection), you can just chi-square the crap out of everything. Then just Bonferroni correct against multiple testing bias: set your statistical significance threshold at 0.05 / <number of tests>, assuming 0.05 would be the threshold you'd usually use. So with 12 tests, you'd accept p < 0.004 as statistically significant. This is the way people usually do it in medicine/biology papers when working with a relatively small number of tests. People in the field are just so used to chi-square that it tends to be the best way to make your results understood (and easily compared to other studies), so we use it even when it's less than optimal. Whether it's appropriate or not depends a lot on your topic, your data, and how you discuss the results, as well as where the results are to be published.
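The Bonferroni arithmetic is small enough to sketch in a few lines. This is only an illustration in Python rather than R. The function name is made up for the example, and the count of 36 tests assumes every pair of the 6 treatments is compared within each generation, plus one between-generation test per treatment, which may be more comparisons than are actually needed.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# The example from the post: 12 tests at a nominal 0.05.
print(f"{bonferroni_threshold(0.05, 12):.4f}")  # 0.0042

# Assumption: all pairwise treatment comparisons within each of the
# 2 generations, plus one generation-1 vs generation-2 test per treatment.
n_treatments = 6
n_generations = 2
n_tests = math.comb(n_treatments, 2) * n_generations + n_treatments
print(n_tests)  # 36
print(f"{bonferroni_threshold(0.05, n_tests):.4f}")  # 0.0014
```

With that many comparisons the per-test threshold gets strict quickly, which is the loss of power discussed next.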

There are better ways to correct against multiple testing, but unfortunately anyone reviewing your study isn't likely to understand them, which will cause problems when trying to publish your results. In some cases it's justifiable to just ignore multiple testing bias (because frankly you can lose all of your statistical power to detect anything when using Bonferroni correction, as it's an extremely crude and conservative method), especially if you can present a replication of the results in another population, or if presenting potentially novel discoveries is your goal and replicate studies by other groups are the norm in your particular field.
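As one example of a less conservative correction, here is a minimal sketch of Holm's step-down procedure. Holm is picked here purely for illustration and is not necessarily the method the post has in mind. The p-values are invented and the function names are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: True where the null is rejected."""
    m = len(p_values)
    # Visit the p-values from smallest to largest, remembering positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

def bonferroni_reject(p_values, alpha=0.05):
    """Plain Bonferroni: every p-value is compared against alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Made-up p-values, for illustration only.
p_values = [0.001, 0.014, 0.03, 0.2]
print(holm_reject(p_values))        # [True, True, False, False]
print(bonferroni_reject(p_values))  # [True, False, False, False]
```

Holm controls the same family-wise error rate as Bonferroni but never rejects fewer hypotheses, so it recovers some of the lost power at no extra cost in assumptions.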

It's not illegal to do both ANOVA and chi-square and present the results together in one table. It'd even make a kind of sense: with ANOVA you get the comparison between treatments you want, and with chi-square you can evaluate the effectiveness of any single treatment independently of the others.

To see which treatments change between generations, I'd just do a Chi-square to test for difference between generations within each treatment group.
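A rough sketch of that per-treatment comparison, written in plain Python so the arithmetic is visible. The counts and treatment names are invented, and the function is a hypothetical helper rather than anything from the R stats package. It computes the Pearson chi-square statistic without a continuity correction, so the numbers will differ slightly from R's chisq.test, which applies a continuity correction to 2x2 tables by default.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) on [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts per treatment:
# (gen 1 infected, gen 1 uninfected, gen 2 infected, gen 2 uninfected)
counts = {
    "treatment_A": (10, 30, 20, 20),
    "treatment_B": (15, 25, 16, 24),
}

for name, table in counts.items():
    stat, p = chi_square_2x2(*table)
    print(f"{name}: chi2 = {stat:.3f}, p = {p:.4f}")
```

Each treatment gets its own 2x2 table (generation by infection status), so this yields one test per treatment, and those tests count toward the total used for any multiple-testing correction. With small expected cell counts, an exact test would be the safer choice.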

edit: pretty much all of the above assumes that you have a non-treated control group. Otherwise the chi-square approach makes no sense and ANOVA is your friend.
