Inferential Statistics Questions - Not Homework!

Enc · July 2021

Long time no type, folks! I still live, and am still working at the university, but my old office was sacked during COVID and I've been bounced across four units and counting as a "fixer": I solve an office's problems and, once they're solved, get moved to the next failing office to sort things out.

The situation:
In recent weeks I've been transferred to a data unit to cover for a supervisor who took on all sorts of tasks the office wasn't equipped for and then fucked off to the private sector, leaving the office (and a large number of stakeholders) in something of a lurch. I've been able to put out most of the fires since being brought over here, but the office is mostly equipped for large-dataset analysis on the descriptive statistics side of things. We have never had to do inferential statistics as part of the basic role. Enter this final problem, about which the former supervisor said "YES, WE CAN DO THIS!" literally on the day he left, leaving us holding the bag. While all of us ~have~ done statistics in the past, the most recent of us took grad statistics something like 6 years ago, so we are super rusty. We need guidance on which tests we should be running on this data to answer the questions asked of us. We know how to run the tests (that's easy enough for us to find out), but currently we are at odds on whether ANOVA, Chi Squared, or something else would be the best tool here.

The data:
We have a set of data which is essentially as follows:

Variable 1: Pass/Fail (How many students passed a specific test). This is represented in raw counts.
Variable 2: Cohort Name: Special Cohort A, Special Cohort B, Comparison Group (Randomly Selected Student Sample who weren't in those cohorts)
Variable 3: Cohort Year (2020-21, 2019-20, 2018-19, etc., for a total of five years for each cohort)

The questions:
We have been tasked with answering the following questions:

Are test pass rates for Special Cohort A and Special Cohort B different from those of the Comparison Group?
Are test pass rates different between Cohort A and Cohort B?
"Superiority" (which we understand to mean: is one of the cohorts statistically "better" than the other?)

Failure to deliver on this isn't the end of the world, but the unit that asked for this information from the supervisor that bailed has been poorly treated in the past and we would like to support them given that this helps a special needs target population and could lead to them getting funding which would help that population a whole lot. Any advice on what direction we should be going here is greatly appreciated!

MrBlarney · July 2021

I'm a number of years out from this kind of statistics as well, so take my ramblings with some caution. I've also only done a modicum of additional research, so there's certain to be gaps in my reasoning.

My gut reaction to the scenario was that it was a case for a two-way between-subjects ANOVA. The output variable being binary (0=Fail, 1=Pass) goes against ANOVA assumptions, but the test might still work as a first pass. If you've got statistical software to run the numbers through, it should be pretty straightforward to get numbers out. You'll probably need a little bit of work to set up the contrast weights so that you are comparing the Control vs. Special Cohort A and Special Cohort B with one contrast, then Special Cohort A vs. Special Cohort B with the second. Depending on how the interaction effects shake out, you might be able to ignore the effect of cohort year and run some simpler tests by combining the data into only three groups based on cohort name.

A little bit of additional internet searching, and it appears that a Chi-square test or logistic regression are better choices than the naïve ANOVA approach. With Chi-square, if the omnibus test finds some significant overall effect, you'll still need to have set up contrast tests to differentiate between levels of cohort name or year. With logistic regression, the inputs should be a bit more straightforward, but you may need to engineer your features (including your interactions) beforehand. I don't know if this kind of broad-strokes overview will be particularly useful, but I think that's all my brain's willing to muster for the time being. I'll try to dig in a little more and paste in links that seem useful as a follow-up if that's desired.
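As a rough illustration of the Chi-square route described above, here is a minimal Python sketch using only the standard library. The pass/fail counts are invented placeholders (not the thread's data), pooled over cohort year, and the function names `chi_square` and `p_value` are made up for this example. It runs an omnibus test across the three groups, then pairwise 2x2 follow-ups with a Bonferroni adjustment, which is one common way to do the follow-up comparisons and not the only one.

```python
import math
from itertools import combinations

# HYPOTHETICAL (pass, fail) counts, pooled over the five cohort years.
# These numbers are invented purely for illustration.
counts = {
    "Special Cohort A": (80, 20),
    "Special Cohort B": (70, 30),
    "Comparison Group": (60, 40),
}


def chi_square(table):
    """Pearson chi-square statistic (no continuity correction) and its df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value(stat, df):
    """Upper-tail chi-square probability, using closed forms for df 1 and 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Omnibus test: do pass rates differ anywhere among the three groups?
omnibus_stat, omnibus_df = chi_square(list(counts.values()))
omnibus_p = p_value(omnibus_stat, omnibus_df)

# Pairwise 2x2 follow-ups, Bonferroni-adjusted for three comparisons.
pairwise = {}
for g1, g2 in combinations(counts, 2):
    stat, df = chi_square([counts[g1], counts[g2]])
    pairwise[(g1, g2)] = min(1.0, p_value(stat, df) * 3)

print(f"omnibus: chi2={omnibus_stat:.3f}, df={omnibus_df}, p={omnibus_p:.4f}")
for pair, adj_p in pairwise.items():
    print(f"{pair[0]} vs {pair[1]}: adjusted p={adj_p:.4f}")
```

Pooling over years like this ignores cohort year entirely, so it only makes sense if pass rates are reasonably stable from year to year. A statistics package would handle larger tables and any degrees of freedom.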

Arch · July 2021

I would use a Chi-Square analysis if the question you are asking is only the second one on your list. To hit all three questions I'd use binary logistic regression. This is actually a pretty standard analysis in educational datasets, and if you want an example of how to run this in R, I've actually got one handy because I'm doing something similar next month.

Check it out here.

The output of the logistic regression on your data will give you answers to these questions pretty easily by reporting whether or not there is a significant change in the odds of passing a test, and if your Variable 2 is a categorical variable (as Rank is in this example), it should give you the change in passing odds for each cohort, which can nicely answer all three questions in one test.

It can also give you interaction effects for cohorts by year if you want, but that seems not to be the questions you want answered, so I'd just leave those out of the analysis for now.
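To make the "change in passing odds for each cohort" idea concrete, here is a small Python sketch under stated assumptions. When cohort is the only predictor, the coefficients a logistic regression reports are the log odds ratios you can compute directly from the pass/fail table, so the sketch does that by hand with the standard library. The counts are invented placeholders and `odds_ratio_summary` is a hypothetical helper name. This is not the R example linked above.

```python
import math

# HYPOTHETICAL (pass, fail) counts, pooled over cohort years.
# Invented for illustration; substitute the real counts.
counts = {
    "Special Cohort A": (80, 20),
    "Special Cohort B": (70, 30),
    "Comparison Group": (60, 40),
}


def odds_ratio_summary(group, reference):
    """Odds ratio of passing (group vs. reference) with a Wald test and 95% CI.

    With a single categorical predictor, log(odds ratio) equals the logistic
    regression coefficient for the group's dummy variable.
    """
    passed_g, failed_g = group
    passed_r, failed_r = reference
    log_or = math.log((passed_g * failed_r) / (failed_g * passed_r))
    # Wald standard error of the log odds ratio.
    se = math.sqrt(1 / passed_g + 1 / failed_g + 1 / passed_r + 1 / failed_r)
    z = log_or / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return {"odds_ratio": math.exp(log_or), "z": z, "p": p, "ci95": ci}


# Questions 1 and 2/3: each special cohort vs. the comparison group,
# then Cohort A vs. Cohort B directly.
a_vs_comparison = odds_ratio_summary(counts["Special Cohort A"], counts["Comparison Group"])
b_vs_comparison = odds_ratio_summary(counts["Special Cohort B"], counts["Comparison Group"])
a_vs_b = odds_ratio_summary(counts["Special Cohort A"], counts["Special Cohort B"])

for label, result in [
    ("A vs Comparison", a_vs_comparison),
    ("B vs Comparison", b_vs_comparison),
    ("A vs B", a_vs_b),
]:
    low, high = result["ci95"]
    print(f"{label}: OR={result['odds_ratio']:.2f}, "
          f"95% CI=({low:.2f}, {high:.2f}), p={result['p']:.4f}")
```

An odds ratio above 1 means the first group has higher odds of passing than the second. Once cohort year or other control variables enter the model, this shortcut no longer applies and the model has to be fitted with statistical software.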

Enc · July 2021

Thanks all, this is super helpful!

PirateQueen · July 2021

If you want to include control variables, maybe logistic regression would also be a good option for looking at predictors of pass rates?

Good luck @Enc

I think I can empathise with your plight as I've just been assigned 3 students whose supervisor left in the middle of their data collection. It's exciting though, right? : D

Penny Arcade
