
Quantitative Data Analysis

Kilgore Trout Registered User regular
I'm toying with an idea for some independent research (i.e., a passion project). In order to test my hypothesis I need to do some statistical analysis; however, my stats expertise amounts to a "stats for dummies because you need to do regression to complete this graduate program" course. I have a dataset to use, but the math required to demonstrate any statistical significance is far beyond me. I'm wondering if anybody can help with the scenario and hypotheses below. I'm open to someone just throwing out a formula that I can run the dataset through, but I am also very interested in using this as an opportunity to actually learn what I am doing. Any suggestions or education are greatly appreciated.

Scenario
An oversight body is responsible for monitoring organizations for legislative compliance. Each year the oversight body reports, for each organization, how many complaints were received and how many violations were found. Due to the public nature of the complaints and reporting process, I suspect that the number of complaints received in a given year increases or decreases based on the number of violations found in previous years. I can see this anecdotally in the dataset, but I can't determine whether the pattern is statistically significant. There are five years of data available.

Hypothesis 1: If there is more than 1 complaint in a given year and no violations are found, then the number of complaints increases in subsequent years until a violation is found.
The argument being that if an organization is perceived as "getting away with" violations, more aggrieved parties will complain in the next year in order to bring the organization to account.

Hypothesis 2: If one or more violations are found in a given year, then the number of complaints decreases in subsequent years until there is 1 complaint or fewer.
The argument being that once an organization has been ruled against, it will change its policies to avoid further complaints. Alternatively, complainants will be satisfied that their grievances have been addressed through punitive action and will not file further complaints in the next year, trusting that the situation will change. (I have to think of a way to explore which is the more likely scenario.)

For context, the typical number of complaints each year ranges from 1 to 20. It is fairly common for individuals to file a frivolous complaint "to make a point," which is why I have put the threshold at "more than 1 complaint" in Hypothesis 1 and "1 or fewer complaints" in Hypothesis 2.

Blue Sky Thinking
I have no idea if this is possible or how it would work, but it would be fascinating if there were a way to develop a model to predict how long it takes, after a violation is found, for the number of complaints to drop to zero. Presumably the more complaints in the year the violation is found, the longer it would take.

Posts

  • Enc A Fool with Compassion Pronouns: He, Him, His Registered User regular
    What is your dataset? Are you able to look at all complaints or are you going to try to use a smaller subset to represent a larger population?

    I'm not an expert, but I would think a pair of one-tailed t-tests would give you what you are looking for here, assuming you can lock down your population and alpha, and have two different datasets to compare (possibly with violations and without violations).
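
    As a rough illustration of that approach, a minimal sketch in Python with scipy (all the complaint counts below are made up):

    from scipy import stats

    # Hypothetical yearly complaint counts, split by whether the
    # previous year had a violation (numbers are made up).
    after_violation = [7, 5, 2, 1, 0, 4, 1]
    after_no_violation = [2, 4, 6, 3, 5, 9, 8]

    # One-tailed two-sample t-test: are complaints significantly
    # GREATER in years following no violation?
    t_stat, p_value = stats.ttest_ind(
        after_no_violation,
        after_violation,
        alternative="greater",  # one-sided alternative hypothesis
        equal_var=False,        # Welch's t-test; don't assume equal variances
    )
    print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")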

  • Fuzzy Cumulonimbus Cloud Registered User regular
    I think first you should be testing whether you see a real trend in your data set.

    Let's simplify:

    x = year
    y = complaints

    First, establish that there is a linear trend over the five years by computing the r-squared of complaints as a function of year.
    Second, establish that the trend is significant as a function of year using a one-way ANOVA with a post-hoc linear trend test; this will tell you whether your means are significantly different from each other over time.

    Third, make a graph where you represent complaints as a function of PAST violations:

    x = past violations for a given year
    y = current complaints for that year

    Then run the same kind of statistics.
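
    As a rough illustration of the first two steps, here is a minimal Python sketch with scipy (all counts are made up; note that scipy has no built-in post-hoc linear trend test, so that contrast would need another package or a hand-rolled computation):

    from scipy import stats

    years = [1, 2, 3, 4, 5]
    complaints = [2, 4, 6, 7, 5]  # hypothetical counts for one organization

    # Step one: linear trend of complaints as a function of year;
    # r-squared is the squared correlation coefficient.
    fit = stats.linregress(years, complaints)
    print(f"slope = {fit.slope:.2f}, r^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.4f}")

    # Step two: one-way ANOVA across years, where each group holds
    # several organizations' complaint counts for that year (made up).
    year1 = [2, 3, 1]
    year2 = [4, 6, 2]
    year3 = [6, 4, 5]
    f_stat, p_value = stats.f_oneway(year1, year2, year3)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")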

    For blue sky thinking, you can make a graph where you plot the number of complaints over time (after the violation) and then fit an exponential decay, which gives you a half-life for the complaints dying off.
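
    A minimal sketch of that decay fit in Python with scipy (the post-violation counts are made up):

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical complaint counts in the years after a violation.
    t = np.array([1.0, 2.0, 3.0, 4.0])
    complaints = np.array([7.0, 5.0, 2.0, 1.0])

    def decay(t, a, k):
        # Exponential decay: a * exp(-k * t)
        return a * np.exp(-k * t)

    (a, k), _ = curve_fit(decay, t, complaints, p0=(10.0, 0.5))
    half_life = np.log(2) / k  # years for complaints to halve
    print(f"a = {a:.2f}, k = {k:.2f}, half-life = {half_life:.2f} years")

    Since an exponential never actually reaches zero, in practice you would estimate the time for the fitted curve to drop below one complaint rather than to hit zero exactly.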

    Hypotheses 1 and 2 are a little more complicated, and I'm not sure how to get at those.

    Hope this helps.

  • Akilae Registered User regular
    You want to do a survival analysis.

    Unfortunately this is out of my expertise. You'll have to do some digging on your own on how to run one and how to interpret the results.
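
    As a starting point, here is a minimal sketch of a Kaplan-Meier survival analysis in Python using the lifelines package, treating "years from a violation until complaints stop" as the survival time (all numbers are made up):

    from lifelines import KaplanMeierFitter

    # Hypothetical durations: years from a violation until complaints
    # dropped to zero. observed = False means the dataset ended before
    # complaints reached zero (a censored observation).
    durations = [4, 4, 2, 3, 5]
    observed = [True, True, False, True, False]

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.median_survival_time_)  # median years until complaints stop
    kmf.plot_survival_function()      # survival curve over time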

  • Paladin Registered User regular
    Divide the data by organization, then trim and sort it into groups where the conditions of hypotheses 1 and 2 have occurred. One organization may have multiple events, just one, or none.

    If you wanted to be simplistic, you could just do a chi-squared test on a binary outcome. A time segment either satisfies the if-clause or it doesn't, and it either satisfies the then-clause or it doesn't. That's a 2x2 table.
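
    A minimal sketch of that 2x2 chi-squared test in Python with scipy, using made-up counts of time segments:

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 table of time segments:
    # rows: if-clause satisfied / not; columns: then-clause satisfied / not.
    table = [[12, 5],
             [4, 14]]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")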

    However, since organizations may have more than one testable time segment, that violates independence. Also, each hypothesis requires a regression analysis prior to being tested, which is crazy. I don't see how you will be able to get out of having to do a mixed linear model or bootstrapping.
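
    For reference, a minimal sketch of a mixed linear model in Python with statsmodels, with a random intercept per organization to handle the non-independence (the column names and numbers are hypothetical):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per organization-year.
    df = pd.DataFrame({
        "org":        [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "t":          [1, 2, 3, 1, 2, 3, 1, 2, 3],
        "complaints": [2, 4, 6, 3, 6, 4, 1, 4, 8],
    })

    # The random intercept for each organization accounts for repeated
    # measurements within the same organization.
    model = smf.mixedlm("complaints ~ t", df, groups=df["org"])
    print(model.fit().summary())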

    Blue sky is much easier. Collect only the time segments where a violation (or violations) happens, followed by one or more years of no violations. You may want to segment the data into months instead if you don't get enough data points. Continuous predictor (# of violations) plus survival outcome = Cox proportional hazards regression.
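
    A minimal sketch of that Cox regression in Python with the lifelines package, using a hypothetical per-event table (all numbers are made up):

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "n_violations": [1, 2, 1, 3, 1],  # continuous predictor
        "duration":     [4, 6, 3, 7, 2],  # years until complaints hit zero
        "observed":     [1, 1, 0, 1, 1],  # 0 = censored (data ended first)
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="observed")
    cph.print_summary()  # hazard ratios: do more violations mean slower decay?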

    What statistical package can you get?

  • MrTLicious Registered User regular
    Your hypotheses are very specific, which is good! It means that you could subset your data and do an event study, which is pretty easy statistically.

    The problem is that you have a situation in which you can have multiple arrivals of unobserved events, so your hypotheses will bump into each other. Most obviously: Suppose you are in a downtrend in complaints after a violation, and then the organization does something else, triggering an uptick in complaints. Now you have two competing effects - the uptrend from the new event and the continued downtrend from the old event. Ignoring one while trying to estimate the other will lead you to understate the decay/ramp-up.

    If you're okay with this bias (e.g. the period between these events is long, you can be sure that future violations are completely independent from previous ones, you're okay with downward bias on your estimates, or some other mitigating factor), then it's pretty straightforward. If not, there may be a way to get around it but that's probably pretty advanced and outside my knowledge base.


    For hypothesis 1, organize your data as follows:

    Find every time that an organization shifts from 0 complaints to >= 1 complaints. This is your unit of observation. For that row of data, t = 1 should be the number of complaints in the first year, t = 2 is that in the second year, etc., until you reach a violation (whether you include that year is up to you and requires some expert judgment). Repeat this for every observation.

    You now have an unbalanced panel on which you can run the simple regression y ~ t, where y is your number of complaints and t is the number of years since the initial complaint. You can include several transformations of t to more accurately get at the functional form of the relation. Most likely you can just stick in some higher-order polynomial terms (t^2, t^3, etc.) and get a good approximation, but it really depends on the true nature of the data-generating process. You'll want to cluster your standard errors at the event level. How you do this, as well as the actual regression, depends on the software you're using. R, as suggested above, is a great choice and does this all in a fairly straightforward fashion.
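
    As a sketch of that regression in Python with statsmodels, using the long-format example table further down (a quadratic term stands in for the higher-order polynomials; with this few events the clustered standard errors are illustrative only):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Long-format panel from the hypothesis 1 example below.
    df = pd.DataFrame({
        "event":      [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
        "t":          [1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 3],
        "complaints": [2, 4, 6, 7, 3, 6, 3, 5, 9, 1, 4, 8],
    })

    # y ~ t with a quadratic term; standard errors clustered by event.
    model = smf.ols("complaints ~ t + I(t**2)", data=df)
    result = model.fit(cov_type="cluster", cov_kwds={"groups": df["event"]})
    print(result.summary())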

    Importantly, if these events/organizations are very different, you may need to scale them if you do the regression in levels (I would need to know more to give suggestions on how to do this), or use percent changes if doing the regression in changes. Whether the ideal regression is in level or changes depends on a lot of factors, but most likely you will see some action in both if either is truly causing something.

    Example:

    Original data:

    Complaints:
    Organization, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000
    1, 0, 0, 2, 4, 6, 7, 5, 2, 0, 0
    2, 0, 3, 6, 4, 1, 0, 3, 5, 9, 6
    3, 0, 0, 0, 0, 0, 0, 0, 1, 4, 8

    Violations:
    Organization, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000
    1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0
    2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0
    3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

    Re-organized Data (complaints):
    Event, t=1, 2, 3, 4
    1, 2, 4, 6, 7
    2, 3, 6, ., .
    3, 3, 5, 9, .
    4, 1, 4, 8, .

    Or alternatively (depending on your software),

    Event, t, complaints:
    1, 1, 2
    1, 2, 4
    1, 3, 6
    1, 4, 7
    2, 1, 3
    2, 2, 6
    3, 1, 3
    3, 2, 5
    3, 3, 9
    4, 1, 1
    4, 2, 4
    4, 3, 8
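
    For what it's worth, here is a minimal Python sketch of that reorganization step, rebuilding the long-format table above from the original wide data (pandas assumed):

    import pandas as pd

    # Wide-format data from the example above, one list per organization
    # covering 1991-2000.
    complaints = {
        1: [0, 0, 2, 4, 6, 7, 5, 2, 0, 0],
        2: [0, 3, 6, 4, 1, 0, 3, 5, 9, 6],
        3: [0, 0, 0, 0, 0, 0, 0, 1, 4, 8],
    }
    violations = {
        1: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        2: [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
        3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    }

    rows, event_id = [], 0
    for org, c in complaints.items():
        v = violations[org]
        i = 0
        while i < len(c):
            # An event starts where complaints shift from 0 to >= 1.
            if c[i] >= 1 and (i == 0 or c[i - 1] == 0):
                event_id += 1
                t = 1
                while i < len(c):
                    rows.append({"event": event_id, "t": t, "complaints": c[i]})
                    if v[i] == 1:  # stop at (and include) the violation year
                        break
                    i += 1
                    t += 1
            i += 1

    panel = pd.DataFrame(rows)
    print(panel)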



    For hypothesis 2, everything is the same except the way you organize your data. Here, your observations are violations.

    Every time there is a violation, start there (as t = 1). Then you'll want to continue to one of two points: either the end of your dataset, or the point when complaints go to 0. You shouldn't mix and match these; choose one, and which you choose depends on whether you frequently see complaints go back down to 0 or whether the events are frequent enough that they blend into each other. In the former case, stop the data when you get to 0. In the latter, you'll need to go to the end of the data set, which is going to hit you with a pretty big bias on the decay rate, as it will incorporate the average arrival rate of new problems.

    In the latter case, you will also end up using the same data multiple times. That is, just because a data point is used in one event doesn't mean that you need to exclude it from subsequent violation events. So if an organization has violations in 1995 and 1998, then one event might be all complaints from 1995 to data end, and another event will be all complaints from 1998 to data end, which is a subset of the previous event. Also note that here it's extremely important to make sure you have a constant term in the regression.

    Example (using same fake original data as above):

    Event, t=1, 2, 3, 4
    1, 7, 5, 2, 0
    2, 6, 4, 1, 0
    3, 9, 6, ., .

    Alternatively:

    Event, t, complaints:
    1, 1, 7
    1, 2, 5
    1, 3, 2
    1, 4, 0
    2, 1, 6
    2, 2, 4
    2, 3, 1
    2, 4, 0
    3, 1, 9
    3, 2, 6
