So, hopefully someone can verify what I've been suspecting so far.
I've got data. Lots of it. Millions of observations. And right now I'm about ass-deep in breaking the data down into subgroups. The problem I'm currently running into is distribution. These are costs (broken into two separate, mutually exclusive cost groups), so all observations are non-negative. They also range from a few cents all the way up to multi-million-dollar expenditures. The data is incredibly right-skewed, with thousands upon thousands of "small" purchases offset by a few extraordinarily "large" purchases, and it is very clearly non-normal.
At this point, I've broken the data up into a few pertinent subgroups. I ran some quick ANOVA tests to try to find reasonable groups to aggregate what data I could. What I suspect, however, is that I'm going to need to go further, or in a different direction. I've run some of my sample data through various distributions and transformations (Weibull, lognormal, exponential, Box-Cox, and Johnson, notably) and I'm still dealing with p-values < .05 on the fits. I'm considering investigating a Pareto distribution, but I've more or less ascertained at this point that I still have multiple populations muddled together. If anyone has any input (verifying my suspicions, suggesting other heavy-skew-friendly distributions, or anything else), I'd love to hear it. Perhaps breaking the expenditures up into their own populations? I can potentially aggregate some of the tiny expenditures by expense type as well. Input is more than welcome!
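For reference, here's roughly the kind of fit test I've been running, sketched in Python with scipy on simulated stand-in data (the lognormal parameters and sample size are made up, not my actual numbers):

```python
import numpy as np
from scipy import stats

# Simulated stand-in for one cost subgroup: non-negative and heavily
# right-skewed. These parameters are invented, not the real data.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=4.0, sigma=2.0, size=100_000)

# Fit a lognormal with the location pinned at zero, since costs
# can't go negative.
shape, loc, scale = stats.lognorm.fit(costs, floc=0)

# Kolmogorov-Smirnov test against the fitted distribution. Caveats:
# fitting and testing on the same data biases the p-value, and at
# large n even trivial departures push p below .05.
stat, p = stats.kstest(costs, "lognorm", args=(shape, loc, scale))
print(f"KS statistic = {stat:.4f}, p = {p:.4g}")
```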
It sounds like the question is "is the distribution of values in this group of random non-negative numbers I have similar to a standard probability distribution?", which isn't a very interesting question.
If you really are dealing with different populations, then I don't think relying on statistical tests to tell you what they are is a very good way to separate them out. Your instinct to look at different categories of expenses is a very good starting point.
I'm heavily leaning towards grouping the cost data at this point, but that will be a lengthy, difficult process for a variety of reasons having to do with the database I'm working from. The normality (or lack thereof) of the data isn't so much the end in sight; it's just the obstacle I'm currently dealing with before I can move forward with the various other things I need to do with the data. The organization of the database and defining the populations are the trickiest parts at the moment.
That rigorous stuff is great. Just looking at the data and getting a feel for it is better. Try to 'know' the answer intuitively before you 'know' it analytically.
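For heavy right skew, something as simple as a log-scale histogram goes a long way. A minimal sketch in Python, assuming matplotlib and a placeholder cost array:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; swap in a real cost column to eyeball it.
rng = np.random.default_rng(1)
costs = rng.lognormal(4.0, 2.0, 100_000)

# Logarithmic bins make a heavy right tail readable at a glance.
bins = np.logspace(np.log10(costs.min()), np.log10(costs.max()), 60)
plt.hist(costs, bins=bins)
plt.xscale("log")
plt.xlabel("cost (log scale)")
plt.ylabel("count")
plt.show()
```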
I'm considering grouping same/similar expenditures to try and smooth out the observations a bit, as generally speaking (though not always) the same expenditures are not found in amounts greater than a million or so, and I believe the "step" pattern is really throwing off the trend at this point.
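Something like this is what I have in mind for the grouping, sketched with pandas; the schema and column names here are hypothetical stand-ins for what's actually in the database:

```python
import pandas as pd

# Hypothetical schema: one row per expenditure, with an expense-type
# code. The real table would come out of the database instead.
df = pd.DataFrame({
    "expense_type": ["paint", "paint", "bearings", "paint", "bearings"],
    "cost":         [120.0,   95.0,    15000.0,    140.0,   18500.0],
})

# Roll same/similar expenditures up into one row per type, which
# smooths out the step pattern at the low end.
grouped = df.groupby("expense_type")["cost"].agg(["count", "sum", "median"])
print(grouped)
```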
What are you trying to answer?
You said you're trying to get the "ratio" of purchases? What does that mean? What is the end goal of your analysis? I really can't help you until I know what you're trying to achieve.
All of this.
You're talking about p-values, but it's not clear what the p-values you're getting are actually testing. What is your hypothesis? What are you trying to do? Are you just trying to transform the data to a normal-ish distribution so you can run tests on it?
There are plenty of non-parametric methods that work fine with data that doesn't fit a distribution well, but we'd have to know what you're actually trying to accomplish with the analysis.
What I am trying to do, for this portion, is find out whether there is an ideal level of preventive expenditure relative to corrective expenditure, such that the combined cost is minimized.
The p-values I've been looking at so far are just from goodness-of-fit tests on the distributions and transformations I've been running; I'd need a p-value greater than .05 before I could claim a distribution is a plausible fit for the data. As yet, I've not found one. I'm more than willing to start trying some non-parametric inference, however.
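As a first stab at the non-parametric route, I'm looking at rank-based tests along these lines (a sketch with scipy on simulated groups; the parameters are placeholders):

```python
import numpy as np
from scipy import stats

# Simulated subgroups standing in for two cost categories; the
# lognormal parameters are placeholders.
rng = np.random.default_rng(2)
group_a = rng.lognormal(4.0, 1.5, 5_000)
group_b = rng.lognormal(4.2, 1.5, 5_000)

# Kruskal-Wallis compares locations by rank, so the heavy skew
# doesn't violate any normality assumption.
h, p = stats.kruskal(group_a, group_b)
print(f"H = {h:.2f}, p = {p:.4g}")
```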
Hopefully this cleared things up a bit, but if not feel free to ask and I will respond as soon as I am able. I appreciate all the responses so far.
So, this would be true if you had good reason to believe your data would fit a given distribution to begin with, and you performed one test after carefully examining the population and choosing your criteria, etc. What you've actually done is throw a ton of tests at a bunch of data points that you're not even sure belong to the same population, so if you happen upon a p-value greater than .05, that doesn't mean anything; you can't make a statistical inference from it.
What you should do is throw out all the tests you've done so far, take a long look at the data before testing, and decide what's important. Do machines in certain locations differ? Machines for different purposes, machines with different maintenance contracts, warranties, etc.? Then test it. If the data still doesn't match a convenient distribution... you may just be out of luck.
How do you plan to relate the preventative and corrective expenditures anyway? You'll have to arrive at some formula that says coefCM = constant - coefPM, except not linear (because a linear trade-off would imply you should spend all your money on preventative maintenance) and probably more complicated. I'm not sure how knowing the distribution of either preventative or corrective maintenance expenditures would help you figure this out... but that might be because I've forgotten too much math.
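To illustrate what I mean about the trade-off being nonlinear: if expected corrective cost falls off with preventive spend at a diminishing rate, the total has an interior minimum. A toy sketch in Python; the functional form and constants here are completely invented:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy model only: assume expected corrective cost decays with
# preventive spend at a diminishing rate. The functional form and
# constants are invented for illustration.
def total_cost(pm_spend, base_cm=1_000_000.0, decay=1e-5):
    expected_cm = base_cm * np.exp(-decay * pm_spend)
    return pm_spend + expected_cm

res = minimize_scalar(total_cost, bounds=(0, 2_000_000), method="bounded")
print(f"PM spend minimizing total cost: ${res.x:,.0f}")
```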
Not exactly answering your original question, but don't forget that there are costs to corrective maintenance beyond the cost of the maintenance itself.
From how you're describing the data, it sounds like you need to do some "cleaning" beforehand to see if there is even a little bit of support for your hypothesis. For example, categorize each expenditure as either preventative or corrective, then see what comes out when you compare the two types of expenditures.
Unfortunately if you're working with "real-world" data, I don't think you're going to get the answers you need without even more such work, because there's no way that you can compare the two types of expenditures while controlling across all the necessary variables.
As far as modelling the correlation goes, if it gets to that point: knowing the distributions won't directly solve the preventive-vs-corrective relationship, but there are models that deal with this specific trade-off that the data can be run through at that point to see what comes out. Getting the data into a workable form is a prerequisite, however.
Serpent and Inquisitor, thank you for the heads-up. I did, however, spend quite some time segregating the cost data to the point where what I'm looking at are the appropriate cost numbers, rather than just total maintenance cost data.
First, you have millions of observations, so the benefit of normalizing the data will be minimal; at that sample size, virtually any fit test will return a significant p-value. Second, you have already told us that the data can't properly be modeled as a continuous distribution (the steps you observed). Third, you can't normalize the data anyway.
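To put a number on the first point: at n in the millions, a goodness-of-fit test will flag departures far too small to matter. A quick simulation with scipy (the 5% contamination level is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for n in (500, 1_000_000):
    # Normal data with an arbitrary 5% uniform contamination: close
    # enough to normal for most practical purposes.
    x = np.concatenate([rng.normal(0.0, 1.0, n - n // 20),
                        rng.uniform(-3.0, 3.0, n // 20)])
    stat, p = stats.kstest(x, "norm")
    print(f"n={n:>9,}: KS p = {p:.3g}")
# Typically passes at n=500 and fails spectacularly at n=1,000,000.
```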
That said, I have no idea how to solve your problem non-parametrically. Possibly because I don't understand it at all.
Understood. To be clearer: the tests I'm running (when the groups are broken up into like categories) are on a few hundred thousand observations, but your original point still stands. After playing around with the data some more this morning, I'm likewise beginning to suspect that non-parametrics are the most reasonable method of inference available to me right now. If nothing else, I can probably get some point estimates out of it and go from there. I understand the problem is a bit hard to follow, and I wish I could explain it better; between having to limit how descriptive I can be and this being only a single cog in a much larger project, it's difficult to be as forthcoming as I'd like.
For right now, however, obtaining some descriptive statistics is a good starting point.
As a sidenote I really do appreciate everyone's responses so far, and am grateful.
- Preventative maintenance is less costly over the long term than corrective maintenance.
If this is true, then working from an ideal methodology, you would have two sets of machines that are exactly the same, sitting next to each other in the exact same locations, undergoing the exact same type of workloads and conditions as each other. Then for one set you would preferentially use preventative maintenance while for the other you would use corrective maintenance. After a sufficiently long period of time (most likely based on the average expected lifetime of any given machine), you would compare the two sets of machines.
Even under this ideal scenario, the amount of statistics involved is extremely small. You could probably get away with comparing a single variable (the average expenditure of each set of machines) across the two sets and seeing whether the difference is statistically significant. You can get slightly fancier by performing more regressions, but at the end of the day it's a very, very simple comparison.
So, taking your actual data into account, you need to parse the data in such a way that you can control for things like location, model type, age, etc. in order to get a comparison of the expenditures. So instead of throwing all the numbers into a regression and seeing whether various differences are "statistically significant", what you need to do is pick precisely which variables you are controlling for, assign your populations accordingly, and then test to see if there is anything you need to control for. [This is what I meant when I said that you need to "clean" the data. You need to filter out all the noise by figuring out what factors are and aren't important - throwing random variables together into different types of regressions isn't going to cut it.]
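To make that concrete, the "pick your controls first" approach might look something like this in Python with statsmodels; every column name, category, and effect below is a hypothetical placeholder, not your schema:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Every column name, category, and effect below is a hypothetical
# placeholder standing in for the real per-record data.
rng = np.random.default_rng(4)
n = 1_000
df = pd.DataFrame({
    "cost": rng.lognormal(4.0, 1.0, n),
    "maint_type": rng.choice(["preventive", "corrective"], n),
    "location": rng.choice(["plant_a", "plant_b", "plant_c"], n),
    "model": rng.choice(["m1", "m2"], n),
    "age_years": rng.uniform(0.0, 20.0, n),
})

# Log the response to tame the skew, then control for the factors
# you chose deliberately rather than throwing everything in at once.
fit = smf.ols("np.log(cost) ~ C(maint_type) + C(location)"
              " + C(model) + age_years", data=df).fit()
print(fit.summary())
```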
You may actually get more benefit from moving away from your initial hypothesis and taking another tack at the problem. For example, take two specific, mutually exclusive types of maintenance and compare them against each other. Assuming the sample size is large enough and you feel that things like location, model, age, etc. are sufficiently controlled, those comparisons will be both easier to do and much more difficult to undermine. They also move you away from making specific assumptions or value statements about one "type" of maintenance over the other; you're just comparing two specific tasks. You also avoid a rather large obstacle: I would assume the vast majority of machines have had both types of maintenance performed on them over their lifetimes, which makes most meaningful claims about statistical differences in your data pretty hard to make.
Thank you for this; I found it very helpful. So, would you say a more proper avenue of attack would be something like: find two machines of the same type, control for age, location, and other such variables, and then compare, say, preventive maintenance typed as "Painting" on one against the cost of corrective maintenance typed as "Painting" on the other? I could also potentially compare costs on the same machine, but I figured that comparison would be confounded, given that if a machine had preventive painting costs one year, its corrective painting costs the next year would clearly be skewed.
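If it helps, the comparison I have in mind would look roughly like this (scipy again, with simulated placeholder costs standing in for the matched "Painting" records):

```python
import numpy as np
from scipy import stats

# Simulated placeholders for "Painting" costs from two matched machine
# sets (same model, location, age band): one maintained preventively,
# the other correctively. The parameters are invented.
rng = np.random.default_rng(5)
pm_painting = rng.lognormal(6.0, 0.8, 400)
cm_painting = rng.lognormal(6.5, 0.8, 400)

# Rank-based comparison, so no normality assumption is needed.
u, p = stats.mannwhitneyu(pm_painting, cm_painting, alternative="two-sided")
print(f"Mann-Whitney U = {u:.0f}, p = {p:.4g}")
```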
And yes, I wish the dataset I was working from were much more complete and thorough; however, I've been told basically "it is what it is" and that it's what we have to work with. If, in the end, I need to go back and say "here are my base observations, but with such a short time range and so many variables it's hard to pull out anything of more statistical significance than this," that is acceptable; I just need to be able to document that (and the steps I've taken) thoroughly.
I have very limited experience with statistics, but I'd think a large regression model including all of the expenditure data and the relevant dummy variables would be a good start for identifying which variables show statistically significant differences, though it sounds like you've already done something like that with the ANOVA tests on the categories you identified.
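If it's useful, that kind of variable screening can be done with an ANOVA table on a fitted regression; a sketch with statsmodels, where the columns and categories are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical columns again; C(...) handles the dummy coding.
rng = np.random.default_rng(6)
n = 1_000
df = pd.DataFrame({
    "cost": rng.lognormal(4.0, 1.0, n),
    "location": rng.choice(["plant_a", "plant_b"], n),
    "model": rng.choice(["m1", "m2", "m3"], n),
})

fit = smf.ols("np.log(cost) ~ C(location) + C(model)", data=df).fit()
# Type II ANOVA table: which factors explain significant variance?
print(anova_lm(fit, typ=2))
```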
On a bit of a sidenote, and not what you're asking for...
What I meant was the operational impact of corrective maintenance.
For example, in a coal bulk transportation system, corrective maintenance on a train set means the train is not transporting coal. This coal is not getting from the mine to the port. It is sitting on the ground at the mine and must be pushed around to avoid spontaneous combustion. In addition, ships could arrive at the port and be forced to wait for the coal to arrive which racks up demurrage charges. On the other hand, ships and coal production could be scheduled around planned maintenance and these costs can be avoided.
edit:
Very glad you've found this thread helpful... this stuff is pretty related to my job and I'm finding it useful too! Thanks for starting it up.