
Statistics Question

MikeManMikeMan Registered User regular
edited February 2010 in Help / Advice Forum
Calling out to all math stats types.

Can you please give me a concise picture of the difference between a population variance and sample variance and why that difference means the sample variance is computed by dividing by n-1, and not n like in the population variance?

The textbook is being unbelievably vague about it, just spouting formulas. It also goes on about degrees of freedom and bias and I am totally lost. Wikipedia is absolutely no help either.

I just want to understand conceptually why they are computed differently, considering the sample is just a subset of the population.

MikeMan on

Posts

  • cncaudatacncaudata Registered User regular
    edited February 2010
    When you compute a sample variance, what you're really trying to do is estimate the variance of the population. Because you know you aren't going to estimate perfectly, you want to be conservative in your estimate. Conservative in this context means stating the variance as larger than you otherwise would, since most conclusions you want to reach would be helped by a smaller variance. So, you divide by a smaller number than you otherwise would.

    That's the justification for it. I cannot even come close to remembering the math behind the justification 8! years after learning it.

    cncaudata on
    PSN: Broodax- battle.net: broodax#1163
  • vonPoonBurGervonPoonBurGer Registered User regular
    edited February 2010
    That n-1 is known as Bessel's correction. Now you may be thinking "Correction?? That sounds a lot like a fudge factor." You would be correct. Statistics is all voodoo math, man, it's just highly useful voodoo math with some really good justifications behind it.

    vonPoonBurGer on
    Xbox Live:vonPoon | PSN: vonPoon | Steam: vonPoonBurGer
  • SeptusSeptus Registered User regular
    edited February 2010
    I'm going to guess that the "sample size of 30" rule of thumb, the approximate minimum for a sample mean to be roughly normally distributed, is yet another example of said voodoo.

    Septus on
    PSN: Kurahoshi1
  • ClipseClipse Registered User regular
    edited February 2010
    The best informal and fairly intuitive explanation for Bessel's correction is mentioned briefly by the wikipedia article: when you use a sample mean to compute a sample variance you are effectively removing one degree of freedom. When computing population variance, your "sample" is the entire population, and so your "sample" mean is in fact the actual mean of the distribution. It's easy to see, modifying the calculation on the wikipedia article, that the n/(n-1) fudge factor is not necessary if you can use the actual mean of the distribution rather than a sample mean.
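Clipse's "one degree of freedom" point is easy to check numerically. Here's a quick simulation sketch (the distribution, sample size, and trial count are illustrative choices, not from the thread): dividing by n is fine when you center on the true mean, but comes out low by exactly (n-1)/n when you center on the sample mean.

```python
# Average (1/n) * sum((x - center)^2) over many trials, centering once on
# the true mean and once on the sample mean, to see the bias appear.
import random

random.seed(0)
n, trials = 5, 200_000
avg_true_center = 0.0
avg_sample_center = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # true mean 0, true variance 1
    xbar = sum(xs) / n
    avg_true_center += sum((x - 0.0) ** 2 for x in xs) / n
    avg_sample_center += sum((x - xbar) ** 2 for x in xs) / n
avg_true_center /= trials    # lands near 1.0, the true variance
avg_sample_center /= trials  # lands near 0.8 = (n-1)/n: one degree of freedom lost
print(avg_true_center, avg_sample_center)
```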

    Clipse on
  • CrystalMethodistCrystalMethodist Registered User regular
    edited February 2010
    Wow, that's really ghetto, I didn't know that myself.

    Basically, it's an unbiased estimator (the corrected estimate's expected value equals the true variance, for any sample size) and the correction is slightly larger *depending on n* (which is a related point).

    So if n=5, you multiply by 5/4 = 1.25, since you don't have a lot of samples. If n=1000, you multiply by 1000/999 ≈ 1.001, which is much smaller. So it's a function of n that decreases non-linearly as you get more data. Obviously as n goes to infinity, n/(n-1) approaches 1, so the correction vanishes; for large samples even the uncorrected version is barely biased at all.

    Basically: it has all of the properties that a fudge factor should have, and it works decently in practice. Voodoo math but with a little reasoning sprinkled on top to taste.
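The factor itself is cheap to tabulate. A one-liner (illustrative, not from the post) showing how n/(n-1) shrinks toward 1 as the sample grows:

```python
# Bessel correction factor n/(n-1) for a few sample sizes.
factors = {n: n / (n - 1) for n in (5, 30, 1000)}
print(factors)  # 5 -> 1.25, 30 -> ~1.034, 1000 -> ~1.001
```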

    CrystalMethodist on
  • GoodOmensGoodOmens Registered User regular
    edited February 2010
    The basic idea is this: within the population there will be a few examples of extreme values. For example, if you're talking about height, there are a small number of very tall people, and a small number of very short people. These extreme values tend to increase the standard deviation and variance, because they are far from the norm.

    OK, so take a sample from the population. Grab 25 people at random. Chances are very good that you won't get a super-tall or super-short person, because they're rare. So your sample will have values closer to the average, and the sample variance will come out artificially small. It won't match the population variance, which is what you're actually trying to estimate. To account for that, you divide by n-1: dividing by a smaller number inflates the result, nudging it back up toward the population value.

    There are more complex methods to account for that difference, but the n-1 works well for most situations.

    GoodOmens on
    steam_sig.png
    IOS Game Center ID: Isotope-X
  • musanmanmusanman Registered User regular
    edited February 2010
    It just works.

    musanman on
    sic2sig.jpg
  • SavantSavant Simply Barbaric Registered User regular
    edited February 2010
    vonPoonBurGer wrote: »
    That n-1 is known as Bessel's correction. Now you may be thinking "Correction?? That sounds a lot like a fudge factor." You would be correct. Statistics is all voodoo math, man, it's just highly useful voodoo math with some really good justifications behind it.

    This page goes into the meat of the matter.

    One reason for "why n-1?" is that you are using the sample mean, xbar = (x_1 + x_2 + ... + x_n) / n, to compute the sample variance, instead of the true mean. Bessel's correction just makes it so the sample variance is an unbiased estimator of the true variance, which means the expected value of the estimator is equal to the true value that is being estimated.

    Here is another derivation on wikipedia showing that the corrected sample variance is an unbiased estimator:

    E[ (1/(n-1)) * sum_i (x_i - xbar)^2 ]
      = (1/(n-1)) * E[ sum_i (x_i - mu)^2 - n*(xbar - mu)^2 ]
      = (1/(n-1)) * ( n*sigma^2 - n*(sigma^2 / n) )
      = (1/(n-1)) * (n - 1) * sigma^2
      = sigma^2

    (where mu and sigma^2 are the true mean and variance, and E[(xbar - mu)^2] = sigma^2 / n is the variance of the sample mean)

    Note how the variance of the sample mean has to be taken into account when computing the overall expected value of the estimator.
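For what it's worth, Python's standard library exposes both conventions, which makes the distinction concrete (this example is an editor's sketch, not from the thread):

```python
# statistics.pvariance divides by n (population variance);
# statistics.variance divides by n-1 (Bessel-corrected sample variance).
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, squared deviations sum to 32
print(statistics.pvariance(data))  # 32/8 = 4.0
print(statistics.variance(data))   # 32/7 ≈ 4.571
```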

    Savant on
  • ApexMirageApexMirage Registered User regular
    edited February 2010
    MikeMan wrote: »
    It also goes on about degrees of freedom and bias and I am totally lost.

    Degrees of freedom also come up when your sample size is under about 30. You then use the t distribution (with n-1 degrees of freedom) instead of the Z table, and the t table is considerably smaller and easier to use.
    I don't know the actual explanation behind it, but from the other replies I'd deduce it's to further compensate for the possible error.

    ApexMirage on
    I'd love to be the one disappoint you when I don't fall down
  • MikeManMikeMan Registered User regular
    edited February 2010
    Hmm. These explanations seem to be making sense. Thanks all.

    MikeMan on