Talka
Registered User regular

I remember the textbook definition of correlation coefficients from college: it's a measure of the covariance between two variables, telling you essentially how closely those variables move hand-in-hand.

What I've always wondered is this: is there an easy way of describing a correlation coefficient with real-life examples?

Let's say you have a dataset and see two binary variables with a correlation coefficient of 0.6. Can you say that changing ten of one of the variables will change six of the other variables?

My first reaction is to say no, you can't infer causation from correlation. But what if you really knew the relationship was causal? Then is this what a correlation coefficient of 0.6 means? Change ten of one, you change six of the other?

I get the feeling this is a misinterpretation of what the correlation coefficient is, but I'm not sure. If it's wrong, can anyone give a layperson's example of what the correlation coefficient means? To the extent that it's possible, I'm interested in what it means for what to expect when you start changing the data set after measuring the correlation.

## Posts

Example set 1:

Population size vs. number of hospitals is going to have a very high positive correlation (near 1). As your population increases, you need to build more hospitals to take care of them. Population size vs. empty houses will have a very high negative correlation. As your population increases, people need places to live, so the number of empty houses decreases. Population size vs. number of sunny days will have a correlation near 0, because there is no relationship between these two variables.

Example set 2:

Population size vs. number of hospitals is going to have a very high correlation, because increasing your population causes you to build more hospitals to take care of them. Correlation implies causation in this case. Number of car thefts vs. number of hospitals will have a very high correlation, because both are correlated to the same underlying variable (population size). The correlation is due to causation from a common variable, but there is no causal link between those two variables. Number of ships sailing through the Suez Canal vs. number of hospitals is going to have a very high correlation (maybe), but there is no causation whatsoever there. It's a complete coincidence.

The value of the correlation means how much the variables vary together. A value of 1 means they vary exactly together by the same amount: if you plot one vs. the other it will be a straight line. 0 means they do not vary together at all: if you plot one against the other it will be a circular cloud. Anything in between is a kind of elliptical cloud, that gets wider and more circular as you get closer to 0 and narrower and more linear as you get closer to 1.
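To make the "vary together" idea concrete, here's a quick NumPy sketch (my own, not from the thread). It builds two variables with a chosen correlation and checks that the sample coefficient recovers it; the target of 0.6 and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.6

x = rng.standard_normal(n)
# Mixing in independent noise with weight sqrt(1 - rho^2) yields a
# variable whose correlation with x is approximately rho.
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # close to 0.6
```

Scatter-plot x against y and you get exactly the elliptical cloud described above: tighter as rho approaches 1, rounder as it approaches 0.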

Same thing going from -1 to 0, but with the line in the opposite orientation.

Here's an example I just thought of that may be more clear than my original post:

Let's say you buy a huge barrel of apples, and you measure a) whether they're red and b) whether they're delicious. The correlation coefficient for these two variables is 0.6.

Then let's say you replace ten non-red apples with ten red apples. Did you just add six delicious apples to your barrel?

I'm thinking not necessarily, but want to be sure I'm understanding this correctly.

Not necessarily. As I pointed out, there are three cases:

1. Red apples are more delicious. In that case, and if the correlation were 1, then yes, replacing non-red apples with red apples would increase the number of delicious apples. With a correlation of 0.6, you're helping increase the deliciousness, but not at a 1-to-1 ratio like you described.

2. The red apples you sampled initially might be of a different and more delicious species (i.e. deliciousness and redness are related to species). In that case, changing the non-red-apples to red apples will only help if you are actually replacing them with apples of the correct species.

3. Colour has nothing to do with deliciousness, and the relationship you found in your original sample is a complete coincidence. In this case, replacing non-red apples with red apples won't do anything.

I wasn't describing a 1-to-1 ratio though, right? Wasn't I describing a 0.6-to-1 ratio? Is that any more accurate?

This image should help you understand better: it shows the relationship between two variables with correlations from -1 to 1, at 0.1 intervals.

Say that the X axis is redness and the Y axis is deliciousness. Looking at your example at 0.6, it's rather like a diagonally oriented cloud. Increasing the redness (moving to the right on the axis) doesn't automatically lead to an increase in deliciousness. But it leads you to a part of the cloud where you're higher on the Y axis (more delicious) on average.
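That "higher on average, not always" reading can be checked directly with a sketch (mine, not from the thread). It simulates a cloud with r = 0.6 and compares the average "deliciousness" on the left and right halves of the "redness" axis:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 200_000, 0.6
red = rng.standard_normal(n)
tasty = rho * red + np.sqrt(1 - rho**2) * rng.standard_normal(n)

left = tasty[red < 0].mean()   # average deliciousness, left half of cloud
right = tasty[red > 0].mean()  # average deliciousness, right half of cloud
print(left < right)            # True: higher on average, not guaranteed
```

Individual points on the right can still sit low on the Y axis; only the slice averages are ordered.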

Richy gave a good explanation, but I'll try to give another one using your example.

Like Richy said, you have to be careful when talking about "replacing" apples: you would have to replace the apples with the EXACT SAME STOCK of red apples at the EXACT SAME RATIOS as you did before (or at least within a certain acceptable error), and that might be difficult. The best way to do this would be to replace the apples with a *random sample* of red apples, but again this would be difficult at only ten apples at a time as you'd be likely to get sampling error. (If you were to replace 10,000 apples, sampling error would be much less of a problem!)

This is one of the basics of statistics/econometrics: you have to be very careful when discussing correlation vs. causation. Generally speaking, the only time correlation can imply causality is when you have a properly randomized experiment, and even then it can sometimes be iffy.

Anyway...

The thing about the correlation coefficient is that it only measures DIRECTION/SIGN (i.e. positive or negative) and RELATIVE PROBABILITY. Refer to the image that Richy posted earlier:

Looking at each of these graphs, pretend that the Y Axis measures "red delicious apples" and that the X Axis measures "red apples." A correlation coefficient of 0.6 would tell you two things: 1) if you were to draw a line of best fit, it would have a positive slope 2) the dots are semi-clustered around that line. Importantly, IT WOULD NOT TELL YOU THE MAGNITUDE OF THE SLOPE OF THE LINE OF BEST FIT.

In layman's terms, all a correlation coefficient of 0.6 tells you is that an increase of red apples is *often* (but not always) associated with an increase in Red Delicious apples. But we're not sure even how large that increase would be: it could be 1 Red Delicious for every 10 red, or 9 Red Delicious for every 10 red.

Next, let's say that the correlation is 1.0. As you can see, that's a perfect line of best fit - so for every A increase in X (red apples) you will always get B increase in Y (Red Delicious apples). In other words, here you can say that an increase of red apples is *always* associated with a simultaneous increase in Red Delicious apples. But again, we still cannot discuss magnitude because we still don't know the slope of the line of best fit.

Finally, let's say that the correlation coefficient for "Red Delicious Apples" is 0.6, and the correlation coefficient for "Fuji Apples" is 0.3. Here, we can say that an increase of red apples is often associated with an increase of both Red Delicious and Fuji apples. However, you are *more likely* to see an increase in Red Delicious apples than you are to see an increase in Fuji apples. Yet again, however, this says nothing about magnitude: the line of best fit for "Fuji apples" might be much steeper than the line of best fit for "Red Delicious apples." The big difference is that between the two graphs, the dots are more clustered around the line of best fit for "Red Delicious" than they are for "Fuji."

As you have probably guessed at this point, the key missing value here is the SLOPE OF THE LINE OF BEST FIT. If you know the slope, then you can make a definitive statement about magnitude. If we know that the slope is 0.6 (and that it's statistically significant), THEN we can say that an increase of 10 red apples is associated with an increase of 6 Red Delicious apples. You can get the slope of the line of best fit by running a linear regression, but that's a whole 'nother lesson.
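The "correlation doesn't fix the slope" point is easy to demonstrate with a sketch (my own, not from the thread): two datasets with the *same* correlation but regression slopes that differ by a factor of ten.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 100_000, 0.6
x = rng.standard_normal(n)
noise = rng.standard_normal(n)
y1 = rho * x + np.sqrt(1 - rho**2) * noise  # slope ~0.6
y2 = 10 * y1                                # same correlation, slope ~6

r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
b1 = np.polyfit(x, y1, 1)[0]  # slope of line of best fit
b2 = np.polyfit(x, y2, 1)[0]
print(round(r1, 2), round(r2, 2))  # both ~0.6
print(round(b1, 1), round(b2, 1))  # ~0.6 vs ~6.0
```

Rescaling y doesn't change the correlation at all, but it multiplies the slope directly, which is exactly why r alone can't answer the "ten apples in, six apples out" question.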

So let's say I measure 100 variables. I take one of those variables, and find the correlation coefficient between it and all the other 99 variables. I then sort these 99 variables according to this correlation.

In the past, I'd have taken the top ten variables in this list and said "these are the variables that correlate most closely with the one key variable. If you want to affect this one key variable, these ten most highly correlated variables are critical." I understood correlation does not imply causation, but thought this was still the best (though flawed) way of understanding the data.

From what Richy and ChopperDave have explained, this is problematic, and not just because correlation does not imply causation. The scatter plots for the top ten most highly correlated factors could form perfectly flat lines, yes? So they could have a correlation coefficient of 1.0 but a slope of, say, 0.001. Perfectly correlated, but irrelevant.

Is this correct? What should I be doing instead?

Well, there are a couple problems with this sort of approach.

For one, things can be highly correlated without having any causal link. For example: if you were to regress pre-tax income for 100,000 people on post-tax income for the same 100,000 people, you'd find that their R-squared value (R = correlation coefficient; in practice most statisticians use R-squared because it gives you a spectrum of 0.00 to 1.00) would be 1.00 and that they are perfectly correlated. But obviously, pre-tax income doesn't have any practical causal effect on post-tax income; the tax rate does.
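A minimal sketch of that pre-tax/post-tax point (mine, not from the thread; the flat 25% tax rate is an invented number): a deterministic linear transformation gives r = 1 and R-squared = 1 with no causal content at all.

```python
import numpy as np

rng = np.random.default_rng(3)
pre_tax = rng.uniform(20_000, 200_000, size=10_000)
post_tax = pre_tax * 0.75  # flat 25% tax, purely mechanical

r = np.corrcoef(pre_tax, post_tax)[0, 1]
print(round(r, 4), round(r**2, 4))  # 1.0 1.0
```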

You can think of other similar examples of things that would produce high R values. Let's say you ran a regression of individual standardized test scores in 2011 vs. individual standardized test scores in 2012. Again, the R-squared would be close to 1.00, but I seriously doubt that having low test scores in 2011 would "cause" a child to have low test scores in 2012. Something else might produce an R-squared that's much lower but could have more of a causal effect. Let's say that you ran a regression of race on individual standardized test scores in 2012 and found that the R-squared was around 0.40. Would we say that a student's race is a less important influence on his standardized test scores than his test performance in 2011? No! Your common sense should tell you otherwise.

The second problem with this approach is that there might be variables that are jointly correlated. You might find that if you regress mothers' education level on their children's IQ, R-squared would be near 0, and that if you regress fathers' education level on their children's IQ, R-squared would also be near 0; but if you ran a joint significance test, you'd find that regressing mothers' and fathers' education levels TOGETHER on their children's IQ might have an R-squared of 0.4. In other words, parental education level might seem irrelevant if you look at each parent individually, but it'll be much more relevant if you look at both of them together.
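Here's a sketch of that joint-correlation effect (my own construction, not from the thread; the variable names and scales are invented). Two predictors that each explain almost nothing alone can explain a lot together when they share a large component that cancels out:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
u = 10 * rng.standard_normal(n)       # large shared component
x1 = u + rng.standard_normal(n)       # e.g. mother's education (invented)
x2 = -u + rng.standard_normal(n)      # e.g. father's education (invented)
y = x1 + x2 + rng.standard_normal(n)  # depends on both together

r2_alone = np.corrcoef(x1, y)[0, 1] ** 2  # near 0

# Joint regression of y on x1 and x2 (with intercept)
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2_joint = 1 - resid.var() / y.var()      # roughly 2/3

print(round(r2_alone, 3), round(r2_joint, 2))
```

Screening variables one at a time by pairwise correlation would throw both predictors away, even though together they explain most of the variation.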

Well, this isn't really the right way to read that. If your correlation coefficient were 1.0 and your slope were 0.001, then what you'd be saying is that each additional 1000 apples will almost always yield 1 additional Red Delicious apple, on average. That's not really "irrelevant" unless you really like Red Delicious apples ;)

But let's think about the opposite situation. Let's say you run a regression of red apples on Red Delicious apples. (What you'd effectively be asking here is "What is the probability that if I add a red apple, I will also be adding a Red Delicious?") Let's say that the slope of the regression turns out to be 0.2 and it is highly statistically significant, but the r-squared turns out to be 0.001. What does that mean?

Well, it means that there's about a 20% chance that by adding a red apple you'll also be adding a Red Delicious, or for every 10 red apples 2 will be Red Delicious. But we've also discovered that there's no real correlation between adding a red apple and a Red Delicious. How could THAT happen? Well, let's say you lived in a world where 99.9% of all apples were red. That means that you could add virtually ANY apple regardless of color, and there would still be a 20% chance that you'd be adding a Red Delicious. In other words, the fact that you've chosen to select only red apples here is practically irrelevant, and you'd probably get the same results if you just added an apple randomly.
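The general "significant slope, tiny R-squared" situation is easy to simulate (my own sketch, not from the thread; the effect size and noise level are invented). With enough data a slope of 0.2 is estimated precisely even when the predictor explains almost none of the variance:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.standard_normal(n)
y = 0.2 * x + 6.0 * rng.standard_normal(n)  # heavy noise swamps x

slope = np.polyfit(x, y, 1)[0]
r2 = np.corrcoef(x, y)[0, 1] ** 2
print(slope)  # close to 0.2
print(r2)     # close to 0.001
```

The slope is real and precisely measured, but knowing x barely narrows down y; that gap between the two numbers is exactly the gap this post is describing.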

OK, this is going to require a basic understanding of econometrics, but I'll explain it the best I can.

Basically, you shouldn't be using R or R-squared to determine causality at all, because that's not what R-squared is good for. What R-squared is good for is demonstrating something's EXPLANATORY or PREDICTIVE POWER, which is a different concept.

Go back to my previous example with the apples. What we demonstrated there is that choosing to add only red apples doesn't really affect the probability that new apples will be a Red Delicious in any practical way. We had a regression slope that was statistically significant and perfectly true, but didn't have any practical explanatory value -- you might as well have been selecting apples randomly.

The "correct" way to use R-squared is to run it on multiple variables at the same time. (Which is perfectly possible if a little mathematically complicated, so I won't explain it here.) That way, you can see if adding a variable into a regression will have any added explanatory value.

Let's say we wanted to figure out what affects the average student's standardized test scores. We run a regression of race on test scores, and find that the R-squared is 0.40. Ok, so we've learned that race has some explanatory power. Now, we might run a regression of race AND gender on test scores, and find that the R-squared for the entire regression is now 0.60. This is great -- now we know that taken together, race and gender have significant explanatory power. Then we might add in parental income and find that the R-squared for the entire regression is now 0.75. Even better! Then we might add in "eye-color" and find that the R-squared remains at 0.75. Oops -- looks like eye color doesn't really have any explanatory power here, and can be left out of the final regression. (Adding variables to a regression cannot decrease the R-squared; it can only increase it or leave it the same.) All in all, we can say that if we know a student's race and gender and the income of his/her parents, we can predict his/her likely test score with fairly high confidence.
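That stepwise story can be reproduced in a sketch (mine, not from the thread; all the variables and effect sizes are invented stand-ins): R-squared climbs as informative predictors are added and stays flat when an irrelevant one goes in.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
race = rng.standard_normal(n)
gender = rng.standard_normal(n)
income = rng.standard_normal(n)
eye_color = rng.standard_normal(n)  # unrelated to the outcome
score = race + gender + income + rng.standard_normal(n)

def r_squared(y, *cols):
    """R-squared of an OLS regression of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

r2s = [r_squared(score, race),
       r_squared(score, race, gender),
       r_squared(score, race, gender, income),
       r_squared(score, race, gender, income, eye_color)]
print([round(v, 2) for v in r2s])  # increasing, then flat
```

The last step is the "weed out eye-color" move: adding it never lowers the in-sample R-squared, but it raises it by essentially nothing.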

Again, we should be careful not to confuse "explanatory/predictive power" with "causality." I could throw in a variable for "standardized test scores 2011" and that would shoot R-squared up to 1.00 -- in other words, if we have a student's test score in 2011, we can predict his test score in 2012 with extremely high confidence. But that doesn't mean that 2011 test scores (or any of the other variables) necessarily have any causal effect on 2012 test scores.

In sum, the value of R-squared is that in combination with a regression slope, it can tell you whether a variable has any practical significance. The lower the R-squared, the less predictive value a variable has (and thus, the less likely any significant causal relationship can be demonstrated). Statisticians essentially use it to "weed out" unnecessary variables from their regressions, like my "eye-color" example above.

I'll mull this over tonight to see if I have any questions.

Also, R-squared is a measure of how much a predictor explains the variation of your dependent variable. I suppose "explanatory power" is a great non-math way of looking at that, but I think the true definition makes things clearer. The example used, though, is very good.

It should be helpful to remember that with regression, you are saying that within the population represented by your dependent variable, there exist "levels" (independent variable values) such that knowing what level you are at generates a "different" distribution for your dependent variable. If two variables were perfectly correlated, then every data point would lie on a straight line; however, if some points do not lie on your line of best fit, then some variation of your dependent variable is not explained by your independent variable (which is what R-squared measures). So with an R-squared of .75, 75% of the variation observed in the dependent variable "test score" is explained by race, gender and parental income.
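A quick numerical check of that definition (my own sketch, not from the thread): for a single predictor, R-squared computed as 1 - Var(residuals)/Var(y) equals the squared correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(50_000)
y = 2.0 * x + rng.standard_normal(50_000)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2_variance = 1 - resid.var() / y.var()   # fraction of variance explained
r2_corr = np.corrcoef(x, y)[0, 1] ** 2    # squared correlation

print(round(r2_variance, 3) == round(r2_corr, 3))  # True
```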

Here is a graph visualizing what is going on (pay no attention to the numbers or symbols):

It is from a great book on regression (simple and otherwise) by Kutner; the example is looking at the number of bids (contracts) prepared by a company and the average number of hours expected to complete those contracts. The curves you see for X=25 and X=45 are the distribution of Y|X (Y given X).

Usually, examples on linear regression will just give you the scatterplot and the line of best fit, which is fine - but I think it misses a lot of what is actually going on (namely that a regression equation gives you a MEAN response, not an exact value given some independent variable).