Frank Katch replied with contact info for two statisticians he thought could help, Dave Hosmer and Stan Lemeshow. I contacted them and both very kindly replied, although they didn't resolve the issue for me. Ian Shrier also sought input from one of his senior epidemiologist colleagues (unnamed), and I had some valuable interactions with Ian as well. Their replies appear below.
To revisit the question, consider this example. You are interested in the effect of physical activity on health. You do a cross-sectional study in which you measure health, physical activity, and various other things that you know you ought to measure, because they might also predict health, and anyway, other people measure them so you'd better, too. In particular, you measure socioeconomic status (SES) and find that SES and physical activity are both positively correlated with health. Further, you find quite a strong correlation between SES and activity. (A substantial correlation between predictor variables is called substantial collinearity, by the way.) Now, people of high SES eat good food, live in toxin-free classy parts of town, read Time, and think they're alpha in every way. All these things could account for their good health. Oh, and they do a lot of good-quality deliberate exercise, but that might have nothing to do with their good health. It's all those other things that go with high SES. How do you analyze your data to address this potential for the effect of activity on health to be "confounded" by SES? By doing a multiple linear regression, of course. The effect of activity in a multiple linear regression that includes SES is the effect of activity "controlled for" SES; that is, the effect with SES effectively held constant. But how do you express the magnitude of the resulting effect of activity on health? That was the substance of my query to the list.
I finally answered this question to my own satisfaction by doing some simulations. I generated two predictor variables, X1 and X2, that were like SES and activity: correlated with each other to some extent that I could change, and correlated with a dependent variable Y, like health, to some extent that I could also change. I threw in an additional predictor X3 that was also correlated with Y but uncorrelated with X1 and X2, just to keep track of how that kind of variable behaved in such an analysis. I won't say any more about that one, other than it gave the right correlations in the multiple linear regression.
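For anyone who wants to play with this themselves, here is a minimal sketch of that kind of simulation in Python. The sample size, noise level, and the way Y is built from a shared latent variable are my illustrative choices, not a record of the original simulation code:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # large n so sampling variation is negligible

# A latent "true" variable that X1 and X2 both measure, like SES and activity
latent = rng.normal(size=n)
noise_sd = 0.5  # more noise -> lower correlation between X1 and X2

X1 = latent + noise_sd * rng.normal(size=n)
X2 = latent + noise_sd * rng.normal(size=n)
X3 = rng.normal(size=n)  # extra predictor, uncorrelated with X1 and X2

# Y ("health") depends on the shared latent variable and on X3, plus error
Y = latent + X3 + rng.normal(size=n)

r12 = np.corrcoef(X1, X2)[0, 1]  # correlation between the two collinear predictors
r1y = np.corrcoef(X1, Y)[0, 1]   # correlation of X1 with Y
```

With these settings the X1-X2 correlation comes out near 1/(1 + 0.5^2) = 0.8, and each of X1 and X2 correlates about 0.52 with Y, while X3 stays uncorrelated with them.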
You have two choices for interpreting the magnitude. You use either the regression coefficients (the terms in the multiple linear regression that convert values of the predictors into values of the dependent variable) or correlation coefficients. It's hard to get a good idea of magnitude from the regression coefficients without invoking Cohen's concepts in some manner. In other words, the between-subject standard deviation (variation) in the predictor and dependent have to come into the story. Correlation coefficients already have between-subject SDs built in, so they are good candidates for interpreting magnitude. The correlation coefficient for a given predictor in a multiple linear regression is called a partial correlation coefficient, but you can calculate it in several ways. More about that in a moment.
My simulations showed that the regression coefficients are bad measures for gauging magnitude when there is substantial correlation between X1 and X2. I pushed things to the limit to get a clear picture by making X1 and X2 exactly the same, apart from random noise. The more noise I added, the worse the correlation between the two. Because X1 and X2 were effectively the same, they both had the same correlation with Y in reality, but of course, in any sample their correlations with Y would differ a little because of the noise. I found that the regression coefficients for X1 and X2 on average were identical in the multiple linear regression, as they would have to be. Further, each was half the value that either had when the other was not included in the model. That makes sense too. But note that the value with both in the model is not zero. It is half what either one has on its own. Interpreting the regression coefficient for, say, X1 controlling for X2 would therefore be misleading, because it would give you the impression that X1 had half its effect in the presence of X2. But X1 and X2 are measuring the same thing, apart from noise, so whatever measure of magnitude you use for X1 in the presence of X2, it should be zero, not half the value when it's on its own. There is also an issue about the precision of the estimates of the regression coefficients when you have strong collinearity, but that's not an issue here. Others will disagree.
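The halving of the coefficients is easy to reproduce. This sketch (again with illustrative sample size and noise level, not the original code) makes X1 and X2 the same latent variable plus a little noise, then compares the coefficient X1 gets on its own with the coefficients X1 and X2 get together:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
latent = rng.normal(size=n)
X1 = latent + 0.1 * rng.normal(size=n)  # X1 and X2: the same thing apart from a little noise
X2 = latent + 0.1 * rng.normal(size=n)
Y = latent + rng.normal(size=n)

def ols_coefs(predictors, y):
    """Least-squares slopes; intercept included in the fit but dropped from the output."""
    A = np.column_stack([np.ones(len(y))] + predictors)
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

b_alone = ols_coefs([X1], Y)[0]            # X1 on its own
b1_both, b2_both = ols_coefs([X1, X2], Y)  # X1 and X2 together
```

On average b1_both and b2_both each come out near half of b_alone. The individual estimates wobble around that average in any one sample, which is the collinearity-ruins-precision issue at work.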
The partial correlation coefficients did much better, although they weren't perfect. First, it was obvious I should use what SAS calls a Type II (or simultaneous) partial, which means the partial correlation when you have controlled for any confounding effects of the other predictors. When I had little noise in the values of X1 and X2, and therefore a high correlation between them, the partial between Y and X1 with X2 in the model was near enough to zero, and vice versa, which is the right answer. Partial correlations beat regression coefficients.
When there was more noise in the relationship between X1 and X2, and therefore a lower correlation between them, the partial correlation for one of them in the presence of the other started to creep up from zero. The poorer the correlation between X1 and X2, the bigger the partial for either of them. This occurred even though X1 and X2 were measuring exactly the same thing as far as Y was concerned--I made sure of that in the way I generated the variables. In other words, noise in a predictor reduces your ability to fully control for its confounding effects in a multiple linear regression. I knew all that from way back, but it was good to be reminded. There is an important consequence for the interpretation of studies of population health: be skeptical about reports that state things like "even after we controlled for SES, diet etc etc, there was still an effect of activity on health". It may be that there isn't actually any substantial effect of activity on health in reality, once you control for those other things, because lifestyle predictors are often noisy, so they don't control properly.
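You can see the creep numerically with the standard first-order partial-correlation formula. In this sketch (noise levels are illustrative), Y depends only on what X1 and X2 share, yet the partial of Y with X1 given X2 rises as the X1-X2 correlation falls:

```python
import numpy as np

def partial_r(ry1, ry2, r12):
    """First-order partial correlation of Y and X1, controlling for X2."""
    return (ry1 - ry2 * r12) / np.sqrt((1 - ry2**2) * (1 - r12**2))

rng = np.random.default_rng(1)
n = 200_000

def simulated_partial(noise_sd):
    latent = rng.normal(size=n)
    X1 = latent + noise_sd * rng.normal(size=n)
    X2 = latent + noise_sd * rng.normal(size=n)
    Y = latent + rng.normal(size=n)  # Y depends only on what X1 and X2 share
    r = np.corrcoef([Y, X1, X2])
    return partial_r(r[0, 1], r[0, 2], r[1, 2])

low_noise = simulated_partial(0.2)   # high X1-X2 correlation: partial close to zero
high_noise = simulated_partial(1.0)  # low X1-X2 correlation: partial creeps well up
```

With these settings the partial goes from roughly 0.13 at low noise to roughly 0.33 at high noise, even though the "right" answer for X1 controlling for X2 is zero throughout.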
But my original question, about which partial correlation to use, is still unanswered! You can calculate the correlation as a semi-partial, which means effectively that the magnitude of the effect is interpreted using the SD of the original dependent variable. Or you can calculate it as a (full) partial, which means you effectively use the SD of the residual variation in the dependent variable after all the other predictors have been taken into account. The partial will be larger than the semi-partial, because the partial is the expected correlation for subjects all with the same SES, diet etc etc. Is it the right one to use? Dunno. It looks bigger, so if you want a big correlation, use the partial, not the semi-partial? But if you want your answer to be little or no effect, use the semi-partial, not the partial? That's not good science.
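For the record, the two statistics differ only in their denominator, which is why the (full) partial is always at least as big as the semi-partial. A sketch with made-up correlations (the values 0.5, 0.4, 0.6 are purely illustrative):

```python
import math

def semi_and_full_partial(ry1, ry2, r12):
    """Semi-partial (part) and full partial correlation of Y with X1, given X2."""
    num = ry1 - ry2 * r12
    semi = num / math.sqrt(1 - r12**2)                   # scaled to the SD of raw Y
    full = num / math.sqrt((1 - ry2**2) * (1 - r12**2))  # scaled to the SD of residual Y
    return semi, full

# illustrative values: r(Y,X1)=0.5, r(Y,X2)=0.4, r(X1,X2)=0.6
semi, full = semi_and_full_partial(0.5, 0.4, 0.6)
```

Here the semi-partial comes out at 0.325 and the full partial at about 0.355: same numerator, smaller denominator for the full partial because the other predictor has already soaked up some of the variance in Y.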
A further important point is what I call the danger of throwing out the baby (activity) with the bathwater (SES and the other predictors in the model). In the example, controlling for SES left nothing for activity. But that doesn't mean activity isn't important. It just means that its effects go along with SES. So make sure you look at the correlation between the dependent (health) and each predictor (activity, SES...) on its own--the simple raw correlations, in other words. If you find the correlation for activity is substantial on its own but drops to near zero when you control for the other variables, your conclusion is that activity could still be important, even though its effect is accounted for by the other predictors. And if you find that it is still important after you control for the other predictors, and some of them are noisy, make sure you alert the reader to the possibility that the controlling might not be that good.
The person for whom I initiated this enquiry also wanted confidence limits for the partial correlation. Hmmm... Stats packages probably won't give you that, but they will give you a p value for the predictor controlled for all the other predictors. To convert that p value and correlation coefficient into confidence limits, convert the correlation to a Fisher z using the FISHER() function in Excel. That statistic has a normal sampling distribution, so put it and the p value into my spreadsheet for confidence limits, the bit at the top that deals with normally distributed variables. Make the degrees of freedom something large, like 1000. Then convert the confidence limits back into correlation coefficients using the FISHERINV() function in Excel. Voila. You can do this for either the part or the partial correlations. Naturally you then interpret the confidence limits in relation to substantial values. Or use the chances of benefit and harm in the confidence-limits spreadsheet. The Fisher z transform of the smallest worthwhile correlation is effectively the same as the smallest correlation, 0.1.
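Here is the same recipe as a function, if you'd rather not do it in Excel. Excel's FISHER and FISHERINV are just atanh and tanh, and this sketch uses the normal distribution directly rather than 1000 degrees of freedom, which amounts to the same thing; the example r and p values are made up:

```python
import math
from statistics import NormalDist

def r_confidence_limits(r, p, level=0.95):
    """Confidence limits for a (part or partial) correlation, from its value
    and its two-sided p value, via the Fisher z transform.
    Excel equivalents: FISHER(r) = atanh(r), FISHERINV(z) = tanh(z)."""
    z = math.atanh(r)                           # FISHER(r)
    z_from_p = NormalDist().inv_cdf(1 - p / 2)  # z-score implied by the p value
    se = abs(z) / z_from_p                      # back out the standard error
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * se
    return math.tanh(z - half_width), math.tanh(z + half_width)  # FISHERINV

lower, upper = r_confidence_limits(0.30, 0.01)  # made-up r and p
```

For r = 0.30 with p = 0.01, the 95% limits come out at roughly 0.07 and 0.50, which you would then interpret against the smallest worthwhile correlation of 0.1.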
Will
Here are the replies I got, edited a little. I disagree with most of the points in these replies, but I am not really certain about this stuff.
From: Dave Hosmer
In your example there is probably an interaction (effect modification), so adjustment (confounding) is not relevant. See Hosmer and Lemeshow, Applied Logistic Regression, Second Edition, for a discussion of this in the context of logistic regression.
As for coefficients, the model must be fit on data that have been standardized ((data − mean)/SD) to be able to compare coefficients. The regression coefficient provides an estimate of effect holding all other model covariates constant, unless there is an interaction. Personally I hate correlations of any type and use coefficient-based estimates of effect.
From: Stanley Lemeshow
Perhaps these few pages of overheads I use in my class will be helpful to you. The program I used in the notes is Stata. A good reference is the book Applied Regression Analysis and Multivariable Methods, 3rd Edition, by Kleinbaum, Kupper, Muller and Nizam.
[From what Stan had in his overheads, it looks like he uses the partial, not the semi-partial. He gave no rationale, however.]
From: Ian Shrier <[EMAIL PROTECTED]>
Okay, I have an answer from a senior epidemiologist who is very well respected at our institution. I will later get a statistician's viewpoint.
In epidemiology, we focus on the coefficients and the confidence intervals for the coefficients. The fitting of the overall model is important, but not the partial correlation coefficients.
From an epidemiological perspective, correlation coefficients are not useful. First, in a regression model, X predicts Y. Correlation coefficients are based on the premise that there is a correlation between the two, not that one predicts the other. Second, the concept that correlation coefficients explain the variance is only internal. As samples vary from study to study, the "explaining" would vary. More importantly, the total variance would vary from sample to sample, and therefore the amount of explaining is not very helpful. In other words, the correlation coefficients are not transferable from sample to sample, but they are helpful in model selection, because that is internal to the one study.
I will try to see if the statisticians agree.
[later]...The statistician I spoke to deals mostly with model selection. And he is a Bayesian, so he uses Bayesian techniques. He did say that the partial correlation coefficients will probably only be meaningful if every variable is normalized so that the variances are normalized, if I understood him correctly... ...Conceptually, I agree with the epidemiologist's approach: look at the coefficients and confidence intervals. Which variables belong in the model (i.e., model selection) must be much more than numbers and include causal pathways and an understanding of mechanisms. Overall model fitting is important, but your way of looking at partial correlations does not seem to help me much.
Post messages to [EMAIL PROTECTED] To (un)subscribe, send any message to sportscience-(un)[EMAIL PROTECTED] View all messages at http://groups.yahoo.com/group/sportscience/.
