Re: [R] Prediction with multiple zeros in the dependent variable
On 08-Sep-05 John Sorkin wrote:
> I have a batch of data in which each line contains three values: calcium score, age, and sex. I would like to predict calcium scores as a function of age and sex, i.e. calcium=f(age,sex). Unfortunately the calcium scores have a very ugly distribution. There are multiple zeros, and multiple values between 300 and 600. There are no values between zero and 300. Needless to say, the calcium scores are not normally distributed; however, the values between 300 and 600 have a distribution that is log normal. As you might imagine, the residuals from the regression are not normally distributed and thus violate the basic assumption of regression analyses. Does anyone have a suggestion for a method (or a transformation) that will allow me to predict calcium from age and sex without violating the assumptions of the model?
>
> Thanks,
> John

From your description (but only from your description) one might be tempted to suggest (borrowing a term from Joe Schafer) a semi-continuous model. This means that each observation either takes a discrete value or takes a value from a continuous distribution. In your case this might be

  Score = 0 with probability p, which is a function of Age and Sex
  Score = X with probability (1-p), where X has a log-normal distribution.

Whether using such a model, for data arising in the context you refer to, is reasonable depends on whether Calcium Score = 0 is a reasonable description of a biological state of things. Even if it is not a reasonable biological state, it may be a reasonable description of the outcome of a measurement process (e.g. too small to measure), in which case there may be a consequential issue -- what is the likely distribution of calcium values which give rise to Score = 0? (Though your data may be uninformative about this.) However, if your aim is simply predicting calcium scores, then this may be irrelevant.
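For reference, the semi-continuous model just described can be written compactly. This is only a restatement, not part of the original post; the symbols S (Score), a (Age), g (Sex), mu and sigma are notation assumed here for illustration.

```latex
\Pr(S = 0 \mid a, g) = p(a, g), \qquad
\log S \mid (S > 0,\ a, g) \sim N\bigl(\mu(a, g),\ \sigma^2\bigr)
```

Under this notation the whole distribution of S given (a, g) is a point mass at zero with weight p(a, g) plus a log-normal component with weight 1 - p(a, g).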
With such a model, you should be able to make progress by using a log-linear model for the probability p (which may be adequately addressed by simply using a logistic regression for the event Score = 0 or, equivalently, Score != 0, though you may need to be careful about how you represent Age as a covariate; Sex, being binary, should not present problems). This then allows you to predict the probability of a zero score, and the complementary probability of a non-zero score.

Then you can consider the problem of estimating the relationship between Score and (Age, Sex) conditional on Score != 0. This, in turn, is no more (and no less!) complicated than estimating the continuous distribution of non-zero scores from the subset of the data which carries such scores.

If the distribution of non-zero scores were (as you suggest) a simple log-normal distribution, then a regression of log(Score) on Age and Sex might do well. However, from your description, it may not be a simple log-normal. The absence of scores between 0 and 300, and the containment of score values between 300 and 600, suggests a 3-parameter log-normal in which, as well as the mean and SD for the normal distribution of log(X), there is also a lower limit S0, so that it is log(S - S0) which has the N(mean,SD^2) distribution. The distribution might be more complicated than this.

So, in summary, provided a semi-continuous model is acceptable, you can proceed by estimating its two aspects separately: the discrete part by a logistic (or other suitable binary) regression, using 'glm' in R; the continuous part by a suitable regression (using e.g. 'lm' in R), perhaps after a suitable transformation (though this may need care). In each case, only the relevant part of the data will be needed: the proportions with Score = 0 and Score != 0 on the one hand, and the values of Score where Score != 0 on the other, in each case using the corresponding (Age, Sex) as covariates.
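A minimal R sketch of the two separate fits summarised above, run on simulated stand-in data because the real data are not shown in this thread. The data frame `dat`, the sample size and every coefficient in the simulation are assumptions made purely for illustration, and the continuous part uses the simple log-normal case rather than the 3-parameter variant.

```r
# Simulated stand-in data; all names and parameter values are hypothetical.
set.seed(1)
n   <- 500
age <- round(runif(n, 40, 80))
sex <- factor(sample(c("F", "M"), n, replace = TRUE))

# Assumed mechanism: probability of a zero score falls with age.
zero <- rbinom(n, 1, plogis(4 - 0.07 * age - 0.5 * (sex == "M")))

# Assumed simple log-normal distribution for the non-zero scores.
score <- ifelse(zero == 1, 0,
                exp(rnorm(n, 5.7 + 0.005 * age + 0.1 * (sex == "M"), 0.15)))
dat <- data.frame(score, zero, age, sex)

# Discrete part: logistic regression for the event Score == 0 (all rows).
fit_zero <- glm(zero ~ age + sex, family = binomial, data = dat)

# Continuous part: regression of log(Score), non-zero rows only.
fit_pos <- lm(log(score) ~ age + sex, data = dat, subset = score > 0)

summary(fit_zero)$coefficients
summary(fit_pos)$coefficients
```

The two fits share no parameters, which is why they can be estimated one after the other from different parts of the same data frame.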
Once you have these estimated models, they can be used straightforwardly for prediction: given Age and Sex, the Score will be zero with estimated probability p(Age,Sex) or, with probability (1 - p(Age,Sex)), will have a distribution implied by your regression. So the structure of the predicted values will be the same as the structure of the observed values.

All very straightforward, provided this is a reasonable way to go.

However, there is a complication in that the above might well not be a reasonable model (as hinted at above). As an example, consider the following (purely hypothetical) assumptions.

1. The true distribution of Calcium Score is (say) simple log-normal such that log(Score) is normal with mean linearly dependent on Age and Sex, in all subjects.

2. In attempting to measure true Score (i.e. in obtaining observed Calcium Score data), there is a probability that Score = 0 will be obtained, and this probability depends on the true Score (e.g. the smaller the true Score, the higher the probability of obtaining Score = 0).

The resulting non-zero score data will then
Re: [R] Prediction with multiple zeros in the dependent variable
John Sorkin wrote:
> I have a batch of data in which each line contains three values: calcium score, age, and sex. I would like to predict calcium scores as a function of age and sex, i.e. calcium=f(age,sex). Unfortunately the calcium scores have a very ugly distribution. There are multiple zeros, and multiple values between 300 and 600. There are no values between zero and 300. Needless to say, the calcium scores are not normally distributed; however, the values between 300 and 600 have a distribution that is log normal. As you might imagine, the residuals from the regression are not normally distributed and thus violate the basic assumption of regression analyses. Does anyone have a suggestion for a method (or a transformation) that will allow me to predict calcium from age and sex without violating the assumptions of the model?
>
> Thanks,
> John
>
> John Sorkin M.D., Ph.D.
> Chief, Biostatistics and Informatics
> Baltimore VA Medical Center GRECC and
> University of Maryland School of Medicine Claude Pepper OAIC

John - first I would try a proportional odds model, with zero as its own category, then either treating all other values as continuous or collapsing them into 20-tiles. If the PO assumption happens to hold (look at partial residual plots) you have a simple solution.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
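A rough R sketch of the proportional odds suggestion above, with zero as its own category and the non-zero scores collapsed into 20 quantile groups. It uses polr() from MASS, which is a swapped-in choice and not necessarily the function Frank had in mind; the data are simulated, and every name and coefficient (including the rescaled covariate `age10`) is an assumption for illustration. Checking the PO assumption is not shown.

```r
library(MASS)  # provides polr()

# Simulated stand-in data; all names and parameter values are hypothetical.
set.seed(2)
n   <- 500
age <- round(runif(n, 40, 80))
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
zero  <- rbinom(n, 1, plogis(4 - 0.07 * age - 0.5 * (sex == "M"))) == 1
score <- ifelse(zero, 0,
                exp(rnorm(n, 5.7 + 0.005 * age + 0.1 * (sex == "M"), 0.15)))

# Ordinal response: 0 for zero scores, 1..20 for 20-tiles of non-zero scores.
pos    <- score > 0
breaks <- quantile(score[pos], probs = seq(0, 1, length.out = 21))
grp    <- integer(n)
grp[pos] <- cut(score[pos], breaks = breaks,
                include.lowest = TRUE, labels = FALSE)

dat <- data.frame(grp   = factor(grp, ordered = TRUE),
                  age10 = (age - 60) / 10,  # centred and scaled age
                  sex   = sex)

# Proportional odds (cumulative logit) model.
fit_po <- polr(grp ~ age10 + sex, data = dat, Hess = TRUE)
coef(fit_po)
```

A single set of slopes is shared across all 20 cut-points here, which is exactly the assumption that would need checking before relying on the fit.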
Re: [R] Prediction with multiple zeros in the dependent variable
On Thu, 8 Sep 2005, John Sorkin wrote:
> I have a batch of data in which each line contains three values: calcium score, age, and sex. I would like to predict calcium scores as a function of age and sex, i.e. calcium=f(age,sex). Unfortunately the calcium scores have a very ugly distribution. There are multiple zeros, and multiple values between 300 and 600. There are no values between zero and 300. Needless to say, the calcium scores are not normally distributed; however, the values between 300 and 600 have a distribution that is log normal.

[Coronary artery calcium by EBCT, I presume]

Our approach to modelling calcium scores is to do it in two parts. First fit something like a logistic regression model where the outcome is zero vs non-zero calcium. Then, for the non-zero scores, use something like a linear regression model for log calcium. You could presumably use such a model for prediction or imputation too, and you can work out means, medians etc. from the two models.

One particular reason for using this two-part model is that we find different predictors of zero/non-zero and of amount. This makes biological sense -- a factor that makes arterial plaques calcify might well have no impact until you have arterial plaques.

Or you could use smooth quantile regression (rq and related functions in the quantreg package).

-thomas
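As a sketch of how means and medians can be worked out from the two models: if p0 is the predicted probability of a zero score and the non-zero scores are log-normal with parameters mu and sigma, the overall mean is (1 - p0) * exp(mu + sigma^2/2), and the overall median is 0 when p0 >= 0.5 and otherwise the log-normal quantile at (0.5 - p0)/(1 - p0). The R code below applies this to simulated data; the data, the names and the covariate values in `new` are all assumptions for illustration.

```r
# Simulated stand-in data; all names and parameter values are hypothetical.
set.seed(3)
n   <- 500
age <- round(runif(n, 40, 80))
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
zero  <- rbinom(n, 1, plogis(4 - 0.07 * age - 0.5 * (sex == "M")))
score <- ifelse(zero == 1, 0,
                exp(rnorm(n, 5.7 + 0.005 * age + 0.1 * (sex == "M"), 0.15)))
dat <- data.frame(score, zero, age, sex)

# The two parts: zero vs non-zero, then log(score) among the non-zero.
fit_zero <- glm(zero ~ age + sex, family = binomial, data = dat)
fit_pos  <- lm(log(score) ~ age + sex, data = dat, subset = score > 0)

# Hypothetical subjects to predict for.
new <- data.frame(age = c(50, 70),
                  sex = factor(c("F", "M"), levels = levels(dat$sex)))

p0    <- predict(fit_zero, newdata = new, type = "response")  # P(score = 0)
mu    <- predict(fit_pos, newdata = new)      # mean of log(score), non-zero
sigma <- summary(fit_pos)$sigma               # residual SD on the log scale

# Mean of the mixture: point mass at 0 plus a log-normal component.
pred_mean <- (1 - p0) * exp(mu + sigma^2 / 2)

# Median of the mixture: 0 if at least half the mass sits at zero.
pred_median <- ifelse(p0 >= 0.5, 0,
                      qlnorm(pmax(0.5 - p0, 0) / (1 - p0), mu, sigma))

cbind(new, p0, pred_mean, pred_median)
```

These are plug-in estimates that ignore the uncertainty in the fitted coefficients, so they are point predictions only.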
Re: [R] Prediction with multiple zeros in the dependent variable
John:

1. As George Box long ago emphasized and proved, normality is **NOT** that important in regression, certainly not for estimation and not even for inference in balanced designs. Independence of the observations is far more important.

2. That said, it sounds like what you have here is a mixture of some sort. Before running off to do fancy modeling, I would work very hard to look for some kind of lurking variable or experimental aberration -- what was going on in the experiment or study that might have caused all the zeros? Was there an instrument problem? -- a bad reagent? -- improper handling of the samples? It might very well be that you need to throw away part of the data because it's useless, rather than artificially attempt to model it.

3. And having said that, if a comprehensive model IS called for, one rather cynical approach to take is just to add a grouping variable as a covariate that has a value of 1 for all data in the zero group and 2 for all the nonzero data. Your model is f(age,sex) = 0 for all data in group 1 and your linear or nonlinear regression for group 2. Of course, this merely cloaks the cynicism in respectable dress. It's hard for me to believe that it was Mother Nature and not some kind of experimental problem that you see.

A slightly less cynical approach might be to use some sort of changepoint model (in both age and sex) of the form f(age,sex) = g(age,sex) for age = k1 and sex = k2 and h(age,sex) otherwise. Well, perhaps **not** less cynical -- the response data are so widely separated that you'll just be using a bunch of extra (nonlinear, incidentally) parameters to essentially reproduce the use of a covariate.

So I guess the point is that unless you already have a previously developed nonlinear model that could explain the behavior you see (perhaps based on some kind of mechanistic reasoning) it's not a good idea to try to develop an artificial empirical model that comprehends all the data.
The fact is (a horrible phrase) that no modeling at all is needed for the most important message the data have to convey: rather, focus on the cause of the message instead of statistical artifice. Once you have determined that, you may be able to do something sensible. Clear thinking trumps muddy modeling every time.

(Hopefully, this is sufficiently inflammatory that others will vigorously and wisely dispute me.)

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John Sorkin
Sent: Wednesday, September 07, 2005 9:06 PM
To: r-help@stat.math.ethz.ch
Subject: [R] Prediction with multiple zeros in the dependent variable

I have a batch of data in which each line contains three values: calcium score, age, and sex. I would like to predict calcium scores as a function of age and sex, i.e. calcium=f(age,sex). Unfortunately the calcium scores have a very ugly distribution. There are multiple zeros, and multiple values between 300 and 600. There are no values between zero and 300. Needless to say, the calcium scores are not normally distributed; however, the values between 300 and 600 have a distribution that is log normal. As you might imagine, the residuals from the regression are not normally distributed and thus violate the basic assumption of regression analyses. Does anyone have a suggestion for a method (or a transformation) that will allow me to predict calcium from age and sex without violating the assumptions of the model?

Thanks,
John

John Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
Baltimore VA Medical Center GRECC and
University of Maryland School of Medicine Claude Pepper OAIC

University of Maryland School of Medicine
Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524

410-605-7119
-- NOTE NEW EMAIL ADDRESS:
[EMAIL PROTECTED]