Re: [R] Discretize continous variables....
Frank E Harrell Jr [EMAIL PROTECTED] [Sun, Jul 20, 2008 at 12:20:28AM CEST]:
> Johannes Huesing wrote:
> > Because regulatory bodies demand it?
> [...]
> And how anyway does this relate to predictors in a model?

Not at all; you're correct. I was mixing the topic of this discussion up with another kind of silliness.

I had a discussion with a biometrician in a pharmaceutical company though who stated that when you have only one df to spend it will be better to dichotomise it at a clinically meaningful point than to include it as a linear term. He kept the discussion on the ground of laboratory measurements like sodium, where a deviation from normal ranges is very significant (and unlike, say, cholesterol, where you have a gradual interpretation of the value).

He has a point there, but in general the reason for sacrificing information is a mixture of laziness, the preference for presenting data in tables and to keep the modelling consistent with the tables (for instance to assign an odds ratio to each cell).

-- 
Johannes Hüsing
mailto:[EMAIL PROTECTED]
http://derwisch.wikidot.com

There is something fascinating about science. One gets such wholesale
returns of conjecture from such a trifling investment of fact.
(Mark Twain, Life on the Mississippi)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Discretize continous variables....
Johannes Huesing wrote:
> Frank E Harrell Jr [EMAIL PROTECTED] [Sun, Jul 20, 2008 at 12:20:28AM CEST]:
> > Johannes Huesing wrote:
> > > Because regulatory bodies demand it?
> > [...]
> > And how anyway does this relate to predictors in a model?
>
> Not at all; you're correct. I was mixing the topic of this discussion up with another kind of silliness.
>
> I had a discussion with a biometrician in a pharmaceutical company though who stated that when you have only one df to spend it will be better to dichotomise it at a clinically meaningful point than to include it as a linear term. He kept the discussion on the ground of laboratory measurements like sodium, where a deviation from normal ranges is very significant (and unlike, say, cholesterol, where you have a gradual interpretation of the value).
>
> He has a point there, but in general the reason for sacrificing information is a mixture of laziness, the preference for presenting data in tables and to keep the modelling consistent with the tables (for instance to assign an odds ratio to each cell).

Nice points. I think the desire to be able to present things in tables is a major reason.

The biometrician's idea that a piecewise flat line with one jump will fit a dataset better than a linear effect is quite a leap in logic. If I only have one d.f. to spend I'll take linear any day, but better to spend a little more and fit a smooth nonlinear relationship. A coherent approach is to shrink the fit down to the effective number of parameters the dataset will support estimating.

There is no clinical laboratory measure that has a jump discontinuity in its effect on mortality or other patient outcomes. The fact that reference ranges exist (which are based only on supposedly normal subjects and don't relate to the risk of an outcome) doesn't mean we should use them in formulating independent or dependent variables.

It is common but distorted logic to want to make an odds ratio in a model be comparable to one in a table from which regression coefficients were just anti-logged (so that 1-unit changes could be used). The tabled odds ratio is a kind of crude population averaged odds ratio that may not apply to a single subject in the study.

My book has many examples where laboratory measurements are related to risk using restricted cubic splines.

Frank

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
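The one-d.f. comparison above can be tried on simulated data. What follows is only a sketch under stated assumptions: the variable lab, the coefficients, the cut at 1 and the sample size are all invented for illustration, and the true effect is assumed to be smooth and monotone. It uses base R's glm() with ns() from the splines package as a stand-in for a restricted cubic spline; with the rms package one would presumably write something like lrm(y ~ rcs(lab, 4)) instead.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

set.seed(2008)
n   <- 5000
lab <- rnorm(n)   # hypothetical standardized laboratory value
# Assumed truth: a smooth, monotone effect on the log-odds, with no jump anywhere
y   <- rbinom(n, 1, plogis(-1 + 0.8 * lab + 0.1 * lab^3))

above_ref <- as.integer(lab > 1)   # dichotomise at an arbitrary "reference limit"

fit_cut <- glm(y ~ above_ref,       family = binomial)  # 1 d.f., flat with one jump
fit_lin <- glm(y ~ lab,             family = binomial)  # 1 d.f., linear
fit_spl <- glm(y ~ ns(lab, df = 3), family = binomial)  # 3 d.f., smooth nonlinear

# Lower AIC means a better fit after penalizing the parameters spent
aic <- c(cut = AIC(fit_cut), linear = AIC(fit_lin), spline = AIC(fit_spl))
print(round(aic, 1))
```

Under these assumptions the dichotomised fit should come out clearly worst. How much the spline gains over the linear term depends on how curved the assumed truth is, so the numbers say nothing about any real laboratory measurement.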
Re: [R] Discretize continous variables....
milicic.marko wrote:
> Hi R helpers,
>
> I'm preparing a dataset to fit a logistic regression model with lrm(). I have various continuous and discrete variables and I would like to:
>
> 1. Optimally discretize continuous variables (optimally means maximizing information value, IV, for example)

This will result in effects in the model that cannot be interpreted and will ruin the statistical inference from the lrm. It will also hurt predictive discrimination. You seem to be allergic to continuous variables.

> 2. Regroup discrete variables to achieve perhaps a smaller number of levels and better information value...

If you use the Y variable to do this the same problems will result. Shrinkage is a better approach, or using marginal frequencies to combine levels. See the pre-specification of complexity strategy in my book Regression Modeling Strategies.

Frank

> Please suggest if there is some package providing this or similar functionality for discretization... if there is no package please suggest how to achieve this.
>
> Many thanks helpers.
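Combining levels by marginal frequencies, as suggested in the reply above, could look roughly like the following in base R. This is a sketch only: the helper pool_rare_levels(), its min_n threshold and the region factor are hypothetical names made up for the example. The point it illustrates is that the pooling rule looks only at how often each level occurs and never at Y.

```r
# Pool levels with fewer than min_n observations into a single level,
# using only the predictor's own frequencies (the response is never consulted).
pool_rare_levels <- function(x, min_n, other = "other") {
  x      <- as.factor(x)
  counts <- table(x)
  rare   <- names(counts)[counts < min_n]
  # Assigning the same label to several levels merges them
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Hypothetical discrete predictor with two sparse levels
region <- factor(c(rep("north", 40), rep("south", 35),
                   rep("east", 3),   rep("west", 2)))
pooled <- pool_rare_levels(region, min_n = 10)
print(table(pooled))
```

Because the threshold is fixed before looking at the outcome, the inference from the later model fit is not distorted the way outcome-driven regrouping would distort it.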
Re: [R] Discretize continous variables....
This time I agree with Rolf Turner. This sounds like homework. Whether or not, type ?ifelse at the R prompt.

Frank is right, it leads to a loss of information. However, I think it remains interpretable. Further, it is common practice in certain fields, and it may be a reasonable way to check whether mostly outliers in the X drive your results (although other approaches are available for that as well). The main underlying question, however, should be: do you have reason to expect that the response differs by the groups you create rather than by the values of the continuous variable?

Regarding question 2: I thought you meant that you want to reduce the number of levels (say 4) to a smaller number of levels (say 2) for one of your independent variables (i.e. one of the Xs), not Y. This makes sense only if there is a good conceptual reason to group these categories, not just to get significance.

Best,
Daniel
Re: [R] Discretize continous variables....
Daniel Malter wrote:
> This time I agree with Rolf Turner. This sounds like homework. Whether or not, type ?ifelse at the R prompt.
>
> Frank is right, it leads to a loss of information. However, I think it remains interpretable. Further, it is common practice in certain fields, and

I have to disagree. It is easy to show that odds ratios so obtained are functions of the entire distribution of the predictor in question. Thus they do not estimate a scientific quantity (something that can be interpreted out of context). For example, if age is cut at 65 and one were to add to the sample several subjects aged 100, the >=65 : <65 odds ratio would change even if the age effect did not.

> it may be a reasonable way to check whether mostly outliers in the X drive your results (although other approaches are available for that as well). The main underlying question, however, should be: do you have reason to expect that the response differs by the groups you create rather than by the values of the continuous variable?

Regression splines can help. Sometimes the splines are stated in terms of the cube root of the predictor to avoid excess influence.

Frank
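The age-65 example above can be checked with a few lines of arithmetic. This is a minimal sketch, not anything from the thread: the logistic risk curve, its intercept and slope, and the two age samples are all assumptions chosen for illustration. The continuous age effect is held fixed, and only the age distribution changes.

```r
# Assumed true model: log-odds of the outcome rise linearly with age.
risk <- function(age, intercept = -6, slope = 0.07) {
  plogis(intercept + slope * age)
}

# Odds ratio for age >= cut versus age < cut, from each group's mean risk.
dichotomized_or <- function(ages, cut = 65) {
  p_hi <- mean(risk(ages[ages >= cut]))
  p_lo <- mean(risk(ages[ages <  cut]))
  (p_hi / (1 - p_hi)) / (p_lo / (1 - p_lo))
}

ages_a <- 40:80                    # hypothetical sample, one subject per year of age
ages_b <- c(ages_a, rep(100, 10))  # the same sample plus ten subjects aged 100

or_a <- dichotomized_or(ages_a)
or_b <- dichotomized_or(ages_b)

# The per-year odds ratio, exp(slope), is identical in both samples;
# only the >=65 : <65 odds ratio moves.
print(c(or_a = or_a, or_b = or_b, per_year = exp(0.07)))
```

The two dichotomized odds ratios differ even though risk() is the same function in both cases, which is the sense in which the cut-point odds ratio is a property of the sample's age distribution and not of the age effect alone.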
Re: [R] Discretize continous variables....
True. Thanks for the clarification. Is your conclusion from that that the findings in such a case should only be interpreted in the specific context (with the awareness that they do not apply to changing contexts), or that such an approach should not be taken at all?
Re: [R] Discretize continous variables....
Daniel Malter wrote:
> True. Thanks for the clarification. Is your conclusion from that that the findings in such a case should only be interpreted in the specific context (with the awareness that they do not apply to changing contexts), or that such an approach should not be taken at all?

The latter, in general; in specific cases the former. But even then why condition on incomplete information when complete information is available? I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?

Frank
Re: [R] Discretize continous variables....
Frank/Daniel,

Thank you for the very good discussion on this. The reason I'm doing this is that it is common industrial practice to group a continuous variable (say age) into a couple of buckets while developing scorecards to be used by business people. I don't see the reason why I shouldn't discretize the variable AGE if I manage to maintain the same information or reduce it only slightly.

However, I do agree that reading your book will be of great benefit.

Thanks a lot and keep the discussion alive.
Re: [R] Discretize continous variables....
Frank E Harrell Jr [EMAIL PROTECTED] [Sat, Jul 19, 2008 at 08:03:01PM CEST]:
> But even then why condition on incomplete information when complete information is available? I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?

Because regulatory bodies demand it?

Being employed in a medical school you are certainly aware that regulatory bodies are very much into eliciting a benefit in terms of rate of subjects cured and do not believe in a treatment effect expressed as a mere shift in the parameter. (Not that this notion weren't my pet peeve; but it's there and we have to deal with it.)

-- 
Johannes Hüsing
Re: [R] Discretize continous variables....
Johannes Huesing wrote:
> Frank E Harrell Jr [EMAIL PROTECTED] [Sat, Jul 19, 2008 at 08:03:01PM CEST]:
> > But even then why condition on incomplete information when complete information is available? I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?
>
> Because regulatory bodies demand it?
>
> Being employed in a medical school you are certainly aware that regulatory bodies are very much into eliciting a benefit in terms of rate of subjects cured and do not believe in a treatment effect expressed as a mere shift in the parameter. (Not that this notion weren't my pet peeve; but it's there and we have to deal with it.)

Johannes,

It is a mistake to believe that regulatory authorities always require this just because they occasionally do. This is more in the imagination of pharmaceutical company medical staff. And how anyway does this relate to predictors in a model?

If statisticians don't stand up to this silliness who is going to?

Frank
Re: [R] Discretize continous variables....
milicic.marko wrote:
> Frank/Daniel,
>
> Thank you for the very good discussion on this. The reason I'm doing this is that it is common industrial practice to group a continuous variable (say age) into a couple of buckets while developing scorecards to be used by business people. I don't see the reason why I shouldn't discretize the variable AGE if I manage to maintain the same information or reduce it only slightly.
>
> However, I do agree that reading your book will be of great benefit.
>
> Thanks a lot and keep the discussion alive.

Thanks for your note. Categorizing age will adversely affect the scorecard.

First, since you are introducing discontinuities into the prediction model, people can game the system to exploit the discontinuity.

Second, lost information from age will have to be made up by adding another variable to the model that you might not have needed had the full age variable been adjusted for.

Third, if you chop age into enough intervals to preserve the predictive value (hard to do, especially in the outer age ranges where sample sizes do not permit cutting but where the age effect is sharp) you will find that the mean squared error of predicted values is higher than if you treated age as a continuous variable and just forced its effect to be smooth (e.g., using a regression spline).

Frank
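The first and third points above can be illustrated on simulated scorecard data. Everything here is an assumption made for the sketch: the applicant ages, the smooth age effect, the five age bands and the variable names are invented, and ns() from the splines package stands in for whatever regression spline one would actually use. It compares the jump in predicted probability across a band boundary, and the mean squared error of the fitted probabilities against the assumed truth.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

set.seed(42)
n      <- 5000
age    <- runif(n, 20, 80)            # hypothetical applicants
p_true <- plogis(1.5 - 0.06 * age)    # assumed smooth age effect on default risk
bad    <- rbinom(n, 1, p_true)

breaks   <- c(20, 30, 40, 50, 60, 80) # illustrative scorecard bands
age_band <- cut(age, breaks = breaks, include.lowest = TRUE)

fit_band   <- glm(bad ~ age_band,        family = binomial)
fit_smooth <- glm(bad ~ ns(age, df = 3), family = binomial)

# Two applicants about ten weeks apart in age, on either side of the 40 boundary
newd <- data.frame(age = c(39.9, 40.1))
newd$age_band <- cut(newd$age, breaks = breaks, include.lowest = TRUE)

jump_band   <- abs(diff(predict(fit_band,   newd, type = "response")))
jump_smooth <- abs(diff(predict(fit_smooth, newd, type = "response")))

# Mean squared error of the fitted probabilities against the assumed truth
mse_band   <- mean((fitted(fit_band)   - p_true)^2)
mse_smooth <- mean((fitted(fit_smooth) - p_true)^2)

print(c(jump_band = unname(jump_band), jump_smooth = unname(jump_smooth)))
print(c(mse_band = mse_band, mse_smooth = mse_smooth))
```

With these bands the banded model gives the two nearly identical applicants visibly different scores, while the smooth model barely separates them. Using more and narrower bands would shrink the jump at each boundary but leave fewer observations per band, which is the trade-off the third point describes.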