
Re: [R] Discretize continous variables....

2008-07-20 Thread Johannes Huesing

Frank E Harrell Jr [EMAIL PROTECTED] [Sun, Jul 20, 2008 at 12:20:28AM CEST]:
> Johannes Huesing wrote:
> > Because regulatory bodies demand it?
[...]
> And how anyway does this
> relate to predictors in a model?

Not at all; you're correct. I was mixing the topic of this discussion
up with another kind of silliness.

I did, though, have a discussion with a biometrician at a pharmaceutical
company who argued that when you have only one d.f. to spend on a
variable, it is better to dichotomise it at a clinically meaningful
point than to include it as a linear term. He kept the discussion to
laboratory measurements like sodium, where a deviation from the normal
range is very significant (unlike, say, cholesterol, where the value
has a gradual interpretation). He has a point there, but in general the
reason for sacrificing information is a mixture of laziness, a
preference for presenting data in tables, and a wish to keep the
modelling consistent with the tables (for instance, to assign an odds
ratio to each cell).
-- 
Johannes Hüsing
mailto:[EMAIL PROTECTED]
http://derwisch.wikidot.com

There is something fascinating about science. One gets such wholesale
returns of conjecture from such a trifling investment of fact.
(Mark Twain, Life on the Mississippi)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Discretize continous variables....

2008-07-20 Thread Frank E Harrell Jr


Nice points.  I think the desire to be able to present things in tables 
is a major reason.


The biometrician's idea that a piecewise flat line with one jump will 
fit a dataset better than a linear effect is quite a leap in logic.  If 
I only have one d.f. to spend I'll take linear any day, but better to 
spend a little more and fit a smooth nonlinear relationship.  A coherent 
approach is to shrink the fit down to the effective number of parameters 
the dataset will support estimating.
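
A minimal simulation sketch of the one-d.f. comparison, under the stated assumption that the true effect is smooth and monotone on the logit scale. The variable names, the cutpoint at zero, the sample size, and the use of base glm() with splines::ns() instead of lrm() are all illustrative choices, not anything taken from this thread.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(1)
n <- 5000
x <- rnorm(n)                      # hypothetical standardized lab value
y <- rbinom(n, 1, plogis(-1 + x))  # assumed smooth (linear-logit) truth

fit_cut    <- glm(y ~ I(x > 0),      family = binomial)  # 1 d.f., one jump
fit_linear <- glm(y ~ x,             family = binomial)  # 1 d.f., linear
fit_spline <- glm(y ~ ns(x, df = 3), family = binomial)  # 3 d.f., smooth

# Lower AIC indicates a better fit after penalizing for parameters
aic <- c(cut = AIC(fit_cut), linear = AIC(fit_linear), spline = AIC(fit_spline))
print(round(aic, 1))
```

If the truth really had a step at the cutpoint the ranking could differ, so this illustrates the argument rather than proving it.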


There is no clinical laboratory measure that has a jump discontinuity in 
its effect on mortality or other patient outcomes.  The fact that 
reference ranges exist (which are based only on supposedly normal 
subjects and don't relate to the risk of an outcome) doesn't mean we 
should use them in formulating independent or dependent variables.


It is common but distorted logic to want to make an odds ratio in a 
model be comparable to one in a table from which regression coefficients 
were just anti-logged (so that 1-unit changes could be used).  The 
tabled odds ratio is a kind of crude population averaged odds ratio that 
may not apply to a single subject in the study.


My book has many examples where laboratory measurements are related to 
risk using restricted cubic splines.


Frank


--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University


Re: [R] Discretize continous variables....

2008-07-19 Thread Frank E Harrell Jr


milicic.marko wrote:

> Hi R helpers,
>
> I'm preparing a dataset to fit a logistic regression model with lrm(). I
> have various continuous and discrete variables and I would like to:
>
> 1. Optimally discretize continuous variables ("optimally" meaning
> maximizing information value, IV, for example)


This will result in effects in the model that cannot be interpreted and 
will ruin the statistical inference from the lrm.  It will also hurt 
predictive discrimination.  You seem to be allergic to continuous variables.



> 2. Regroup discrete variables to achieve perhaps a smaller number of
> levels and better information value...


If you use the Y variable to do this the same problems will result. 
Shrinkage is a better approach, or using marginal frequencies to combine 
levels.  See the pre-specification of complexity strategy in my book 
Regression Modeling Strategies.
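
A small sketch of combining levels by marginal frequency only, that is, without looking at Y. The helper name combine_rare_levels(), the threshold min_n, and the example data are hypothetical and chosen for illustration.

```r
# Pool every level with fewer than `min_n` observations into one level.
# Only the marginal counts of the factor are used; the outcome is never seen.
combine_rare_levels <- function(f, min_n = 20, other = "other") {
  f <- as.factor(f)
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  levels(f)[levels(f) %in% rare] <- other  # duplicate level names are merged
  f
}

# Hypothetical categorical predictor with two sparse levels
region <- factor(rep(c("north", "south", "east", "west", "island"),
                     times = c(120, 90, 60, 8, 5)))
pooled <- combine_rare_levels(region, min_n = 20)
print(table(pooled))
```

The threshold should be fixed before looking at the response, in keeping with the pre-specification idea above.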


Frank




> Please suggest whether there is a package providing this or similar
> functionality for discretization...
>
> If there is no package, please suggest how to achieve this.
>
> Many thanks, helpers.


Re: [R] Discretize continous variables....

2008-07-19 Thread Daniel Malter


This time I agree with Rolf Turner. This sounds like homework. Whether or
not it is, type

?ifelse

at the R prompt.
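
For the mechanics only, a short sketch of ifelse() for two groups and the related base function cut() for more than two. The ages, cutpoints, and labels are made up for illustration, and this shows how to do it, not whether it is advisable for modelling.

```r
age <- c(30, 64, 65, 80)

# Two groups with ifelse(); the cutpoint 65 is purely illustrative
age2 <- ifelse(age >= 65, "65+", "<65")

# More than two groups with cut(); right = FALSE makes intervals [a, b)
age3 <- cut(age, breaks = c(-Inf, 40, 65, Inf),
            labels = c("<40", "40-64", "65+"), right = FALSE)

print(data.frame(age, age2, age3))
```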

Frank is right: it leads to a loss of information. However, I think it
remains interpretable. Further, it is common practice in certain fields, and
it may be a reasonable way to check whether mostly outliers in the X drive
your results (although other approaches are available for that as well). The
main underlying question, however, should be: do you have reason to expect
that the response differs by the groups you create rather than along the
values of the continuous variable?

Regarding question 2: I thought you meant that you want to reduce the number
of levels (say 4) to a smaller number (say 2) for one of your independent
variables (i.e. one of the Xs), not Y. This makes sense only if there is a
good conceptual reason to group these categories, not just to get
significance.

Best,
Daniel







Re: [R] Discretize continous variables....

2008-07-19 Thread Frank E Harrell Jr


Daniel Malter wrote:

> This time I agree with Rolf Turner. This sounds like homework. Whether or
> not, type
>
> ?ifelse
>
> in the R-prompt.
>
> Frank is right, it leads to a loss in information. However, I think it
> remains interpretable. Further, it is common practice in certain fields, and


I have to disagree.  It is easy to show that odds ratios so obtained are 
functions of the entire distribution of the predictor in question.  Thus 
they do not estimate a scientific quantity (something that can be 
interpreted out of context).  For example if age is cut at 65 and one 
were to add to the sample several subjects aged 100, the >=65 : <65 odds 
ratio would change even if the age effect did not.
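
The age example can be checked numerically. The sketch below assumes a logistic model with a made-up intercept and slope, and works with expected risks, so there is no sampling noise; cut_or() is a hypothetical helper written for this illustration.

```r
# Odds ratio for age >= cutpoint vs. age < cutpoint, computed from the
# expected risks under an assumed logistic model (no sampling noise)
cut_or <- function(age, cutpoint = 65, intercept = -5, slope = 0.06) {
  risk <- plogis(intercept + slope * age)  # Pr(Y = 1 | age) for each subject
  p_hi <- mean(risk[age >= cutpoint])
  p_lo <- mean(risk[age <  cutpoint])
  (p_hi / (1 - p_hi)) / (p_lo / (1 - p_lo))
}

age <- seq(40, 90, by = 5)
or_before <- cut_or(age)
or_after  <- cut_or(c(age, rep(100, 5)))  # add five subjects aged 100

print(c(before = or_before, after = or_after))
print(exp(0.06))  # per-year odds ratio from the model, same in both samples
```

The dichotomized odds ratio moves when the age-100 subjects are added, while the per-year odds ratio, which is what the age effect actually is in this assumed model, stays the same.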



> it maybe a reasonable way to check whether mostly outliers in the X drive
> your results (although other approaches are available for that as well). The
> main underlying question however should be, do you have reason to expect
> that the response is different by the groups you create rather than in the
> numbers of the continuous variable.


Regression splines can help.  Sometimes the splines are stated in terms 
of the cube root of the predictor to avoid excess influence.
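
A minimal sketch of a spline in the cube root of a skewed predictor. It uses base glm() with splines::ns() as a stand-in (rcs() in Frank's modelling package plays a similar role with lrm(), but is not used here), and the simulated data, degrees of freedom, and variable names are assumptions made for illustration.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(2)
n <- 2000
x <- rlnorm(n, meanlog = 3, sdlog = 1)        # hypothetical right-skewed predictor
y <- rbinom(n, 1, plogis(-3 + 0.8 * log(x)))  # assumed smooth truth

# Natural cubic spline in the cube root of x: extreme x values are pulled in,
# so they have less leverage on the fitted curve
fit <- glm(y ~ ns(x^(1/3), df = 3), family = binomial)

# Predicted risk at selected quantiles of x on the original scale
grid <- data.frame(x = quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95)))
grid$risk <- predict(fit, newdata = grid, type = "response")
print(grid)
```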


Frank




Re: [R] Discretize continous variables....

2008-07-19 Thread Daniel Malter


True. Thanks for the clarification. Do you conclude from this that the
findings in such a case should be interpreted only in their specific context
(with the awareness that they do not apply to other contexts), or that
such an approach should not be taken at all?




Re: [R] Discretize continous variables....

2008-07-19 Thread Frank E Harrell Jr


Daniel Malter wrote:

> True. Thanks for the clarification. Is your conclusion from that that the
> findings in such case should only be interpreted in the specific context
> (with the awareness that it does not apply to changing contexts) or that
> such an approach should not be taken at all?


The latter, in general;  in specific cases the former.  But even then 
why condition on incomplete information when complete information is 
available?  I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?
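
To spell out the distinction, assume X is continuous with density f and write c for the cutpoint (both symbols are introduced here for illustration only). The law of total probability then gives:

```latex
\Pr(Y = 1 \mid X > c)
  = \frac{\int_{c}^{\infty} \Pr(Y = 1 \mid X = x)\, f(x)\, dx}
         {\int_{c}^{\infty} f(x)\, dx}
```

The left-hand side averages the individual risks over everyone above the cutpoint, so it depends on the distribution f of the predictor and not only on the risk function Pr(Y=1 | X=x) itself.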


Frank





Re: [R] Discretize continous variables....

2008-07-19 Thread milicic.marko

Frank/Daniel,

Thank you for a very good discussion on this.

The reason I'm doing this is that it is common industry practice
to group a continuous variable (say age) into a couple of buckets when
developing scorecards to be used by business people. I don't see why
I shouldn't discretize the variable AGE if I manage to retain the
same information or reduce it only slightly.

However, I do agree that reading your book will be of great benefit.


Thanks a lot, and keep the discussion alive.






Re: [R] Discretize continous variables....

2008-07-19 Thread Johannes Huesing

Frank E Harrell Jr [EMAIL PROTECTED] [Sat, Jul 19, 2008 at 08:03:01PM CEST]:
> But even then
> why condition on incomplete information when complete information is
> available?  I.e., why compute Pr(Y=1 | X>x) in place of Pr(Y=1 | X=x)?

Because regulatory bodies demand it? Being employed in a medical school
you are certainly aware that regulatory bodies are very much into eliciting
a benefit in terms of rate of subjects cured and do not believe in
a treatment effect expressed as a mere shift in the parameter.

(Not that this notion weren't my pet peeve; but it's there and we have to 
deal with it.)



Re: [R] Discretize continous variables....

2008-07-19 Thread Frank E Harrell Jr


Johannes,

It is a mistake to believe that regulatory authorities always require 
this just because they occasionally do.  This is more in the imagination 
of pharmaceutical company medical staff.  And how anyway does this 
relate to predictors in a model?


If statisticians don't stand up to this silliness who is going to?

Frank



Re: [R] Discretize continous variables....

2008-07-19 Thread Frank E Harrell Jr


Thanks for your note.  Categorizing age will adversely affect the 
scorecard.  First, since you are introducing discontinuities into the 
prediction model, people can game the system to exploit the 
discontinuity.  Second, lost information from age will have to be made 
up by adding another variable to the model that you might not have 
needed had the full age variable been adjusted for.  Third, if you chop 
age into enough intervals to preserve the predictive value (hard to do 
especially in the outer age ranges where sample sizes do not permit 
cutting but where the age effect is sharp) you will find that the mean 
squared error of predicted values is higher than if you treated age as a 
continuous variable and just forced its effect to be smooth (e.g., using 
a regression spline).
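
A simulation sketch of the first and third points, under an assumed smooth true age effect that steepens at older ages. The bucket boundaries, spline degrees of freedom, sample size, and the use of base glm() with splines::ns() are illustrative assumptions, not part of any actual scorecard.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(3)
n <- 5000
age <- runif(n, 20, 80)
true_risk <- function(a) plogis(-3 + 0.0012 * (a - 20)^2)  # assumed smooth truth
y <- rbinom(n, 1, true_risk(age))
d <- data.frame(age, y)

# Five 12-year buckets: [20,32], (32,44], (44,56], (56,68], (68,80]
bin_age <- function(a) cut(a, breaks = seq(20, 80, by = 12), include.lowest = TRUE)

fit_bins   <- glm(y ~ bin_age(age),    family = binomial, data = d)  # bucketed
fit_spline <- glm(y ~ ns(age, df = 4), family = binomial, data = d)  # smooth

# Mean squared error of predicted risk against the known true risk
mse <- c(bins   = mean((fitted(fit_bins)   - true_risk(age))^2),
         spline = mean((fitted(fit_spline) - true_risk(age))^2))
print(signif(mse, 3))

# Predicted risk just below and just above the bucket boundary at age 44
edge <- data.frame(age = c(43.9, 44.1))
jump <- c(bins   = unname(diff(predict(fit_bins,   edge, type = "response"))),
          spline = unname(diff(predict(fit_spline, edge, type = "response"))))
print(signif(jump, 3))
```

The `jump` values show the discontinuity that could be gamed: two applicants a few weeks apart in age get visibly different bucketed scores, while the spline prediction barely moves.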


Frank







