Re: [R] subset selection for logistic regression

2005-03-02 Thread Frank E Harrell Jr
Wittner, Ben wrote:
> R packages leaps and subselect implement various methods of selecting best or
> good subsets of predictor variables for linear regression models, but they do
> not seem to be applicable to logistic regression models.
>
> Does anyone know of software for finding good subsets of predictor variables
> for logistic regression models?
>
> Thanks.
>
> -Ben

Why are these procedures still being used?  The performance is known to
be bad in almost every sense (see r-help archives).

Frank Harrell

> p.s., The leaps package references Subset Selection in Regression by Alan
> Miller. On page 2 of the 2nd edition of that text it states the following:
>
>   All of the models which will be considered in this monograph will be
>   linear; that is they will be linear in the regression coefficients. Though
>   most of the ideas and problems carry over to the fitting of nonlinear
>   models and generalized linear models (particularly the fitting of
>   logistic relationships), the complexity is greatly increased.

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


RE: [R] subset selection for logistic regression

2005-03-02 Thread dr mike
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Wittner, Ben
 Sent: 02 March 2005 11:33
 To: [EMAIL PROTECTED]
 Subject: [R] subset selection for logistic regression
 

The LASSO and Least Angle Regression are two such methods. Both have been
implemented (efficiently IMHO - only one least squares fit for all levels of
shrinkage IIRC) in the lars package for R by Hastie and Efron. There is a
paper by Madigan and Ridgeway that discusses the use of the Least Angle
Regression approach in the context of logistic regression - available for
download from Madigan's space at Rutgers:
www.stat.rutgers.edu/~madigan/PAPERS/lars3.pdf

HTH

Mike
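
To make the idea concrete, here is a minimal sketch of an L1-penalized
(lasso) logistic fit. It is written in dependency-free Python rather than R,
uses plain proximal gradient descent rather than the LARS algorithm of the
lars package or the method in the Madigan and Ridgeway paper, and runs on
simulated data. The function names (soft_threshold, lasso_logistic), the step
size and the penalty values are assumptions chosen for illustration only.

```python
import math
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def lasso_logistic(X, y, lam, step=0.5, iters=300):
    """Minimise mean log-loss + lam * sum(|beta_j|); intercept unpenalised."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        # Residuals (fitted probability minus outcome) at the current fit.
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
                for j in range(p)]
        b0 -= step * sum(resid) / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return b0, beta

# Simulated data (an assumption): only the first of four predictors matters.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(150)]
y = [1 if rng.random() < sigmoid(2.0 * row[0]) else 0 for row in X]

for lam in (0.0, 0.05, 0.2):
    _, beta = lasso_logistic(X, y, lam)
    print(lam, [round(b, 3) for b in beta])
```

As the penalty lam grows, every coefficient is pulled toward zero and some
become exactly zero, which is how the lasso combines shrinkage with selection.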



Re: [R] subset selection for logistic regression

2005-03-02 Thread Frank E Harrell Jr
dr mike wrote:
 


> The LASSO method and the Least Angle Regression method are two such that
> have both been implemented (efficiently IMHO - only one least squares for
> all levels of shrinkage IIRC) in the lars package for R of Hastie and Efron.
> There is a paper by Madigan and Ridgeway that discusses the use of the Least
> Angle Regression approach in the context of logistic regression - available
> for download from Madigan's space at Rutgers:
> www.stat.rutgers.edu/~madigan/PAPERS/lars3.pdf
>
> HTH
> Mike

Yes, things like lasso can help a lot.
--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University


RE: [R] subset selection for logistic regression

2005-03-02 Thread Berton Gunter
To clarify Frank's remark ...

A prominent theme in statistical research over at least the last 25 years
(with roots that go back 50 or more, probably) has been the superiority of
shrinkage methods over variable selection. I also find it distressing that
these ideas have apparently not penetrated much (at all?) into the wider
scientific community (but I suppose I shouldn't be surprised -- most
scientists still do one-factor-at-a-time experiments 80 years after Fisher).
Specific incarnations can be found in anything Bayesian, mixed effects
models for repeated measures, ridge regression, and the R packages lars and
lasso, among others.

I would speculate that aside from the usual statistics/science cultural
issues, part of the reason for this is that the estimators don't generally
come with neat, classical inference procedures: like it or not, many
scientists have been conditioned by their Stat 101 courses to expect P
values, so in some sense, we are hoist by our own petard.

Just my $.02 -- contrary (and more knowledgeable) opinions welcome.

-- Bert Gunter
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Frank 
 E Harrell Jr
 Sent: Wednesday, March 02, 2005 5:13 AM
 To: Wittner, Ben
 Cc: [EMAIL PROTECTED]
 Subject: Re: [R] subset selection for logistic regression
 


RE: [R] subset selection for logistic regression

2005-03-02 Thread Christian Hennig
Perhaps I should not write it because I will discredit myself with this
but...

Suppose I have a setup with 100 variables and some 1000 cases and I want to
boil down the number of variables to a maximum of 10 for practical reasons
even if I lose 10% prediction quality by this (for example because it is
expensive to measure all variables on new cases).  

Is it really so wrong to use a stepwise method?
Let's say I divide the sample into three parts and do variable selection on
the first part, estimation on the second and test on the third part (this
solves almost all problems Frank is talking about on p. 56/57 in his
excellent book). Is there always a tractable alternative? 

Of course it is wrong to interpret the selected variables as the true
influences and all others as unrelated, but if I don't do that?

If it should really be a taboo to do stepwise variable selection, why are p.
58/59 of Regression Modeling Strategies devoted to how to do it if you
must?

Please forget my name;-)

Christian


***
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
[EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/
From 1 April 2005: Department of Statistical Science, UCL, London
###
I recommend www.boag-online.de



Re: [R] subset selection for logistic regression

2005-03-02 Thread Frank E Harrell Jr
Christian Hennig wrote:
> Perhaps I should not write it because I will discredit myself with this
> but...
>
> Suppose I have a setup with 100 variables and some 1000 cases and I want to
> boil down the number of variables to a maximum of 10 for practical reasons
> even if I lose 10% prediction quality by this (for example because it is
> expensive to measure all variables on new cases).
>
> Is it really so wrong to use a stepwise method?

Yes.  Read about model uncertainty and bias in models developed using 
stepwise methods.  One exception: if there is a large number of 
variables with truly zero regression coefficients, and the rest are not 
very weak, stepwise can sort things out fairly well.  But you never know 
this in advance.

> Let's say I divide the sample into three parts and do variable selection on
> the first part, estimation on the second and test on the third part (this
> solves almost all problems Frank is talking about on p. 56/57 in his
> excellent book). Is there always a tractable alternative?

That's a good way to find out how bad the method is, not to fix the 
problems inherent in it.
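
The three-way split Christian describes can be sketched as follows (Python,
standard library only). The function name, the roughly equal part sizes and
the fixed seed are assumptions for illustration; the thread does not specify
how the sample would be divided.

```python
import random

def three_way_split(n_cases, seed=0):
    """Randomly partition case indices into three disjoint parts.

    Part 1 is used only for variable selection, part 2 only for estimating
    the coefficients of the selected model, and part 3 only for testing.
    """
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    a, b = n_cases // 3, 2 * n_cases // 3
    return idx[:a], idx[a:b], idx[b:]

select_idx, estimate_idx, test_idx = three_way_split(1000)
print(len(select_idx), len(estimate_idx), len(test_idx))  # 333 333 334
```

Because the test part plays no role in selection or estimation, it gives an
honest estimate of how the selected model performs. It measures the damage
done by selection but does nothing to repair it.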

> Of course it is wrong to interpret the selected variables as the true
> influences and all others as unrelated, but if I don't do that?
>
> If it should really be a taboo to do stepwise variable selection, why are p.
> 58/59 of Regression Modeling Strategies devoted to how to do it if you
> must?

Stress on if.  And note that if you ask what is the optimum alpha for 
variables to be kept in the model when doing backwards stepdown, it's 
alpha=1.0.  A good compromise is alpha=0.5.  See

@Article{ste01pro,
  author =  {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and
             Harrell, Frank E. and Habbema, J. Dik F.},
  title =   {Prognostic modeling with logistic regression analysis: {In}
             search of a sensible strategy in small data sets},
  journal = {Medical Decision Making},
  year =    2001,
  volume =  21,
  pages =   {45-56},
  annote =  {shrinkage; variable selection; dichotomization of continuous
             variables; sign of regression coefficient; calibration;
             validation}
}
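
As a rough sketch of what the retention threshold alpha controls in backward
stepdown (Python, standard library only): the helper names and the p-values
below are hypothetical. For brevity the toy p-value function returns fixed
values, whereas a real implementation would refit the model and recompute the
p-values after each deletion.

```python
def backward_stepdown(variables, p_values_for, alpha):
    """Drop the least significant variable until every p-value is <= alpha.

    p_values_for(current) stands in for refitting the model on `current`
    and returning a dict that maps each remaining variable to its p-value.
    """
    current = list(variables)
    while current:
        pvals = p_values_for(current)
        worst = max(current, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:
            break  # every remaining variable meets the retention threshold
        current.remove(worst)
    return current

# Hypothetical p-values, held fixed for illustration (see the note above).
TOY_P = {"x1": 0.001, "x2": 0.03, "x3": 0.20,
         "x4": 0.45, "x5": 0.70, "x6": 0.92}

def toy_p_values(current):
    return {v: TOY_P[v] for v in current}

for alpha in (0.05, 0.5, 1.0):
    print(alpha, backward_stepdown(TOY_P, toy_p_values, alpha))
```

With alpha=1.0 nothing is ever removed, since no p-value can exceed 1, so the
full model is retained. With alpha=0.5 only the variables with the weakest
evidence are dropped.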

And on Bert's excellent question about why shrinkage is not used more 
often, here is our attempt at a remedy:

@Article{moo04pen,
  author =  {Moons, K. G. M. and Donders, A. Rogier T. and
             Steyerberg, E. W. and Harrell, F. E.},
  title =   {Penalized maximum likelihood estimation to directly adjust
             diagnostic and prognostic prediction models for overoptimism:
             a clinical example},
  journal = {J Clinical Epidemiology},
  year =    2004,
  volume =  57,
  pages =   {1262-1270},
  annote =  {prediction research; overoptimism; overfitting; penalization;
             bootstrapping; shrinkage}
}
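
A minimal sketch of what penalization does to logistic regression
coefficients, assuming a simple quadratic (ridge-type) penalty fitted by
gradient descent on simulated data. This is not the procedure from the paper
above; the function name penalized_logistic, the penalty values and the data
are illustrative assumptions (Python, standard library only).

```python
import math
import random

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def penalized_logistic(X, y, penalty, step=0.5, iters=400):
    """Gradient descent on mean log-loss + (penalty / 2) * sum(beta_j ** 2).

    The intercept b0 is left unpenalized.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        b0 -= step * sum(resid) / n
        # Log-loss gradient plus the penalty term that pulls beta_j to zero.
        beta = [b - step * (sum(r * row[j] for r, row in zip(resid, X)) / n
                            + penalty * b)
                for j, b in enumerate(beta)]
    return b0, beta

# Small simulated sample (an assumption): 60 cases, 6 predictors, 2 with
# real effects, which is the kind of setting where overfitting is a concern.
rng = random.Random(7)
true_beta = [1.5, -1.0, 0.0, 0.0, 0.0, 0.0]
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y = [1 if rng.random() < sigmoid(sum(b * x for b, x in zip(true_beta, row)))
     else 0 for row in X]

for penalty in (0.0, 0.1, 1.0):
    _, beta = penalized_logistic(X, y, penalty)
    print(penalty, round(sum(b * b for b in beta), 3))
```

The printed sum of squared coefficients falls as the penalty rises: all
coefficients are shrunk toward zero and none is discarded, in contrast to
stepwise selection.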

Frank
