Re: [R] subset selection for logistic regression
Wittner, Ben wrote:

> The R packages leaps and subselect implement various methods of selecting
> best or good subsets of predictor variables for linear regression models,
> but they do not seem to be applicable to logistic regression models. Does
> anyone know of software for finding good subsets of predictor variables
> for logistic regression models? Thanks. -Ben
>
> p.s. The leaps package references Subset Selection in Regression by Alan
> Miller. On page 2 of the 2nd edition of that text it states the following:
> "All of the models which will be considered in this monograph will be
> linear; that is, they will be linear in the regression coefficients.
> Though most of the ideas and problems carry over to the fitting of
> nonlinear models and generalized linear models (particularly the fitting
> of logistic relationships), the complexity is greatly increased."

Why are these procedures still being used? Their performance is known to be
bad in almost every sense (see the r-help archives).

Frank Harrell

--
Frank E Harrell Jr
Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University
RE: [R] subset selection for logistic regression
-----Original Message-----
From: Wittner, Ben
Sent: 02 March 2005 11:33
Subject: [R] subset selection for logistic regression

> [Ben Wittner's original question, quoted in full above]

The LASSO method and the Least Angle Regression method are two such methods
that have both been implemented (efficiently, IMHO: only one least-squares
computation for all levels of shrinkage, IIRC) in the lars package for R by
Hastie and Efron. There is a paper by Madigan and Ridgeway that discusses
the use of the Least Angle Regression approach in the context of logistic
regression, available for download from Madigan's page at Rutgers:
www.stat.rutgers.edu/~madigan/PAPERS/lars3.pdf

HTH
Mike
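For concreteness, here is a minimal sketch of what a lars fit looks like.
The data are simulated purely for illustration, and note that lars itself
fits the least-squares lasso, so a binary outcome needs the kind of
adaptation Madigan and Ridgeway describe:

  ## Minimal sketch of a lasso path fit with the lars package.
  ## Simulated data: only the first two predictors carry signal.
  library(lars)

  set.seed(1)
  n <- 200; p <- 10
  x <- matrix(rnorm(n * p), n, p)
  y <- drop(x %*% c(2, -1.5, rep(0, p - 2))) + rnorm(n)

  fit <- lars(x, y, type = "lasso")   # entire coefficient path in one pass
  plot(fit)                           # coefficients vs. amount of shrinkage

  ## Coefficients at half the full L1 norm of the unpenalized fit:
  predict(fit, type = "coefficients", s = 0.5, mode = "fraction")$coefficients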
Re: [R] subset selection for logistic regression
dr mike wrote:

> The LASSO method and the Least Angle Regression method are two such
> methods that have both been implemented ... in the lars package for R by
> Hastie and Efron. There is a paper by Madigan and Ridgeway that discusses
> the use of the Least Angle Regression approach in the context of logistic
> regression ...

Yes, things like the lasso can help a lot.

--
Frank E Harrell Jr
Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University
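For readers who want the lasso applied to logistic regression directly,
rather than through the least-squares path, a minimal sketch using the
glmnet package, which postdates this thread but implements exactly this
idea (data again simulated for illustration):

  ## Minimal sketch of an L1-penalized (lasso) logistic regression fit.
  library(glmnet)

  set.seed(6)
  n <- 300; p <- 15
  x <- matrix(rnorm(n * p), n, p)
  y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # only 2 true predictors

  cvfit <- cv.glmnet(x, y, family = "binomial")  # lambda by cross-validation
  coef(cvfit, s = "lambda.min")                  # sparse coefficient vector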
RE: [R] subset selection for logistic regression
To clarify Frank's remark...

A prominent theme in statistical research over at least the last 25 years
(with roots that go back 50 or more, probably) has been the superiority of
shrinkage methods over variable selection. I also find it distressing that
these ideas have apparently not penetrated much (at all?) into the wider
scientific community, but I suppose I shouldn't be surprised: most
scientists still do one-factor-at-a-time experiments 80 years after Fisher.
Specific incarnations can be found in anything Bayesian, in mixed effects
models for repeated measures, in ridge regression, and in the R packages
lars and lasso, among others.

I would speculate that, aside from the usual statistics/science cultural
issues, part of the reason for this is that the estimators don't generally
come with neat, classical inference procedures: like it or not, many
scientists have been conditioned by their Stat 101 courses to expect
P-values, so in some sense we are hoisted by our own petard.

Just my $.02. Contrary (and more knowledgeable) opinions welcome.

-- Bert Gunter

> [Frank E Harrell Jr's reply and Ben Wittner's original question, quoted
> in full above]
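One of the shrinkage incarnations Bert mentions, ridge regression, is
available in base R's MASS package; a minimal sketch on simulated data:

  ## Minimal sketch of ridge regression via MASS::lm.ridge.
  ## Ridge shrinks every coefficient toward zero rather than keeping or
  ## dropping variables wholesale.
  library(MASS)

  set.seed(2)
  n <- 100; p <- 20
  x <- matrix(rnorm(n * p), n, p)
  y <- drop(x %*% c(3, -2, rep(0, p - 2))) + rnorm(n)
  dat <- data.frame(y = y, x)

  fit <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 50, by = 0.5))
  select(fit)   # HKB, LW, and GCV choices of the penalty lambda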
RE: [R] subset selection for logistic regression
Perhaps I should not write this because I will discredit myself with it,
but...

Suppose I have a setup with 100 variables and some 1000 cases, and I want
to boil the number of variables down to a maximum of 10 for practical
reasons, even if I lose 10% prediction quality by doing so (for example,
because it is expensive to measure all the variables on new cases). Is it
really so wrong to use a stepwise method? Let's say I divide the sample
into three parts and do variable selection on the first part, estimation on
the second, and testing on the third part (this solves almost all of the
problems Frank talks about on pp. 56-57 of his excellent book). Is there
always a tractable alternative?

Of course it is wrong to interpret the selected variables as the true
influences and all others as unrelated, but what if I don't do that? If it
should really be taboo to do stepwise variable selection, why are pp. 58-59
of Regression Modeling Strategies devoted to how to do it if you must?

Please forget my name ;-)

Christian

On Wed, 2 Mar 2005, Berton Gunter wrote:

> [Bert Gunter's remarks on shrinkage vs. variable selection, quoted in
> full above]
***
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
[EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/
From 1 April 2005: Department of Statistical Science, UCL, London
###
I recommend www.boag-online.de
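For concreteness, Christian's three-way split might look like this in base
R. A minimal sketch on simulated data, with fewer variables than his 100
purely to keep it quick, and not an endorsement given Frank's reply that
follows:

  ## Minimal sketch of select-on-1 / estimate-on-2 / test-on-3.
  set.seed(3)
  n <- 1000; p <- 20
  x <- matrix(rnorm(n * p), n, p)
  beta <- c(rep(1, 5), rep(0, p - 5))        # 5 real predictors, rest noise
  y <- rbinom(n, 1, plogis(drop(x %*% beta)))
  dat <- data.frame(y = y, x)

  part <- sample(rep(1:3, length.out = n))   # disjoint thirds

  ## 1. Variable selection on part 1 only (backward stepwise by AIC).
  full <- glm(y ~ ., data = dat[part == 1, ], family = binomial)
  sel  <- step(full, direction = "backward", trace = 0)

  ## 2. Re-estimate the selected model from scratch on part 2.
  refit <- glm(formula(sel), data = dat[part == 2, ], family = binomial)

  ## 3. Honest assessment on the untouched part 3.
  p3 <- predict(refit, newdata = dat[part == 3, ], type = "response")
  mean((p3 - dat$y[part == 3])^2)            # Brier score on the test third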
Re: [R] subset selection for logistic regression
Christian Hennig wrote:

> Suppose I have a setup with 100 variables and some 1000 cases, and I want
> to boil the number of variables down to a maximum of 10 for practical
> reasons ... Is it really so wrong to use a stepwise method?

Yes. Read about model uncertainty and bias in models developed using
stepwise methods. One exception: if there is a large number of variables
with truly zero regression coefficients, and the rest are not very weak,
stepwise can sort things out fairly well. But you never know this in
advance.

> Let's say I divide the sample into three parts and do variable selection
> on the first part, estimation on the second, and testing on the third
> part (this solves almost all of the problems Frank talks about on
> pp. 56-57 of his excellent book). Is there always a tractable
> alternative?

That's a good way to find out how bad the method is, not to fix the
problems inherent in it.

> Of course it is wrong to interpret the selected variables as the true
> influences and all others as unrelated, but what if I don't do that? If
> it should really be taboo to do stepwise variable selection, why are
> pp. 58-59 of Regression Modeling Strategies devoted to how to do it if
> you must?

Stress on "if". And note that if you ask what the optimum alpha is for
variables to be kept in the model when doing backwards stepdown, it's
alpha = 1.0. A good compromise is alpha = 0.5. See

@Article{ste01pro,
  author  = {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and
             Harrell, Frank E. and Habbema, J. Dik F.},
  title   = {Prognostic modeling with logistic regression analysis: {In}
             search of a sensible strategy in small data sets},
  journal = {Medical Decision Making},
  year    = 2001,
  volume  = 21,
  pages   = {45-56},
  annote  = {shrinkage; variable selection; dichotomization of continuous
             variables; sign of regression coefficient; calibration;
             validation}
}

And on Bert's excellent question about why shrinkage is not used more
often, here is our attempt at a remedy:

@Article{moo04pen,
  author  = {Moons, K. G. M. and Donders, A. Rogier T. and Steyerberg,
             E. W. and Harrell, F. E.},
  title   = {Penalized maximum likelihood estimation to directly adjust
             diagnostic and prognostic prediction models for overoptimism:
             a clinical example},
  journal = {J Clinical Epidemiology},
  year    = 2004,
  volume  = 57,
  pages   = {1262-1270},
  annote  = {prediction research; overoptimism; overfitting; penalization;
             bootstrapping; shrinkage}
}

Frank
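To make the alpha = 0.5 compromise concrete, a minimal sketch using
fastbw() from Frank's Design package (its successor is rms; both provide
the same functions), on simulated data:

  ## Minimal sketch of backward stepdown with alpha = 0.5.
  library(rms)   # the 2005-era package was called Design

  set.seed(5)
  n <- 200
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
  y  <- rbinom(n, 1, plogis(x1 - x2))

  fit <- lrm(y ~ x1 + x2 + x3 + x4, x = TRUE, y = TRUE)
  fastbw(fit, rule = "p", sls = 0.5)   # variables stay if P <= 0.5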
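And a minimal sketch of the penalized maximum likelihood remedy the Moons
et al. paper describes, using lrm() and pentrace() from the same package.
The data are simulated and the penalty grid is an arbitrary choice for
illustration:

  ## Minimal sketch of penalized logistic regression via pentrace().
  library(rms)

  set.seed(4)
  n <- 200
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- rbinom(n, 1, plogis(x1 - x2))

  fit <- lrm(y ~ x1 + x2 + x3, x = TRUE, y = TRUE)

  ## Pick the penalty that optimizes corrected AIC over a grid, then refit:
  pt   <- pentrace(fit, penalty = c(0.5, 1, 2, 4, 8, 16))
  pfit <- update(fit, penalty = pt$penalty)
  coef(pfit)   # coefficients shrunken toward zero, all variables retained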