Re: [R] Variable shortlisting for the logistic regression
Frank's remark was made in response to my posting. As funny as it was, it was the best thing that could have happened to me. It sparked an enlightening discussion between my committee and me (in particular, the pros and cons of stepwise versus information-theoretic approaches to model selection). Being new to the R help list, I had no idea who Frank was. I googled him (and asked around) and found very quickly that he should be taken seriously. And so should his remark.

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Rolf Turner
Sent: Thursday, October 16, 2008 1:34 PM
To: useR
Cc: r-help@r-project.org
Subject: Re: [R] Variable shortlisting for the logistic regression

On 17/10/2008, at 8:22 AM, useR wrote:
> Let's try to bring this discussion back again after Frank made a very funny remark!

Frank's remark was *serious*. Take it seriously.

cheers,

Rolf Turner

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Maximum number of pasted 'code' lines?
Michael,

Not sure if it's cool to flog other software here ... but I use UltraEdit (http://www.ultraedit.com/). Cheap ... easy to use ... and it has text-editing functions that save A LOT of time. Just my two cents.

Darin

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Michael Just
Sent: Tuesday, October 14, 2008 12:40 PM
To: Erik Iverson; [EMAIL PROTECTED]
Cc: r-help
Subject: Re: [R] Maximum number of pasted 'code' lines?

Erik, Roger, others:

Why I use Excel: the ability to concatenate and 'drag' formulas. I use it because it is what I know. Apparently, Excel is frowned upon; what should I be using? I don't know how else to create many very similar lines of R code.

Thanks again,
Michael

On Tue, Oct 14, 2008 at 1:35 PM, Erik Iverson wrote:
> Michael Just wrote:
>> Hello, I write most of my R code in Excel and then paste it into R.
> Do you actually use Excel as a text editor? Is this common? What benefits do you get by writing code in Excel?
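Incidentally, the many near-identical lines Michael drags out in Excel can be generated by R itself. A minimal sketch, assuming the goal is a batch of commands that differ only in a number (the site_*.csv file names are made up for illustration):

```r
# Fill a template with sprintf() instead of dragging a formula in Excel:
cmds <- sprintf("site_%02d <- read.csv('site_%02d.csv')", 1:10, 1:10)
cat(cmds[1:2], sep = "\n")
# site_01 <- read.csv('site_01.csv')
# site_02 <- read.csv('site_02.csv')

# Often no generated text is needed at all -- a loop reads the files directly:
# sites <- lapply(sprintf("site_%02d.csv", 1:10), read.csv)
```

Any plain text editor with macro or column support handles this kind of repetition more safely than Excel, which can silently reformat quotes and numbers.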
[R] Programming Workshops
Good morning

Does R (or R users) have a formal training workshop/facility? Or does open-source software generally put the onus on the user to learn programming techniques? Are workshops generally offered for the masses, or is one-on-one training available? Does one go to R for programming training, or do R workshops travel around the country?

Thanks for your time.

Darin Brooks
Geomatics/GIS/Remote Sensing Coordinator
Kim Forest Management Ltd.
Cranbrook Office
Cranbrook, BC
http://www.kfm.ca/
Re: [R] FW: logistic regression
I certainly appreciate your comments, Bert. It is abundantly clear that I won't be invited to any of the cocktail parties hosted by the polite circles.

I am not a statistician. I am merely a geographer (in the field of ecology) trying to develop a predictor to assist in a forestry-based decision-making process. My work in the natural world has taught me that NOTHING is predictable ... and the very idea of a bullet-proof ecological predictive model is doomed to fail.

That said, there ARE some basic predictors that assist foresters in their salvage decisions. They use these on a daily basis. The problem is that most of the evidence and modeling is anecdotal. There really are no models in the field that I am working in. And for good reason ... the natural world isn't interested in being modeled. I think we can all agree on this -- guru or not.

But even the most basic predictive model (using only the GIS/mappable data that is readily available to most users) is a starting point. The resultant dataset(s) of this potential model will be followed up and field verified. Providing this simple starting point (or catalyst, if you will) could potentially save A LOT of time and money.

What I need to do is to isolate the best available variables into a model and assign a confidence to it. It doesn't have to change everyone's world ... it just has to change the way of thinking in my small little world.

These past few days have been an education for me in the subject of stepwise regression. I approach it with much more apprehension now. So if nothing else good comes of this discussion/exercise/experience ... I've learned something.

Darin Brooks

-----Original Message-----
From: Bert Gunter [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 28, 2008 6:26 PM
To: 'David Winsemius'; 'Darin Brooks'
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [R] FW: logistic regression

The Inferno awaits me -- but I cannot resist a comment (but DO look at Frank's website).
There is a deep and disconcerting dissonance here. Scientists are (naturally) interested in getting at mechanisms, and so want to know which of the variables count and which do not. But statistical analysis -- **any** statistical analysis -- cannot tell you that. All statistical analysis can do is build models that give good predictions (and only over the range of the data). The models you get depend **both** on the way Nature works **and** on the peculiarities of your data (which is what Frank referred to in his comment on data reduction).

In fact, it is highly likely that with your data there are many alternative prediction equations, built from different collections of covariates, that perform essentially equally well. Sometimes it is otherwise, typically when prospective, carefully designed studies are performed -- there is a reason that the FDA insists on clinical trials, after all (and reasons why such studies are difficult and expensive to do!).

The belief that data mining (as it is known in the polite circles that Frank obviously eschews) is an effective (and even automated!) tool for discovering how Nature works is a misconception, but one that for many reasons is enthusiastically promoted. If you are looking only to predict, it may do; but you are deceived if you hope for Truth. Can you get hints? Well, maybe, maybe not. Chaos beckons.

I think many -- maybe even most -- statisticians rue the day that stepwise regression was invented, and certainly that it has been marketed as a tool for winnowing out the important few variables from the blizzard of irrelevant background noise. Pogo was right: we have seen the enemy -- and it is us.

(As I said, the Inferno awaits...)

Cheers to all,
Bert Gunter

DEFINITELY MY OWN OPINIONS HERE!
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of David Winsemius
Sent: Saturday, September 27, 2008 5:34 PM
To: Darin Brooks
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [R] FW: logistic regression

It's more a statement that it expresses a statistical perspective very succinctly, somewhat like a Zen koan. Frank's book, Regression Modeling Strategies, has entire chapters on reasoned approaches to your question. His website also has quite a bit of material free for the taking.

--
David Winsemius
Heritage Laboratories

On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote:
> Glad you were amused. I assume that booking this as a fortune means that this was an idiotic way to model the data?
>
> MARS? Boosted regression trees? Any of these a better choice to extract significant predictors (from a list of about 44) for a measured dependent variable?

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Ted Harding
Sent: Saturday, September 27, 2008 4:30 PM
To: [EMAIL PROTECTED]
Subject: Re: [R] FW: logistic regression

On 27-Sep-08 21:45:23, Dieter Menne wrote:
> Frank E Harrell Jr f.harrell
Re: [R] FW: logistic regression
Wow. I had no idea. I was told to be wary ... but nothing this bold.

I appreciate your straightforward advice. I will be exploring the R packages rpart, earth, and gbm. Dr. Elith has generously provided me with literature and R support in the boosted regression tree arena. I will leave stepwise logistic regression alone.

Any parting advice regarding narrowing down the variables from the unruly 44 to about 8 or 10? (In addition to your advice regarding redundancy analysis and penalized maximum likelihood estimation.)

And I visited your website, Dr. Harrell. A LOT of help there. I will also be purchasing your book this week. Wish I would have stumbled on this forum a year ago.

Thanks again.

-----Original Message-----
From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 28, 2008 8:23 PM
To: Darin Brooks
Cc: 'Bert Gunter'; r-help@r-project.org
Subject: Re: [R] FW: logistic regression

Darin Brooks wrote:
> I certainly appreciate your comments, Bert. It is abundantly clear that I won't be invited to any of the cocktail parties hosted by the polite circles.
>
> I am not a statistician. I am merely a geographer (in the field of ecology) trying to develop a predictor to assist in a forestry-based decision-making process. My work in the natural world has taught me that NOTHING is predictable ... and the very idea of a bullet-proof ecological predictive model is doomed to fail.
>
> That said, there ARE some basic predictors that assist foresters in their salvage decisions. They use these on a daily basis. The problem is that most of the evidence and modeling is anecdotal. There really are no models in the field that I am working in. And for good reason ... the natural world isn't interested in being modeled. I think we can all agree on this -- guru or not.
>
> But even the most basic predictive model (using only the GIS/mappable data that is readily available to most users) is a starting point.
> The resultant dataset(s) of this potential model will be followed up and field verified. Providing this simple starting point (or catalyst, if you will) could potentially save A LOT of time and money.
>
> What I need to do is to isolate the best available variables into a model and assign a confidence to it. It doesn't have to change everyone's world ... it just has to change the way of thinking in my small little world.
>
> These past few days have been an education for me in the subject of stepwise regression. I approach it with much more apprehension now. So if nothing else good comes of this discussion/exercise/experience ... I've learned something.
>
> Darin Brooks

Darin,

I think the point is that the confidence you can assign to the "best available variables" is zero. That is the probability that stepwise variable selection will select the correct variables. It is probably better to build a model based on the knowledge in the field you alluded to, rather than to use P-values to decide.

Frank Harrell

-----Original Message-----
From: Bert Gunter [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 28, 2008 6:26 PM
To: 'David Winsemius'; 'Darin Brooks'
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [R] FW: logistic regression

The Inferno awaits me -- but I cannot resist a comment (but DO look at Frank's website).

There is a deep and disconcerting dissonance here. Scientists are (naturally) interested in getting at mechanisms, and so want to know which of the variables count and which do not. But statistical analysis -- **any** statistical analysis -- cannot tell you that. All statistical analysis can do is build models that give good predictions (and only over the range of the data). The models you get depend **both** on the way Nature works **and** on the peculiarities of your data (which is what Frank referred to in his comment on data reduction).
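Frank's point that the confidence you can assign to the "best available variables" is essentially zero can be demonstrated in a few lines of base R. The following is only a sketch on simulated data (all names made up): rerunning stepwise selection on bootstrap resamples of the same dataset typically keeps a different subset of predictors each time.

```r
# Simulated data: 10 candidate predictors, only V1 truly related to y.
set.seed(1)
n <- 100
d <- as.data.frame(matrix(rnorm(n * 10), n, 10))
d$y <- rbinom(n, 1, plogis(d$V1))

# Backward stepwise selection on 20 bootstrap resamples; record the
# set of predictors each run keeps.
picked <- replicate(20, {
  b <- d[sample(n, replace = TRUE), ]
  fit <- step(glm(y ~ ., data = b, family = binomial), trace = 0)
  paste(sort(attr(terms(fit), "term.labels")), collapse = " + ")
})
table(picked)  # typically several distinct "selected" models
```

The spread of models in the table is exactly the instability Bert and Frank describe: each resample is a plausible alternative dataset, yet the "winning" variables change.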
[R] FW: logistic regression
BECLBL08[T.SBS mc 3]  1.402e+00  5.824e-01   2.408  0.016043 *
BECLBL08[T.SBS mk 1] -2.388e+00  7.529e-01  -3.172  0.001514 **
BECLBL08[T.SBS mw]   -1.672e+01  1.393e+03  -0.012  0.990425
BECLBL08[T.SBS vk]   -1.614e+01  1.243e+03  -0.013  0.989640
BECLBL08[T.SBS wk 1] -3.640e+00  8.174e-01  -4.453  8.48e-06 ***
BECLBL08[T.SBS wk 3] -1.838e+01  1.363e+03  -0.013  0.989240
PEM_SScat[T.B]       -1.815e+01  3.956e+03  -0.005  0.996339
PEM_SScat[T.C]        1.998e-01  3.925e-01   0.509  0.610792
PEM_SScat[T.D]       -2.314e-01  3.215e-01  -0.720  0.471621
PEM_SScat[T.E]        5.581e-01  3.433e-01   1.626  0.104020
PEM_SScat[T.F]       -1.113e+00  5.782e-01  -1.926  0.054153 .
PEM_SScat[T.G]        1.780e-01  4.420e-01   0.403  0.687150
PEM_SScat[T.H]        1.670e+01  3.956e+03   0.004  0.996633
PEM_SScat[T.I]        2.751e-01  9.313e-01   0.295  0.767705
PEM_SScat[T.J]       -2.623e-01  9.693e-01  -0.271  0.786649
PEM_SScat[T.K]       -1.862e+01  3.956e+03  -0.005  0.996244
PEM_SScat[T.L]       -1.661e+01  1.211e+03  -0.014  0.989056
SOIL_NUTR[T.C]       -1.119e+00  3.781e-01  -2.960  0.003073 **
SOIL_NUTR[T.D]       -7.912e-02  9.049e-01  -0.087  0.930320
cSEEDSRCE_SW         -1.512e-03  4.930e-04  -3.066  0.002170 **
cMSP                  1.808e-02  5.304e-03   3.409  0.000652 ***
ceFFP                 2.889e-01  4.662e-02   6.196  5.80e-10 ***
cEXT_Cold            -1.880e+00  3.330e-01  -5.647  1.63e-08 ***

There should be a PEM_SScat[T.A]. It is the most prevalent occurrence in this category. ORG_CODE is missing more than 6 categories in the list. SOIL_NUTR should have a [T.B].

Does that help?

-----Original Message-----
From: Kevin E. Thorpe [mailto:[EMAIL PROTECTED]]
Sent: Saturday, September 27, 2008 6:21 AM
To: Darin Brooks
Cc: r-help@r-project.org
Subject: Re: [R] logistic regression

Darin Brooks wrote:
> Good afternoon
>
> I have what I hope is a simple logistic regression issue. I started with 44 independent variables and then used drop1 with test = "Chisq" to reduce the list to 8 significant independent variables.
> drop1(sep22lr, test = "Chisq")
>
> and wound up with this model:
>
> Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
>        cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>
> 4 of the remaining variables are categorical and 4 are continuous. However, when I run a glm and then a summary on the glm, some of the categorical data is missing from the output. PEM_SScat is missing only one variable ... BECLBL08 is missing several variables ... ORG_CODE is missing 4 ... and SOIL_NUTR is missing 1 variable. It seems arbitrary, the number of variables missing.
>
> Is there something wrong with my syntax in calling the logistic model? Am I not understanding the inputs correctly? Any help would be appreciated.

I'm not sure I fully understand your question. It sounds like you created your own dummy variables for your categorical variables. Did you? Or did you use factor variables for your categorical variables? If the latter, then I REALLY don't understand your question.

Kevin

--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: [EMAIL PROTECTED]
Tel: 416.864.5776  Fax: 416.864.6057
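For the archives: what Darin observed is the expected behaviour of R's default treatment contrasts, not a syntax error. With factor variables, the first level of each factor is absorbed into the intercept as the reference, so summary() shows one fewer coefficient per factor (PEM_SScat[T.A] and SOIL_NUTR[T.B] are the baselines; extra levels can also vanish when they are empty or aliased in the data, which may explain ORG_CODE). A toy sketch with a made-up factor (the real variables behave the same way):

```r
# A 3-level factor yields only 2 coefficient rows: level "A" is the
# reference and is absorbed into the intercept.
set.seed(1)
d <- data.frame(soil = factor(sample(c("A", "B", "C"), 60, replace = TRUE)),
                y    = rbinom(60, 1, 0.5))
fit <- glm(y ~ soil, data = d, family = binomial)
rownames(coef(summary(fit)))  # "(Intercept)" "soilB" "soilC" -- no "soilA"
```

relevel(d$soil, ref = "C") changes which level serves as the baseline. Separately, the enormous standard errors in the posted output (around 3.9e+03) hint at near-complete separation in some sparse categories, which is a different problem from the "missing" reference levels.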
Re: [R] FW: logistic regression
Glad you were amused. I assume that booking this as a fortune means that this was an idiotic way to model the data?

MARS? Boosted regression trees? Any of these a better choice to extract significant predictors (from a list of about 44) for a measured dependent variable?

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Ted Harding
Sent: Saturday, September 27, 2008 4:30 PM
To: [EMAIL PROTECTED]
Subject: Re: [R] FW: logistic regression

On 27-Sep-08 21:45:23, Dieter Menne wrote:
> Frank E Harrell Jr f.harrell at vanderbilt.edu writes:
>> Estimates from this model (and especially standard errors and
>> P-values) will be invalid because they do not take into account the
>> stepwise procedure above that was used to torture the data until
>> they confessed.
>> Frank
>
> Please book this as a fortune.
>
> Dieter

Seconded!
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 27-Sep-08  Time: 23:30:19
-- XFMail --
[R] logistic regression
Good afternoon

I have what I hope is a simple logistic regression issue. I started with 44 independent variables and then used drop1 with test = "Chisq" to reduce the list to 8 significant independent variables:

drop1(sep22lr, test = "Chisq")

and wound up with this model:

Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
       cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold

4 of the remaining variables are categorical and 4 are continuous. However, when I run a glm and then a summary on the glm, some of the categorical data is missing from the output. PEM_SScat is missing only one variable ... BECLBL08 is missing several variables ... ORG_CODE is missing 4 ... and SOIL_NUTR is missing 1 variable. It seems arbitrary, the number of variables missing.

Is there something wrong with my syntax in calling the logistic model? Am I not understanding the inputs correctly? Any help would be appreciated.

Darin Brooks
Geomatics/GIS/Remote Sensing Coordinator
Kim Forest Management Ltd.
Cranbrook Office
Cranbrook, BC
http://www.kfm.ca/
[R] gbm error
Good afternoon

Has anyone tried using Dr. Elith's BRT script? I cannot seem to run gbm.step from the installed gbm package. Is it something external to gbm?

When I run the script itself:

gbm.step(data = model.data, gbm.x = colx:coly, gbm.y = colz,
         family = bernoulli, tree.complexity = 5,
         learning.rate = 0.01, bag.fraction = 0.5)

... I keep encountering the same error:

ERROR: unexpected ')' in bag.fraction = 0.5)

I've tried all sorts of variations, such as:

sep22BRT.lr01 <- gbm{data = sep22BRT, gbm.x = sep22BRT[,3:42],
                     gbm.y = sep22BRT[,1], family = bernoulli,
                     tree.complexity = 5, learning.rate = 0.01,
                     bag.fraction = 0.5}

and cannot find the problem. Is there a glaring error that I am overlooking?

Darin Brooks
Geomatics/GIS/Remote Sensing Coordinator
Kim Forest Management Ltd.
Cranbrook Office
Cranbrook, BC
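For anyone searching the archives later: gbm.step() is not part of the gbm package itself — it comes from the brt.functions.R script distributed with Elith and Leathwick's BRT tutorial, which has to be source()d before the call will work. The "unexpected ')'" is a parse failure: colx, coly, and colz in the tutorial are placeholders to be replaced with real column numbers, family should be the quoted string "bernoulli", and a function is called with parentheses, not braces. A sketch of a corrected call using the column positions from the second attempt (untested here, since it depends on the external script and on Darin's data):

```r
source("brt.functions.R")   # Elith & Leathwick's BRT functions (path assumed)

sep22BRT.lr01 <- gbm.step(data = sep22BRT,
                          gbm.x = 3:42,          # predictor column numbers
                          gbm.y = 1,             # response column number
                          family = "bernoulli",
                          tree.complexity = 5,
                          learning.rate = 0.01,
                          bag.fraction = 0.5)
```

Note that in the tutorial's convention gbm.x and gbm.y take column numbers (or names) within data, not extracted columns such as sep22BRT[, 3:42].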
[R] CART Analysis
Good evening

Does R have an extension/add-on package that assists in Classification and Regression Tree analysis?

Thanks for your time

Darin Brooks
Geomatics/GIS/Remote Sensing Coordinator
Kim Forest Management Ltd.
Cranbrook Office
Cranbrook, BC
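Yes. The rpart package, which ships with every standard R installation, fits CART-style classification and regression trees; tree and party are alternatives on CRAN. A minimal sketch on a built-in dataset:

```r
library(rpart)  # recursive partitioning (CART-style trees), bundled with R

# Classification tree on the built-in iris data:
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)     # text view of the splits
printcp(fit)   # complexity-parameter table used to choose a pruning level
```

prune(fit, cp = ...) then trims the tree back to the complexity chosen from the printcp() table.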
[R] R Commander question
Good afternoon

New to R ... new to the list.

I have installed R Commander 1.2-9 and it functions perfectly. I would, however, like to upgrade my Rcmdr to version 1.3-15 ... but I can't seem to shake the 1.2 version. Do you have any tips on how to upgrade?

Thanks for your time and consideration

Darin Brooks
Geomatics/GIS/Remote Sensing Coordinator
Kim Forest Management Ltd.
Cranbrook Office
Cranbrook, BC
Re: [R] R Commander question
Windows XP Pro

-----Original Message-----
From: John Kane [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 14, 2008 2:57 PM
To: r-help@r-project.org; Darin Brooks
Subject: Re: [R] R Commander question

What OS?

--- On Mon, 7/14/08, Darin Brooks [EMAIL PROTECTED] wrote:
> Good afternoon
>
> New to R ... new to the list. I have installed R Commander 1.2-9 and it functions perfectly. I would, however, like to upgrade my Rcmdr to version 1.3-15 ... but I can't seem to shake the 1.2 version. Do you have any tips on how to upgrade?
>
> Thanks for your time and consideration
>
> Darin Brooks
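To close the loop for the archives: on Windows, an old Rcmdr often survives an upgrade because the package was loaded during installation or because copies sit in more than one library directory. A sketch of the generic remedy (run in a fresh R session before Rcmdr is loaded; this is standard package management, not advice specific to this thread):

```r
# Find every installed copy -- duplicates in different libraries can
# shadow the upgraded version:
ip <- installed.packages()
ip[ip[, "Package"] == "Rcmdr", c("Version", "LibPath"), drop = FALSE]

# Remove the old copy, then install the current release:
remove.packages("Rcmdr")
install.packages("Rcmdr", dependencies = TRUE)
```

If two library paths show up, removing the stale copy from the older path (or adjusting .libPaths()) is usually what finally "shakes" the 1.2 version.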