[R] logistic regression
Greetings,

I am working on a logistic regression model in R, which is a relatively new program for me, and I am struggling with the code. In searching Google for 'logistic regression diagnostics' I came across Elizabeth Brown's Lecture 14 from her Winter 2004 Biostatistics 515 course (http://courses.washington.edu/b515/l14.pdf). I found most of the code very helpful, but I am struggling with the lines that calculate the observed and expected values in the 10 groups created by the cut function. I get error messages when trying to create the E and O matrices: R won't accept the assignment using fitc==j, and it won't calculate the sum. I am wondering whether someone might be able to offer me some assistance; my search of the archives was not fruitful.

Here is the code that I adapted from the lecture notes:

fit <- fitted(glm.lyme)
fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1))
t <- table(fitc)
fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1), labels = F)
t <- table(fitc)

# Calculate observed and expected values in each group
E <- matrix(0, nrow = 10, ncol = 2)
O <- matrix(0, nrow = 10, ncol = 2)
for (j in 1:10) {
  E[j, 2] <- sum(fit[fitc == j])
  E[j, 1] <- sum((1 - fit)[fitc == j])
  O[j, 2] <- sum(pcdata$lymdis[fitc == j])
  O[j, 1] <- sum((1 - pcdata$lymdis)[fitc == j])
}

Here is the error message:

Error in Summary.factor(..., na.rm = na.rm) : sum not meaningful for factors

I understand what it means; I just can't figure out how to get around it, or how to get the output printed in table form. Thank you in advance for any assistance.

Mary Sullivan
Re: [R] logistic regression
Mary,

The 10-group approach results in a low-resolution and fairly arbitrary calibration curve. It is also the basis of the original Hosmer-Lemeshow goodness-of-fit statistic, which has been superseded by the Hosmer et al. single-degree-of-freedom GOF test that does not require any binning. The Design package handles both. Do ?calibrate.lrm, ?residuals.lrm, ?lrm for details.

Frank Harrell

Sullivan, Mary M wrote:
> Greetings, I am working on a logistic regression model in R and I am
> struggling with the code [...]

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
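A minimal sketch of the Design-based workflow Frank describes. The model call is an assumption (Mary's post does not show the formula behind glm.lyme, so `lymdis ~ .` on pcdata is a placeholder):

library(Design)  # provides lrm(), calibrate(), residuals.lrm()

# refit with lrm, keeping design matrix and response for resampling
f <- lrm(lymdis ~ ., data = pcdata, x = TRUE, y = TRUE)

# bootstrap-overfitting-corrected, smooth (unbinned) calibration curve
cal <- calibrate(f, B = 200)
plot(cal)

# single-d.f. unbinned goodness-of-fit test via residuals.lrm
resid(f, type = "gof")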
Re: [R] logistic regression
Maybe try making sure the data is numeric:

fac.to.num <- function(x) as.numeric(as.character(x))

On 26-Jul-07, at 9:34 AM, Sullivan, Mary M wrote:
> Greetings, I am working on a logistic regression model in R and I am
> struggling with the code [...]

--
Mike Lawrence
Graduate Student, Department of Psychology, Dalhousie University
Website: http://memetic.ca

The road to wisdom? Well, it's plain and simple to express: Err and err and err again, but less and less and less.
- Piet Hein
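Putting the two replies together, a sketch of the fix, assuming (as the error message implies) that pcdata$lymdis is stored as a factor with levels "0" and "1":

# convert the factor outcome to numeric 0/1 before summing
lymdis.num <- as.numeric(as.character(pcdata$lymdis))

E <- matrix(0, nrow = 10, ncol = 2)
O <- matrix(0, nrow = 10, ncol = 2)
for (j in 1:10) {
  E[j, 2] <- sum(fit[fitc == j])         # expected events in group j
  E[j, 1] <- sum((1 - fit)[fitc == j])   # expected non-events
  O[j, 2] <- sum(lymdis.num[fitc == j])        # observed events
  O[j, 1] <- sum((1 - lymdis.num)[fitc == j])  # observed non-events
}

# print observed and expected side by side in table form
out <- cbind(O, E)
colnames(out) <- c("obs.0", "obs.1", "exp.0", "exp.1")
out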
Re: [R] logistic regression and dummy variable coding
Bingshan Li wrote:
> Hi All, Now it works. Thanks for all your answers and the explanations
> are very clear. Bingshan

But note that you are not using R correctly unless you are doing a simulation and have some special speed issues. Let the model functions do all this for you.

Frank

On Jun 28, 2007, at 7:44 PM, Seyed Reza Jafarzadeh wrote:
> NewVar <- relevel(factor(OldVar), ref = "b") should create a dummy
> variable, and change the reference category for the model. [...]

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] logistic regression and dummy variable coding
Hi Frank,

I do not quite get you. What do you mean by simulation and speed issues? I do not see why they have to be considered in logistic regression. Thanks.

Bingshan

From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED]]
Sent: Fri 6/29/2007 7:40 AM
To: Li, Bingshan
Cc: Seyed Reza Jafarzadeh; r-help@stat.math.ethz.ch
Subject: Re: [R] logistic regression and dummy variable coding

> But note that you are not using R correctly unless you are doing a
> simulation and have some special speed issues. Let the model functions
> do all this for you. Frank [...]
Re: [R] logistic regression and dummy variable coding
Li, Bingshan wrote:
> Hi Frank, I do not quite get you. What do you mean by simulation and
> speed issues? I do not see why they have to be considered in logistic
> regression.

Exactly. So don't use techniques that are only needed when such issues do have to be considered.

> From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED]]
> Subject: Re: [R] logistic regression and dummy variable coding
> But note that you are not using R correctly unless you are doing a
> simulation and have some special speed issues. Let the model functions
> do all this for you. Frank [...]

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
[R] logistic regression and dummy variable coding
Hello everyone,

I have a variable with several categories and I want to convert this into dummy variables and do logistic regression on it. I used model.matrix to create dummy variables, but it always picked the smallest one as the reference. For example,

model.matrix(~ ., data = as.data.frame(letters[1:5]))

will code 'a' as '0 0 0 0'. But I want to code another category as reference, say 'b'. How do I do it in R using model.matrix? Is there another way to do it if model.matrix has no such functionality?

Thanks!
Re: [R] logistic regression and dummy variable coding
On Thu, 2007-06-28 at 18:16 -0500, Bingshan Li wrote:
> Hello everyone, I have a variable with several categories and I want to
> convert this into dummy variables and do logistic regression on it. [...]

See ?relevel.

Note that this (creating dummy variables) will be done automatically in R's modeling functions, which default to treatment contrasts on factors. model.matrix() is used internally by model functions such as glm(). For example, using a single factor:

> FL <- factor(letters[1:5])
> FL
[1] a b c d e
Levels: a b c d e

> contrasts(FL)
  b c d e
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 0 0 1 0
e 0 0 0 1

> FL.b <- relevel(FL, "b")
> FL.b
[1] a b c d e
Levels: b a c d e

> contrasts(FL.b)
  a c d e
b 0 0 0 0
a 1 0 0 0
c 0 1 0 0
d 0 0 1 0
e 0 0 0 1

See ?contrasts and the Statistical Models section in An Introduction to R.

HTH,

Marc Schwartz
Re: [R] logistic regression and dummy variable coding
NewVar <- relevel(factor(OldVar), ref = "b")

should create a dummy variable, and change the reference category for the model.

Reza

On 6/28/07, Bingshan Li [EMAIL PROTECTED] wrote:
> Hello everyone, I have a variable with several categories and I want to
> convert this into dummy variables and do logistic regression on it. [...]
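A quick sketch of how the releveled factor changes the fitted parameterization. The data here are made up purely for illustration (the toy outcome is unrelated to the predictor; only the choice of reference level matters):

set.seed(1)
OldVar <- sample(letters[1:3], 100, replace = TRUE)  # categories a, b, c
y <- rbinom(100, 1, 0.4)                             # toy binary outcome

# default: 'a' (first level) is the reference category
fit.a <- glm(y ~ factor(OldVar), family = binomial)

# after releveling, 'b' is the reference, so the coefficients are
# log-odds ratios of a and c versus b
NewVar <- relevel(factor(OldVar), ref = "b")
fit.b <- glm(y ~ NewVar, family = binomial)

coef(fit.a)
coef(fit.b)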
Re: [R] logistic regression and dummy variable coding
Hi All,

Now it works. Thanks for all your answers; the explanations are very clear.

Bingshan

On Jun 28, 2007, at 7:44 PM, Seyed Reza Jafarzadeh wrote:
> NewVar <- relevel(factor(OldVar), ref = "b") should create a dummy
> variable, and change the reference category for the model. Reza [...]
[R] R logistic regression - comparison with SPSS
Dear R-list members,

I have been a user of SPSS for a few years and am quite new to R. I read the documentation and tried samples, but I have some problems obtaining results for a logistic regression under R. The following SPSS script:

LOGISTIC REGRESSION vir
 /METHOD = FSTEP(LR) d007 d008 d009 d010 d011 d012 d013 d014 d015 d016
   d017 d018 d069 d072 d073
 /SAVE = PRED COOK SRESID
 /CLASSPLOT
 /PRINT = GOODFIT CI(95)
 /CRITERIA = PIN(.10) POUT(.10) ITERATE(40) CUT(.5) .

predicts vir (value 0 or 1) from my parameters d007 to d073. It gives me the parameters to retain in the logistic equation and the intercept. The calculation is made from a set of about 1,000 cases. I have been unable to translate it with success to R, and I would like to check whether I can obtain the same results as with SPSS. Can someone help me translate it to R? I would be most grateful.

I thank you. Best regards.

--
Alain Reymond
CEIA
Bd Saint-Michel 119
1040 Bruxelles
Tel: +32 2 736 04 58
Fax: +32 2 736 58 02
PGPId : 0xEFB06E2E
Re: [R] R logistic regression - comparison with SPSS
Alain Reymond wrote:
> Dear R-list members, I have been a user of SPSS for a few years and am
> quite new to R. [...] Can someone help me translate it to R?

If all the variables you mention are available in a data frame, e.g. virdf, then you can fit a logistic regression model by

mymodel <- glm(vir ~ d007 + d008 + d009 + d010 + d011 + d012 + d013 +
               d014 + d015 + d016 + d017 + d018 + d069 + d072 + d073,
               data = virdf, family = binomial)

or

mymodel <- glm(vir ~ ., data = virdf, family = binomial)

if there are no variables other than those mentioned above in the virdf data frame.

Contrary to SPSS, you need not specify in advance what you would like as output. Everything useful is stored in the model object (here: mymodel), which can then be used to further investigate the model in many ways:

summary(mymodel)
anova(mymodel, test = "Chisq")
plot(mymodel)

See ?summary.glm, ?anova.glm etc. For stepwise variable selection (not necessarily corresponding to FSTEP(LR)), see ?step, or ?add1 to do it `by hand'.

HTH,
Tobias

P.S. You can find an introduction to R specifically targeted at (SAS and) SPSS users here: http://oit.utk.edu/scc/RforSASSPSSusers.pdf

--
Tobias Verbeke - Consultant
Business & Decision Benelux
Rue de la révolution 8
1000 Brussels - BELGIUM
+32 499 36 33 15
[EMAIL PROTECTED]
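To mimic SPSS's forward stepwise selection more closely, a sketch using step() with an explicit scope. Note this is only an approximation of FSTEP(LR): step() selects on AIC rather than the likelihood-ratio PIN/POUT criteria, so the retained variables may differ:

# start from the intercept-only model and allow forward steps
null.model <- glm(vir ~ 1, data = virdf, family = binomial)
full.scope <- vir ~ d007 + d008 + d009 + d010 + d011 + d012 + d013 +
              d014 + d015 + d016 + d017 + d018 + d069 + d072 + d073

fwd <- step(null.model, scope = full.scope, direction = "forward")
summary(fwd)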
[R] Logistic Regression Question: Risk Threshold
Hi,

I am working on problem 2 of Chapter 8 in Data Analysis and Graphics Using R and don't know how to approach the second half of the question:

In the data set head.injury (an artificial one of 3121 patients, similar to a subset of the data analyzed in Stiell et al., 2001), obtain a logistic regression model relating clinically.important.brain.injury to other variables. Patients whose risk is sufficiently high will be sent for CT (computed tomography). Using a risk threshold of 0.025 (2.5%), turn the result into a decision rule for use of CT.

This is what I have so far:

> names(head.injury)
 [1] "age.65"                            "amnesia.before"
 [3] "basal.skull.fracture"              "GCS.decrease"
 [5] "GCS.13"                            "GCS.15.2hours"
 [7] "high.risk"                         "loss.of.consciousness"
 [9] "open.skull.fracture"               "vomiting"
[11] "clinically.important.brain.injury"

> attach(head.injury)
> head.glm <- glm(clinically.important.brain.injury ~ ., family = binomial,
+                 data = head.injury)
> summary(head.glm)

Call:
glm(formula = clinically.important.brain.injury ~ ., family = binomial,
    data = head.injury)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2774  -0.3511  -0.2095  -0.1489   3.0028

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            -4.4972     0.1629 -27.611  < 2e-16 ***
age.65                  1.3734     0.1827   7.518 5.56e-14 ***
amnesia.before          0.6893     0.1725   3.996 6.45e-05 ***
basal.skull.fracture    1.9620     0.2064   9.504  < 2e-16 ***
GCS.decrease           -0.2688     0.3680  -0.730 0.465152
GCS.13                  1.0613     0.2820   3.764 0.000168 ***
GCS.15.2hours           1.9408     0.1663  11.669  < 2e-16 ***
high.risk               1.1115     0.1591   6.984 2.86e-12 ***
loss.of.consciousness   0.9554     0.1959   4.877 1.08e-06 ***
open.skull.fracture     0.6304     0.3151   2.001 0.045424 *
vomiting                1.2334     0.1961   6.290 3.17e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1741.6  on 3120  degrees of freedom
Residual deviance: 1201.3  on 3110  degrees of freedom
AIC: 1223.3

Number of Fisher Scoring iterations: 6

How do I assess which patients have a high risk level, and how does the risk threshold play into that?

Thanks in advance,
Diana
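One way to see how the threshold enters, as a sketch: the fitted probabilities are each patient's estimated risk, and the decision rule is simply "send for CT when that risk is at least 0.025":

risk <- predict(head.glm, type = "response")  # estimated P(injury) per patient

# decision rule: CT scan whenever estimated risk >= 2.5%
send.for.CT <- risk >= 0.025

# how the rule performs against the observed outcomes
table(send.for.CT, head.injury$clinically.important.brain.injury)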
[R] Logistic Regression
Dear R-Helpers,

I want to perform a logistic regression on my dataset (y). I used the following code:

logistic <- glm(formula = interest_variable ~ ., family = binomial(link = logit),
                data = y)

This ran correctly. Then I wanted to develop the logistic regression with three different methods:

- forward
- backward
- stepwise

I used these procedures:

forward <- step(logistic, direction = "forward")
backward <- step(logistic, direction = "backward")
stepwise <- step(logistic, direction = "both")

These also ran correctly, but I obtained the same results with the three different procedures, so I thought I had made some mistakes. My questions are: Is what I did correct? Is it correct that three different methods return the same results? If I made some mistakes, what is the correct way to perform the three different logistic regressions?

Thank you in advance.

Sergio Della Franca.
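A plausible explanation, offered as an assumption since Sergio's output is not shown: step() starting from the full model has nothing to add in the forward direction, so the forward search just returns the starting model, and if no single term's removal lowers the AIC, backward and both do too. Forward selection is normally started from the null model with an explicit scope:

null <- glm(interest_variable ~ 1, family = binomial(link = logit), data = y)

# forward: start empty, allow terms up to the full model
forward  <- step(null, scope = formula(logistic), direction = "forward")

# backward: start from the full model (as before)
backward <- step(logistic, direction = "backward")

# stepwise: start empty but allow moves in both directions
stepwise <- step(null, scope = formula(logistic), direction = "both")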
[R] logistic regression
Dear All,

I would like to fit, and know the R2 of, the following presence/absence data:

x <- 1:10
y <- c(0,0,0,0,0,0,0,1,1,1)

I tried to use clogit (survival package) but it didn't work. Any idea?

miltinho
Re: [R] logistic regression
On 15-Mar-07 17:03:50, Milton Cezar Ribeiro wrote:
> Dear All, I would like to fit, and know the R2 of, the following
> presence/absence data:
> x <- 1:10
> y <- c(0,0,0,0,0,0,0,1,1,1)
> I tried to use clogit (survival package) but it didn't work. Any idea?
> miltinho

You are trying to fit the equation

  P[y = 1 ; x] = exp((x-a)/b) / (1 + exp((x-a)/b))

to the data

  x = 1 2 3 4 5 6 7 8 9 10
  y = 0 0 0 0 0 0 0 1 1 1

by what amounts to a maximum-likelihood method, i.e. one which chooses the parameter values to maximize the probability of the observed values of y (given the values of x).

The maximum probability possible is 1, so if you can find parameters which make P[y = 1] = 0 for x = 1, 2, ..., 7 and P[y = 1] = 1 for x = 8, 9, 10, then you have done it. This will be approximated as closely as you please for any value of a between 7 and 8, and sufficiently small values of b, since for such parameter values P[y = 1 ; x] -> 0 for x < a, and -> 1 for x > a.

You therefore have a solution which is both indeterminate (any a such that 7 < a < 8) and singular (b -> 0). So it will defeat standard estimation methods. That is the source of your problem.

In a more general context, this is an instance of the linear separation problem in logistic regression (and similar methods, such as probit analysis). Basically, this situation implies that, according to the data, there is a perfect prediction for the results. There is no well-defined way of dealing with it; any approach starts from the proposition "this perfect prediction is not a reasonable result in the context of my data", and continues by following up what you think should be meant by "not a reasonable result". What this is likely to mean would be on the lines of "b should not be that small", which then imposes upon you the need to be more specific about how small b may reasonably be. Then carry on from there (perhaps by fixing the value of b at different reasonable levels, and simply fitting a for each value of b).

Hoping this helps ... but I'm wondering how it happens that you have such data ... ??

Best wishes,
Ted.

E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 15-Mar-07  Time: 19:38:51
-- XFMail --
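To see the separation in action, a small sketch: glm will drive the slope toward infinity (b -> 0 on Ted's (x-a)/b scale) and warn that fitted probabilities of 0 or 1 occurred:

x <- 1:10
y <- c(0,0,0,0,0,0,0,1,1,1)

fit <- glm(y ~ x, family = binomial)
# Warning: fitted probabilities numerically 0 or 1 occurred

coef(fit)     # huge slope, i.e. 1/b blowing up
fitted(fit)   # essentially 0 for x <= 7 and 1 for x >= 8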
Re: [R] logistic regression TRY LOGISTF
If Ted is right, then one work-around is to use Firth's method of penalized log-likelihood. The technique was originally intended to reduce small-sample bias, but it has now been extended to deal with complete and quasi-separation problems.

I believe the library is called logistf, but I haven't had a chance to try it. I know the SAS version (called the fl macro) works fine.

Reference -- http://www.meduniwien.ac.at/user/georg.heinze/techreps/tr2_2004.pdf

Hope this helps,
Jeff Miller
University of Florida
AlphaPoint05, Inc.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Ted Harding
Sent: Thursday, March 15, 2007 2:39 PM
To: R-help
Subject: Re: [R] logistic regression

> You are trying to fit the equation
>   P[y = 1 ; x] = exp((x-a)/b) / (1 + exp((x-a)/b))
> [...] You therefore have a solution which is both indeterminate (any a
> such that 7 < a < 8) and singular (b -> 0). So it will defeat standard
> estimation methods. That is the source of your problem. [...]
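A minimal sketch of the logistf route, assuming the logistf package is installed; it fits Firth's bias-reduced logistic regression, which yields finite estimates even under complete separation:

library(logistf)

x <- 1:10
y <- c(0,0,0,0,0,0,0,1,1,1)

fit <- logistf(y ~ x)
summary(fit)   # finite coefficients with penalized-likelihood CIs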
[R] Logistic regression for drugs Interactions
I have the model below, for which I run a logistic regression including the interaction term (NSAID * Diuretic):

fit1 <- glm(resp ~ nsaid + diuretic + I(nsaid * diuretic), family = binomial,
            data = w)

NSAID  Diuretic  Present  Absent
0      0             185    6527
0      1              53    1444
1      0              42    1293
1      1              25     253

                   Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -3.56335     0.07456  -47.794   < 2e-16 ***
NSAID               0.13630     0.17361    0.785   0.43242
Diuretic            0.25847     0.15849    1.631   0.10293
I(NSAID*Diuretic)   0.85407     0.30603    2.791   0.00526 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The odds ratio is exp(0.85407) = 2.35 times higher when NSAID is present in addition to Diuretic.

The odds ratio of Nausea when on Diuretic is exp(0.25847) = 1.29, and the odds ratio of Nausea when on NSAID is exp(0.13630) = 1.14. Normally, when we want the odds ratio of Nausea when a patient is on both drugs, we multiply 1.29 * 1.14 = 1.48 (is this correct? do we multiply or do we add?). But since the interaction term is significant, we take that into account. Does that mean that the odds ratio for both drugs is exp(0.25847) * exp(0.13630) * exp(0.85407) = 3.486297? Or do we use additions?

Thanks.

--
View this message in context: http://www.nabble.com/Logistic-regression-for-drugs-Interactions-tf3404506.html#a9482343
Sent from the R help mailing list archive at Nabble.com.
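For reference, the arithmetic on the log-odds scale, using the estimates above: coefficients add, so odds ratios multiply, and both ways of writing it give the same number:

b.nsaid    <- 0.13630
b.diuretic <- 0.25847
b.inter    <- 0.85407

# without the interaction term, the both-drugs OR would just be the
# product of the two main-effect ORs
exp(b.nsaid + b.diuretic)             # = exp(b.nsaid) * exp(b.diuretic) ~ 1.48

# with the interaction included, the OR for both drugs versus neither is
exp(b.nsaid + b.diuretic + b.inter)   # ~ 3.49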
Re: [R] logistic regression on contingency table
Bingshan Li <bli1 at bcm.tmc.edu> writes:
> I am wondering if there is a way in R to fit logistic regression on a
> contingency table. If I have original data, I can transform the data
> into a design matrix and then call glm to fit the regression. But now I
> have a 2x3 contingency table with the first row for response 0 and the
> second row for response 1, and the columns are 3 levels of the predictor
> variable. The 3 levels are not ordinal, though, and indicator variables
> would be more appropriate.

From the documentation of glm:

"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures."

Dieter Menne
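A sketch of that two-column-matrix form for a 2x3 table like Bingshan's; the counts here are made up for illustration (rows are response 0 and 1, columns the three unordered predictor levels):

counts <- matrix(c(10,  4,
                   20, 12,
                    8, 15),
                 nrow = 2,
                 dimnames = list(resp = c("0", "1"),
                                 level = c("A", "B", "C")))

level     <- factor(c("A", "B", "C"))  # unordered, so treatment contrasts apply
successes <- counts["1", ]             # response-1 counts per level
failures  <- counts["0", ]             # response-0 counts per level

fit <- glm(cbind(successes, failures) ~ level, family = binomial)
summary(fit)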
Re: [R] logistic regression on contingency table
On Mon, 2007-03-05 at 15:31 +0000, Dieter Menne wrote:
> Bingshan Li <bli1 at bcm.tmc.edu> writes:
> > I am wondering if there is a way in R to fit logistic regression on a
> > contingency table. [...]
> From the documentation of glm: "For binomial and quasibinomial families
> the response can also be specified as a factor ... or as a two-column
> matrix with the columns giving the numbers of successes and failures."
> Dieter Menne

Just to expand on Dieter's comments, one trick to taking this approach is to coerce the contingency table you are starting with to a data frame, and then specify a 'weights' argument to glm().

Taking some dummy example data in a 2D contingency table:

> TAB
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So we have X (IV) and Y (Response). Now, coerce TAB to a data frame. See ?as.data.frame.table and ?xtabs, which reverses the process back to a contingency table:

> DFT <- as.data.frame(TAB)
> DFT
  Y X Freq
1 0 A    1
2 1 A    3
3 0 B    9
4 1 B    3
5 0 C    2
6 1 C    2

As an FYI, this gets us back to 'TAB':

> xtabs(Freq ~ ., DFT)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

Now create the model, using 'Freq' for the case weights:

> fit <- glm(Y ~ X, weights = Freq, data = DFT, family = binomial)
> summary(fit)

Call:
glm(formula = Y ~ X, family = binomial, data = DFT, weights = Freq)

Deviance Residuals:
     1       2       3       4       5       6
-1.665   1.314  -2.276   2.884  -1.665   1.665

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 5  degrees of freedom
Residual deviance: 23.540  on 3  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

An alternative to the above is to use a function that I posted in the past week for another query, called expand.dft():

expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # replicate each row of the table-as-data-frame 'Freq' times
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)
  DF <- subset(do.call(rbind, DF), select = -Freq)

  # restore column types after the round trip through character
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }

  DF
}

This takes the data frame table 'DFT' from above and converts it back to the raw observations:

> DF <- expand.dft(DFT)
> DF
    Y X
1   0 A
2   1 A
2.1 1 A
2.2 1 A
3   0 B
3.1 0 B
3.2 0 B
3.3 0 B
3.4 0 B
3.5 0 B
3.6 0 B
3.7 0 B
3.8 0 B
4   1 B
4.1 1 B
4.2 1 B
5   0 C
5.1 0 C
6   1 C
6.1 1 C

As an FYI, this gets us back to 'TAB':

> table(DF)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So now we can use the normal approach for glm():

> fit2 <- glm(Y ~ X, data = DF, family = binomial)
> summary(fit2)

Call:
glm(formula = Y ~ X, family = binomial, data = DF)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6651  -0.7585  -0.7585   0.8632   1.6651

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 19  degrees of freedom
Residual deviance: 23.540  on 17  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

Note of course that the degrees of freedom differ, though the null and residual deviances and the AIC are the same. Taking the latter approach will of course enable you to make subsequent manipulations on the raw data if you wish.

HTH,

Marc Schwartz
[R] logistic regression on contingency table
Hi folks,

I am wondering if there is a way in R to fit logistic regression on a contingency table. If I have original data, I can transform the data into a design matrix and then call glm to fit the regression. But now I have a 2x3 contingency table with the first row for response 0 and the second row for response 1, and the columns are 3 levels of the predictor variable. The 3 levels are not ordinal, though, and indicator variables would be more appropriate.

Thanks a lot!

Bingshan
[R] Logistic regression model + precision/recall
Hi,

I am using the logistic regression model lrm (Design).

Right now I was using the Area Under the Curve (AUC) for testing my model. But now I have to calculate the precision/recall of the model on test cases. For lrm, precision and recall would simply be defined with the help of the two terms below:

True Positive (TP) - number of test cases of class 1 that are given probability >= 0.5.
False Positive (FP) - number of test cases of class 0 that are given probability >= 0.5.

Precision = TP / (TP + FP)
Recall = TP / (number of positive samples in the test data)

Any help is appreciated. I can write long code with for loops and all, but is there any in-built function, or just a few commands, that would do the task?

regards,
Nitin
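No loops are needed; a sketch using vectorized comparisons. The names f, testdata, and cy are placeholders (the post does not show the actual fit or data):

probs <- predict(f, newdata = testdata, type = "fitted")  # f is an lrm fit
truth <- testdata$cy                                      # 0/1 test labels

pred.pos <- probs >= 0.5
TP <- sum(pred.pos & truth == 1)   # class 1 predicted positive
FP <- sum(pred.pos & truth == 0)   # class 0 predicted positive

precision <- TP / (TP + FP)
recall    <- TP / sum(truth == 1)
c(precision = precision, recall = recall)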
Re: [R] Logistic regression model + precision/recall
nitin jindal wrote:
> Hi, I am using the logistic regression model lrm (Design). Right now I
> was using the Area Under the Curve (AUC) for testing my model. [...]
> True Positive (TP) - number of test cases of class 1 that are given
> probability >= 0.5.

Why 0.5?

> Precision = TP / (TP + FP)
> Recall = TP / (number of positive samples in the test data)

Those are improper scoring rules that can be tricked. If the outcome is rare (say 0.02 incidence) you could just predict that no one will have the outcome and be correct 0.98 of the time. I suggest validating the model for discrimination (e.g., AUC) and calibration.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] Logistic regression model + precision/recall
On 1/24/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
> Why 0.5?

The probability has to be adjusted based on some hit and trial. I just mentioned it as an example.

> Those are improper scoring rules that can be tricked. If the outcome is
> rare (say 0.02 incidence) you could just predict that no one will have
> the outcome and be correct 0.98 of the time. I suggest validating the
> model for discrimination (e.g., AUC) and calibration.

I just have to calculate precision/recall for the rare outcome. If the positive outcome is rare (say 0.02 incidence) and I predict it to be negative all the time, my recall would be 0, which is bad. So precision and recall can take care of skewed data.
Re: [R] Logistic regression model + precision/recall
nitin jindal wrote:
> The probability has to be adjusted based on some hit and trial. I just
> mentioned it as an example.

Using a cutoff is not a good idea unless the utility (loss) function is discontinuous and is the same for every subject (in the medical field utilities are almost never constant). And if you are using the data to find the cutoff, this will require bootstrapping to penalize for the cutoff not being pre-specified.

> I just have to calculate precision/recall for the rare outcome. If the
> positive outcome is rare (say 0.02 incidence) and I predict it to be
> negative all the time, my recall would be 0, which is bad. So precision
> and recall can take care of skewed data.

No, that is not clear. The overall classification error would only be 0.02 in that case. It is true, though, that one of the two conditional probabilities would not be good.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] Logistic regression model + precision/recall
On 1/24/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
> Using a cutoff is not a good idea unless the utility (loss) function is
> discontinuous and is the same for every subject [...] And if you are
> using the data to find the cutoff, this will require bootstrapping to
> penalize for the cutoff not being pre-specified.

Thanks for this info. If I still have to use a cutoff, I will do bootstrapping. I don't know any alternative for computing precision/recall for a logistic regression model.

> No, that is not clear. The overall classification error would only be
> 0.02 in that case. It is true, though, that one of the two conditional
> probabilities would not be good.

I forgot to mention that for my data the overall classification error is not of concern; I am only interested in precision/recall for the rare outcome.

nitin
Re: [R] Logistic regression model + precision/recall
Hi. Thanks a lot. I will try that.

nitin

On 1/24/07, Tobias Sing [EMAIL PROTECTED] wrote:
> Maybe ROCR might help you. You can visualize the precision/recall
> trade-off across the range of all cutoffs. Assuming your numerical
> predictions are in 'scores' and the true class labels are in 'classes':
>
> library(ROCR)
> pred <- prediction(scores, classes)
> perf <- performance(pred, 'rec', 'prec')
> plot(perf)
>
> HTH,
> Tobias
>
> --
> Tobias Sing
> Computational Biology and Applied Algorithmics
> Max Planck Institute for Informatics
> Saarbrucken, Germany
> Phone: +49 681 9325 315
> Fax: +49 681 9325 399
> http://www.tobiassing.net
Re: [R] logistic regression model + Cross-Validation
Why not use lda {MASS}? It has a CV=TRUE option, though it does leave-one-out. Or use randomForest. If you have to use lrm, then the following code might help:

n.fold <- 5     # 5-fold cv
n.sample <- 50  # assumed 50 samples
s <- sample(1:n.fold, size = n.sample, replace = TRUE)
for (i in 1:n.fold) {
  # create your training data and validation data for each fold
  trn <- YOURWHOLEDATAFRAME[s != i, ]
  val <- YOURWHOLEDATAFRAME[s == i, ]
  # now do your own modeling using lrm, e.g. (placeholder formula):
  # f <- lrm(cy ~ x1 + x2, data = trn)
  # p <- predict(f, newdata = val, type = "fitted")  # per-record probabilities
}

HTH,
weiwei

On 1/21/07, nitin jindal [EMAIL PROTECTED] wrote:
> If validate.lrm does not have this option, does any other function have
> it? I will certainly look into your advice on cross validation. [...]

--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?" "No, I did not. But I believed..."
---Matrix III
[R] logistic regression model + Cross-Validation
Hi,

I am trying to cross-validate a logistic regression model. I am using the logistic regression model (lrm) of package Design.

f <- lrm(cy ~ x1 + x2, x = TRUE, y = TRUE)
val <- validate.lrm(f, method = "cross", B = 5)

My class cy has values 0 and 1. The val variable will give me indicators like slope and AUC. But I also need the vector of predicted values of the class variable cy for each record during cross-validation, so that I can manually look at the results. So, is there any way to get those probabilities assigned to each class?

regards,
Nitin
Re: [R] logistic regression model + Cross-Validation
nitin jindal wrote:
> Hi, I am trying to cross-validate a logistic regression model. I am
> using the logistic regression model (lrm) of package Design.
>
> f <- lrm(cy ~ x1 + x2, x = TRUE, y = TRUE)
> val <- validate.lrm(f, method = "cross", B = 5)

val <- validate(f, ...)   # .lrm not needed

> My class cy has values 0 and 1. [...] is there any way to get those
> probabilities assigned to each class?

No, validate.lrm does not have that option. Manually looking at the results will not be easy when you do enough cross-validations. A single 5-fold cross-validation does not provide accurate estimates. Either use the bootstrap or repeat k-fold cross-validation between 20 and 50 times. k is often 10, but the optimum value may not be 10. Code for averaging repeated cross-validations is in http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf along with simulations of the bootstrap vs. a few cross-validation methods for binary logistic models.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] logistic regression model + Cross-Validation
If validate.lrm does not have this option, does any other function have it? I will certainly look into your advice on cross-validation. Thanks.

nitin

On 1/21/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
> No, validate.lrm does not have that option. Manually looking at the
> results will not be easy when you do enough cross-validations. A single
> 5-fold cross-validation does not provide accurate estimates. Either use
> the bootstrap or repeat k-fold cross-validation between 20 and 50 times.
> [...] Code for averaging repeated cross-validations is in
> http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/logistic.val.pdf
> along with simulations of the bootstrap vs. a few cross-validation
> methods for binary logistic models.
Re: [R] logistic regression model + Cross-Validation
nitin jindal wrote:
> If validate.lrm does not have this option, does any other function have
> it? I will certainly look into your advice on cross-validation. Thanks.

Not that I know of, but it is easy to program.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
[R] logistic regression packages
Hi All:

I'm testing a set of data classification algorithms from this paper (www.stat.wisc.edu/~loh/treeprogs/quest1.7/mach1317.pdf). I couldn't find these algorithms in R packages:

1. LOG: polytomous logistic regression (there was one in the MASS library: multinom. But after I updated the MASS library, multinom was lost.)
2. POL: the POLYCLASS algorithm. There is an S-Plus package (polyclass library) for this algorithm, so there should be a corresponding package in R, but I haven't found it so far.

Any advice is appreciated.

Best,

Feng
Re: [R] logistic regression packages
1. multinom is in the nnet package
2. There is a polyclass function in package polspline

On 10/01/07, Feng Qiu [EMAIL PROTECTED] wrote: Hi All: I'm testing a set of data classification algorithms from this paper ...

--
David Barron
Said Business School, University of Oxford, Park End Street, Oxford OX1 1HP
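[Editorial note: a sketch of both suggestions. The data frame dat, factor response cls and predictors x1, x2 are hypothetical names, and the class coding expected by polyclass should be checked against its help page:]

library(nnet)       # multinom lives here, not in MASS
fit.log <- multinom(cls ~ x1 + x2, data = dat)   # polytomous logistic regression

library(polspline)  # POLYCLASS
# polyclass takes integer class codes and a covariate matrix; see
# ?polyclass for the exact coding it requires
fit.pol <- polyclass(as.numeric(dat$cls), as.matrix(dat[, c("x1", "x2")]))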
Re: [R] logistic regression packages
Hi David: Thanks for your information. Two further questions:

1. I found out that multinom is not doing polytomous logistic regression; do you know which function does this?
2. The polyclass in polspline does polychotomous regression, while I'm looking for polytomous regression. Do you think these two are similar in terms of prediction?

Best, Feng

- Original Message - From: David Barron [EMAIL PROTECTED] To: r-help r-help@stat.math.ethz.ch Sent: Wednesday, January 10, 2007 12:14 PM Subject: Re: [R] logistic regression packages

1. multinom is in the nnet package
2. There is a polyclass function in package polspline
[R] logistic regression in R - changing defaults
Hello, I was hoping for some advice on changing 2 defaults in a logistic regression.

1. It looks like the first category is the reference category? In the following syntax 'where' has 4 levels; how can I make the reference category the third category?

model <- glm(cbind(sucesses, failures) ~ where + who + firstep + dxnarrow + age + sex + medyear,
             family = binomial, data = life.use)
model

2. Is it possible to round results to 4 decimal points? If so, what syntax is required?

Any assistance is appreciated, Bob Green
Re: [R] logistic regression in R - changing defaults
an option is to use ?relevel(), e.g.,

life.use$where. <- relevel(life.use$where, 3)
model <- glm(cbind(sucesses, failures) ~ where. + who + firstep + dxnarrow + age + sex + medyear,
             family = binomial, data = life.use)

print(model, digits = 1)
print(model, digits = 2)
print(model, digits = 3)
print(model, digits = 4)

I hope it helps.

Best, Dimitris

--
Dimitris Rizopoulos, Ph.D. Student, Biostatistical Centre, School of Public Health, Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899, Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/ http://www.student.kuleuven.be/~m0390867/dimitris.htm
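[Editorial note: a caveat on question 2: the digits argument to print() controls significant digits rather than decimal places. For exactly 4 decimal places, one sketch using the model object above is to round the extracted pieces:]

round(coef(model), 4)                  # coefficients to 4 decimal places
round(summary(model)$coefficients, 4)  # the full coefficient table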
[R] logistic regression with a sample missing subjects with a value of an independent variable
Dear R-help, I am trying to run a logistic regression analysis using the R function glm, with the family parameter set to binomial, in order to use a logistic regression model. I have 70 samples. The dependent variable has two levels (0 and 1) and one of the independent variables also has two levels (0 and 1). The variables associate in the way shown in the table:

               Dependent
                 0    1
Independent 0   55   10
            1    0    5

This gives a strong association evaluated by the Fisher test (p-value = 0.0002481), but the logistic regression gives a p-value of 0.99 with very high values of the estimate and standard error (-19.27 and 1769.26, respectively). Is there any way (another function, a different setting of a parameter) to perform logistic regression analysis with these data in R? Thank you.

Gabriele Stocco
University of Trieste
Re: [R] logistic regression with a sample missing subjects with a value of an independent variable
On Sat, 21 Oct 2006, Gabriele Stocco wrote: Dear R-help, I am trying to run a logistic regression analysis using the R function glm ... but the logistic regression gives a p-value of 0.99 with very high values of the estimate and standard error (-19.27 and 1769.26, respectively).

Please see the comment at the bottom of this message, as your claims are not supported by any code.

Is there any way (another function, a different setting of a parameter) to perform logistic regression analysis with these data in R?

fit <- glm(matrix(c(55,0,10,5), 2, 2) ~ factor(c(0,1)), binomial())
fit0 <- glm(matrix(c(55,0,10,5), 2, 2) ~ 1, binomial())
anova(fit0, fit, test="Chisq")

  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         1     16.929
2         0  2.208e-10  1   16.929 3.880e-05

is a reasonable way to do this. Beware the Hauck-Donner phenomenon (see e.g. MASS, the book) for t-tests of coefficients, although I do not get the values you quote. Since the expected values are low, you should not take the p value too seriously.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595
[R] logistic regression enquiry
I am hoping for some advice regarding the following scenario. I have data from 26 studies that I wanted to analyse using logistic regression. Given the data was from studies and not individuals, I was unsure how I would perform this in R.

When analysed in SPSS, weighting was used so that each study was included twice. Where use occurred, a value of 1 was assigned and was weighted by the value for the variable positive. In the second instance, where use did not occur, a value of 0 was assigned and was weighted by the variable negative. Taking the first study below, use = 1 would be weighted by 28 and use = 0 would be weighted by 117 (the total study N = 145). If this data is scrambled in transmission I can send a csv version.

My questions:
1. is change required to the data format, or is modification of the usual code required?
2. what is the best R package for performing logistic regression?

Any assistance is much appreciated, regards, Bob Green

In SPSS, each study was included twice and weighted, though I am unsure how I would format the data in R:

medyear where who dxbroad firstep standardage sex positive negative
89 3 2 1 0 0 31.571 28 117.00
98 2 2 1 0 1 48.062 15 72.00
98 4 1 1 0 0 45.261 42 57.00
89 3 0 1 0 1 28.763 19 48.00
99 2 2 1 0 1 34.773 27 73.00
88 3 0 1 0 1 30.658 26 57.00
94 1 1 1 0 1 36.381 70 124.00
96 3 1 1 0 1 40.057 27 40.00
96 2 2 1 0 1 33.164 9 41.00
88 2 0 1 1 0 29.547 30 202.00
98 1 2 0 0 1 39.360 246 734.00
97 4 0 0 0 1 38.467 17 85.00
92 3 0 1 0 1 34.367 15 127.00
88 2 0 1 0 1 46 9 90.00
85 3 0 1 0 1 30.364 58 87.00
94 3 0 1 0 1 38.847 47 126.00
88 3 0 1 0 1 33.854 25 134.00
92 3 0 1 1 1 67 157.00
90 3 0 1 1 1 26.052 17 101.00
90 3 0 1 0 0 39 32.00
90 2 0 1 0 1 36.138 10 173.00
90 2 0 1 0 1 38.953 64 383.00
97 2 0 1 0 1 31.561 12 52.00
99 1 1 1 0 1 25 56.00
100 4 1 1 0 1 45.062 46 270.00
101 2 0 1 0 1 32.4100 33 92.00
Re: [R] logistic regression enquiry
Dear Bob,

If I follow this properly, this is just a binomial logistic regression where, for each of your 26 observations, you have a certain number of successes (positive instances) and failures (negative ones). Fitting a binomial logit model is easily accommodated by the glm() function. Probably the simplest approach is to prepare one observation for each study which has the number of successes and failures for that study, and put these together as the columns of a two-column matrix on the left-hand side of the model; on the right-hand side you'd have the usual specification for a model formula. Thus, something like

model <- glm(cbind(successes, failures) ~ ..., family = binomial, data = Dataset)

(where the ... represents the RHS of your model) should do the trick.

I hope this helps, John

On Sun, 04 Jun 2006 20:39:29 +1000 Bob Green [EMAIL PROTECTED] wrote: I am hoping for some advice regarding the following scenario. I have data from 26 studies that I wanted to analyse using logistic regression. ...

John Fox
Department of Sociology, McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/
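[Editorial note: applied to Bob's study-level table, John's recipe looks roughly like this. A sketch only: the file name is hypothetical, it assumes the table has been saved with the column names shown above, and the covariates on the right-hand side are merely illustrative:]

studies <- read.csv("studies.csv")   # hypothetical file holding the posted table
model <- glm(cbind(positive, negative) ~ medyear + where + who,
             family = binomial, data = studies)
summary(model)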
Re: [R] Logistic Regression - Results?
On Wed, 24 May 2006, Wojciech Gryc wrote: Hi, I use SPSS at work and have R installed both at work and on my home machine. I've been trying to do some logistic regressions in R and SPSS, but the results I'm getting are different. I've followed a few R tutorials, and with most of them, I get the following instructions:

result <- glm(z ~ x + y, family = binomial(logit))

In the case above, with three variables (z being dependent). In SPSS, I'm told to use Analyze -> Regression -> Binary Logistic, where I put x, y in Covariates and z in Dependent. Note that my values for x and y are either 1 or 0. The results I get from these two tests are different, however, and I was wondering why. Am I choosing the wrong commands? If not, why are the results different? Any help would be greatly appreciated, and please note that I have a limited amount of stats knowledge.

If you show us an example and both outputs we may be able to help. `the results I'm getting are different' covers too many possibilities. (Please see the posting guide for how to prepare a question like this.)

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595
Re: [R] Logistic Regression - Results?
Are you sure you are using the same contrasts in SPSS? You could have supplied us with your SPSS syntax.

Marwan

--
Marwan Khawaja http://staff.aub.edu.lb/~mk36

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Wojciech Gryc Sent: Thursday, May 25, 2006 12:41 AM To: r-help@stat.math.ethz.ch Subject: [R] Logistic Regression - Results?

Hi, I use SPSS at work and have R installed both at work and on my home machine. I've been trying to do some logistic regressions in R and SPSS, but the results I'm getting are different. ...
[R] Logistic Regression - Results?
Hi, I use SPSS at work and have R installed both at work and on my home machine. I've been trying to do some logistic regressions in R and SPSS, but the results I'm getting are different. I've followed a few R tutorials, and with most of them, I get the following instructions:

result <- glm(z ~ x + y, family = binomial(logit))

In the case above, with three variables (z being dependent). In SPSS, I'm told to use Analyze -> Regression -> Binary Logistic, where I put x, y in Covariates and z in Dependent. Note that my values for x and y are either 1 or 0.

The results I get from these two tests are different, however, and I was wondering why. Am I choosing the wrong commands? If not, why are the results different? Any help would be greatly appreciated, and please note that I have a limited amount of stats knowledge.

Thanks, Wojciech

--
Five Minutes to Midnight: Youth on human rights and current affairs
http://www.fiveminutestomidnight.org/
[R] logistic regression with interactive parameters
Hello, I need to know more about logistic regression with interactive parameters. Could you provide documents, links, etc.?

kind regards

--
Ahmet Temiz, Geological Eng.
General Directorate of Disaster Affairs, Earthquake Research Department
Phone: +90 (312) 287 89 51 or (312) 287 26 80/1547, Fax: +90 (312) 287 89 51
E-mail: [EMAIL PROTECTED]
www.deprem.gov.tr
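[Editorial note: if by "interactive parameters" Ahmet means interaction terms, R's formula interface expresses them directly. A minimal sketch with hypothetical names (data frame dat, binary outcome y, predictors a and b):]

fit <- glm(y ~ a * b, family = binomial, data = dat)  # expands to a + b + a:b
summary(fit)   # the a:b row is the interaction parameter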
Re: [R] logistic regression model with non-integer weights
Michael Dewey wrote: At 17:12 09/04/06, Ramón Casero Cañas wrote: I am not sure what the problem you really want to solve is, but it seems that a) abnormality is rare, b) the logistic regression predicts it to be rare. If you want a prediction system, why not try different cut-offs (other than 0.5 on the probability scale) and perhaps plot sensitivity and specificity to help to choose a cut-off?

Thanks for your suggestions, Michael. It took me some time to figure out how to do this in R (as trivial as it may be for others). Some comments about what I've done follow, in case anyone is interested.

The problem is a) abnormality is rare (prevalence = 14%) and b) there is not much difference in the independent variable between abnormal and normal cases. So the logistic regression model predicts that P(abnormal) <= 0.4. I got confused by this, as I expected a cut-off point of P = 0.5 to decide between normal/abnormal. But you are right in that another cut-off point can be chosen. For a cut-off of e.g. P(abnormal) = 0.15, sensitivity = 65% and specificity = 52%. They are pretty bad, although for clinical purposes I would say that positive/negative predictive values are more interesting. But then PPV = 19% and NPV = 90%, which isn't great.

As an overall test of how good the model is for classification I have computed the area under the ROC, following your suggestion of using sensitivity and specificity. I couldn't find how to do this directly in R, so I implemented it myself (it's not difficult, but I'm new here). I tried the package ROCR, but apparently it doesn't cover binary outcomes. The area under the ROC is 0.64, so I would say that even though the model seems to fit the data, it just doesn't allow acceptable discrimination, no matter what the cut-off point.

I have also studied the effect of low prevalence. For this, I used the ran.gen option of the boot function (package boot) to define a function that resamples the data so that it balances abnormal and normal cases. A logistic regression model is fitted to each replicate, in a parametric bootstrap, and thus I compute the bias of the estimates of the model coefficients, beta0 and beta1. This shows very small bias for beta1, but a rather large bias for beta0. So I would say that prevalence has an effect on beta0 but not beta1. This is good, because a common measure like the odds ratio depends only on beta1.

Cheers,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
Re: [R] logistic regression model with non-integer weights
On Sun, 2006-04-16 at 19:10 +0100, Ramón Casero Cañas wrote: As an overall test of how good the model is for classification I have computed the area under the ROC ... I couldn't find how to do this directly in R, so I implemented it myself. ...

The Epi package has the function ROC that draws the ROC curve and computes the AUC, among other things.

Rick B.
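[Editorial note: a minimal sketch of that route, using the variable names from Ramón's model; the formula interface shown here follows the Epi documentation, and the exact arguments should be checked against ?ROC:]

library(Epi)
# draws the ROC curve and reports the area under it
ROC(form = ABN ~ TR, plot = "ROC")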
Re: [R] logistic regression model with non-integer weights
Ramón Casero Cañas wrote: I got confused by this, as I expected a cut-off point of P = 0.5 to decide between normal/abnormal. But you are right in that another cut-off point can be chosen. ...

This makes me think you are trying to go against maximum likelihood to optimize an improper criterion. Forcing a single cutpoint to be chosen seems to be at the heart of your problem. There's nothing wrong with using probabilities and letting the utility possessor make the final decision.

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] logistic regression model with non-integer weights
Frank E Harrell Jr wrote: This makes me think you are trying to go against maximum likelihood to optimize an improper criterion. Forcing a single cutpoint to be chosen seems to be at the heart of your problem. There's nothing wrong with using probabilities and letting the utility possessor make the final decision.

I agree, and in fact I was thinking along those lines, but I also needed a way of evaluating how good the model is at discriminating between abnormal and normal cases, as opposed to e.g. GOF. The only way I know of is using the area under the ROC (thus setting cut-off points), which also followed neatly from Michael Dewey's comments. Any alternatives would be welcome :)

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
Re: [R] logistic regression model with non-integer weights
Ramón Casero Cañas wrote: ... I also needed a way of evaluating how good the model is at discriminating between abnormal and normal cases, as opposed to e.g. GOF. The only way I know of is using the area under the ROC (thus setting cut-off points) ... Any alternatives would be welcome :)

To get the ROC area you don't need to do any of that, and as you indicated, it is a good discrimination measure. The lrm function in the Design package gives it to you automatically (the C index), and you can also get it with the Hmisc package's somers2 and rcorr.cens functions. The ROC area is highly related to the Wilcoxon 2-sample test statistic for comparing cases and non-cases.

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
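[Editorial note: a sketch of both routes Frank mentions, assuming fitted probabilities phat and the logical outcome ABN from the thread:]

library(Hmisc)
somers2(phat, as.numeric(ABN))   # "C" is the area under the ROC curve

# the same area via the Wilcoxon/rank identity Frank refers to
r  <- rank(phat)
n1 <- sum(ABN); n0 <- sum(!ABN)
(sum(r[ABN]) - n1 * (n1 + 1) / 2) / (n1 * n0)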
Re: [R] logistic regression model with non-integer weights
At 17:12 09/04/06, Ramón Casero Cañas wrote:

I have not seen a reply to this so far; apologies if I missed something.

When fitting a logistic regression model using weights I get the following warning:

data.model.w <- glm(ABN ~ TR, family = binomial(logit), weights = WEIGHT)
Warning message: non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

Details follow ***

I have a binary dependent variable of abnormality ABN = T, F, T, T, F, F, F... and a continuous predictor TR = 1.962752 1.871123 1.893543 1.685001 2.121500, ... As the number of abnormal cases (ABN==T) is only 14%, and there is large overlap between abnormal and normal cases, the logistic regression found by glm is always much closer to the normal cases than to the abnormal cases. In particular, the probability of abnormal is at most 0.4.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607     0.7196   1.057   0.2905
TR2          -1.4853     0.4328  -3.432   0.0006 ***
---

I would like to compensate for the fact that the a priori probability of abnormal cases is so low. I have created a weight vector

I am not sure what the problem you really want to solve is, but it seems that a) abnormality is rare, b) the logistic regression predicts it to be rare. If you want a prediction system, why not try different cut-offs (other than 0.5 on the probability scale) and perhaps plot sensitivity and specificity to help to choose a cut-off?

WEIGHT <- ABN
WEIGHT[ABN == TRUE]  <- 1 / na / 2
WEIGHT[ABN == FALSE] <- 1 / nn / 2

so that all weights add up to 1, where ``na'' is the number of abnormal cases and ``nn'' is the number of normal cases. That is, normal cases have less weight in the model fitting because there are so many. But then I get the warning message at the beginning of this email, and I suspect that I'm doing something wrong. Must weights be integers, or at least greater than one?

Regards,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

Michael Dewey
http://www.aghmed.fsnet.co.uk
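[Editorial note: a sketch of the sensitivity/specificity plot Michael suggests, assuming an unweighted fit of the model from the question, with variable names as in the thread:]

m    <- glm(ABN ~ TR, family = binomial(logit))
phat <- fitted(m)
cuts <- seq(0.05, 0.95, by = 0.05)
sens <- sapply(cuts, function(cc) mean(phat[ABN]  >= cc))   # true positive rate
spec <- sapply(cuts, function(cc) mean(phat[!ABN] <  cc))   # true negative rate
matplot(cuts, cbind(sens, spec), type = "l", lty = 1:2,
        xlab = "cut-off on P(abnormal)", ylab = "proportion")
legend("right", legend = c("sensitivity", "specificity"), lty = 1:2)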
[R] logistic regression model with non-integer weights
When fitting a logistic regression model using weights I get the following warning:

data.model.w <- glm(ABN ~ TR, family = binomial(logit), weights = WEIGHT)
Warning message: non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

Details follow ***

I have a binary dependent variable of abnormality ABN = T, F, T, T, F, F, F... and a continuous predictor TR = 1.962752 1.871123 1.893543 1.685001 2.121500, ... As the number of abnormal cases (ABN==T) is only 14%, and there is large overlap between abnormal and normal cases, the logistic regression found by glm is always much closer to the normal cases than to the abnormal cases. In particular, the probability of abnormal is at most 0.4.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607     0.7196   1.057   0.2905
TR2          -1.4853     0.4328  -3.432   0.0006 ***
---

I would like to compensate for the fact that the a priori probability of abnormal cases is so low. I have created a weight vector

WEIGHT <- ABN
WEIGHT[ABN == TRUE]  <- 1 / na / 2
WEIGHT[ABN == FALSE] <- 1 / nn / 2

so that all weights add up to 1, where ``na'' is the number of abnormal cases and ``nn'' is the number of normal cases. That is, normal cases have less weight in the model fitting because there are so many. But then I get the warning message at the beginning of this email, and I suspect that I'm doing something wrong. Must weights be integers, or at least greater than one?

Regards,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
[R] Logistic regression - confidence intervals
Please forgive a rather naïve question... Could someone please give a quick explanation for the differences in conf intervals achieved via confint.glm (based on profile likelihoods) and the intervals achieved using the Design library? For example, the intervals in the following two outputs are different.

library(Design)
x = rnorm(100)
y = gl(2, 50)
d = data.frame(x = x, y = y)
dd = datadist(d); options(datadist = 'dd')
m1 = lrm(y ~ x, data = d)
summary(m1)

m2 = glm(y ~ x, family = binomial, data = d)
confint(m2)

I have spent time trying to figure this out via the archives, but have not had much luck.

Regards, Stephen
Re: [R] Logistic regression model selection with overdispersed/autocorrelated data
Thanks for pointing out the aod package and the beta-binomial logistic models, Renaud. While I see how beta-binomial models could be applied to some of our other analyses, I don't see how they can be used in our habitat selection analysis, where individual locations are coded as 0 or 1 rather than proportions.

GEE models (geeglm from geepack) could be used for our analyses. Even though these models are fit using maximum likelihood estimation, they do not solve our model selection problem. Beta-coefficients from gee, glm, glmm's, and lrm are nearly identical. The only thing that varies is the variance-covariance matrix and the resulting standard errors. Consequently, the deviances should be similar because the predicted values (p) are calculated from the beta-coefficients. For an individual data point, the log-likelihood = y * log(p) + (1 - y) * log(1 - p), and the deviance = -2 * sum(log-likelihoods). Consequently, the difference in deviance between two models is amplified by autocorrelated data and causes models to be overparameterized when using AIC or likelihood ratio tests.

I am curious how others select models with autocorrelated data. Thanks for your help, Jesse

Renaud Lancelot wrote (31/01/2006): If you're not interested in fitting caribou-specific responses, you can use beta-binomial logistic models. There are several packages available for this purpose on CRAN, among which aod. Because these models are fitted using maximum-likelihood methods, you can use AIC (or other information criteria) to compare different models. Best, Renaud
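[Editorial note: Jesse's deviance bookkeeping in R form, as a check. A sketch: y is the 0/1 outcome vector and p the fitted probabilities from any of the fits he lists:]

ll  <- y * log(p) + (1 - y) * log(1 - p)   # per-observation log-likelihood
dev <- -2 * sum(ll)                        # matches deviance(fit) for a binary glm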
Re: [R] Logistic regression - confidence intervals
On Wed, 8 Feb 2006, Cox, Stephen wrote: Could someone please give a quick explanation for the differences in conf intervals achieved via confint.glm (based on profile likelihoods) and the intervals achieved using the Design library? ...

Well, the Design library is not giving you confidence intervals for parameters, is it? (Since there is no summary method for lrm it is a long haul to find out what it is giving you, which I leave to you.)

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595
Re: [R] Logistic regression - confidence intervals
Cox, Stephen wrote: Could someone please give a quick explanation for the differences in conf intervals achieved via confint.glm (based on profile likelihoods) and the intervals achieved using the Design library? ...

Design uses Wald (large-sample normality of parameter estimates) based confidence intervals. These are good for most situations, but profile confidence intervals are preferred. Someday I'll make Design do those. One advantage of Wald statistics is that they extend readily to cluster sampling (e.g., using cluster sandwich covariance estimators) and other complications (e.g., adjustment of variances for multiple imputation), whereas likelihood ratio statistics do not (unless e.g. you have an explicit model for the correlation structure or other facets of the model).

Also note that confint is probably giving a confidence interval for a one-unit change in x whereas summary.Design is computing an interquartile-range effect (the difference in x-values is shown in the summary output).

When posting a nice simulated example it's best to do set.seed(something) so everyone will get the same data.

Frank

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
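[Editorial note: to see the two interval types side by side on Stephen's m2 (a sketch; confint.default is the stock Wald interval in base R):]

set.seed(1)                       # per Frank's advice, make the example reproducible
x <- rnorm(100)
y <- gl(2, 50)
m2 <- glm(y ~ x, family = binomial)
confint.default(m2)   # Wald: estimate +/- 1.96 * SE
confint(m2)           # profile likelihood (via MASS)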
Re: [R] Logistic regression model selection with overdispersed/autocorrelated data
If you're not interested in fitting caribou-specific responses, you can use beta-binomial logistic models. There are several packages available for this purpose on CRAN, among which aod. Because these models are fitted using maximum-likelihood methods, you can use AIC (or other information criteria) to compare different models.

Best, Renaud

2006/1/30, [EMAIL PROTECTED] [EMAIL PROTECTED]: I am creating habitat selection models for caribou and other species with data collected from GPS collars. ...

--
Renaud LANCELOT
Département Elevage et Médecine Vétérinaire (EMVT) du CIRAD, Directeur adjoint chargé des affaires scientifiques
CIRAD, Animal Production and Veterinary Medicine Department, Deputy director for scientific affairs
Campus international de Baillarguet, TA 30 / B (Bât. B, Bur. 214), 34398 Montpellier Cedex 5 - France
Tél +33 (0)4 67 59 37 17, Secr. +33 (0)4 67 59 39 04, Fax +33 (0)4 67 59 37 95
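[Editorial note: a sketch of that suggestion with aod's betabin. All names are hypothetical: a study-level data frame dat with success counts y out of n trials and a covariate x; the second formula gives the overdispersion structure, and argument names and the AIC extractor should be checked against the aod documentation:]

library(aod)
fit <- betabin(cbind(y, n - y) ~ x, ~ 1, data = dat)  # ~ 1: one common overdispersion parameter
summary(fit)
AIC(fit)   # information criterion for comparing candidate models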
Re: [R] Logistic regression model selection with overdispersed/autocorrelated data
[EMAIL PROTECTED] wrote: I am creating habitat selection models for caribou and other species with data collected from GPS collars. ...

Frank E Harrell Jr wrote: If you must do non-subject-matter-driven model selection, look at the fastbw function in Design, which will use the cluster bootstrap variance matrix. Frank

Thanks for the tip. I didn't know that the fastbw function could account for the clustered variance. For others, the code to run such a model with the Design library would be:

model.1 <- lrm(y ~ x1 + x2 + x3 + x4, data = data, x = T, y = T)  # create model
model.2 <- bootcov(model.1, cluster = data$animal, B = 1)         # calculate robust variance matrix
fastbw(model.2)                                                   # backward step-wise selection

Later we will examine individual caribou responses to trails (subject-specific model selection). For this we plan to use mixed effects models (lmer). Is this what you would also recommend? I look forward to reading the new edition of your book when it is published.

Jesse
[R] Logistic regression model selection with overdispersed/autocorrelated data
I am creating habitat selection models for caribou and other species with data collected from GPS collars. In my current situation the radio-collars recorded the locations of 30 caribou every 6 hours. I am then comparing resources used at caribou locations to random locations using logistic regression (standard habitat analysis). The data is therefore highly autocorrelated, and this causes Type I error two ways – small standard errors around beta-coefficients and over-parameterization during model selection. Robust standard errors are easily calculated by block-bootstrapping the data using “animal” as a cluster with the Design library; however, I haven’t found a satisfactory solution for model selection. A couple of options are:

1. Using QAIC, where the deviance is divided by a variance inflation factor (Burnham & Anderson). However, this VIF can vary greatly depending on the data set and the set of covariates used in the global model.
2. Manual forward stepwise regression using both changes in deviance and robust p-values for the beta-coefficients.

I have been looking for a solution to this problem for a couple of years and would appreciate any advice.

Jesse
Re: [R] Logistic regression model selection with overdispersed/autocorrelated data
[EMAIL PROTECTED] wrote: I am creating habitat selection models for caribou and other species with data collected from GPS collars. ... I have been looking for a solution to this problem for a couple of years and would appreciate any advice. Jesse

If you must do non-subject-matter-driven model selection, look at the fastbw function in Design, which will use the cluster bootstrap variance matrix.

Frank

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Logistic regression to select genes and estimate cutoff point?
You could take a look at www.bioconductor.org; limma would be a good starting point.

hth, ido
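[Editorial note: for orientation, the usual limma workflow looks roughly like this. A sketch with hypothetical names: exprs.mat is a genes-by-samples expression matrix and tumor a 0/1 indicator of tumour samples:]

library(limma)
design <- model.matrix(~ tumor)    # intercept + tumour-vs-normal contrast
fit <- lmFit(exprs.mat, design)    # gene-wise linear models
fit <- eBayes(fit)                 # moderated (empirical-Bayes) t-statistics
topTable(fit, coef = 2)            # top-ranked differential genes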
Re: [R] Logistic regression to select genes and estimate cutoff point?
Lingsheng Dong wrote: Hi, all, I am new to R and even to statistics. ... I need to use microarray data to select several diagnostic genes between normal samples and tumor samples, and use these genes to predict unknown samples. ...

Just a comment: Do you not have a statistician to work with at your institution? You are new to statistics and are asking a question that would be very difficult to deal with for someone with a PhD in statistics and 20 years of experience. Some of the issues involved are multiple comparisons, false discovery rate, shrinkage, array geometry effects, nonparametric vs. parametric statistics, stability of selected genes, discovery validation, ...

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
[R] Logistic regression to select genes and estimate cutoff point?
Hi, all, I am new to R and even to statistics, and am not sure if the question has an answer, but I couldn't find a straightforward answer in the help mailing list. I need to use microarray data to select several diagnostic genes between normal samples and tumor samples, and use these genes to predict unknown samples. Since the sample size is so small and the data doesn't follow a normal distribution, I am thinking of using logistic regression instead of Student's t test to select genes. To make the problem simpler, I assume each gene is independent of the others, without interactions.

My question is how I should build up the model: one model for each gene, or a multiple-variable model including all genes? Which test compares the discrimination power of each gene? I am thinking it is the Wald statistic for the multiple-variable model and maximum likelihood for the single-gene models. Am I correct? To estimate the cutoff point, I guess the answer is the gene expression when p = 0.5 in the model. Am I on the right track? Any suggestion is appreciated! Thanks a lot.

Lingsheng
[R] logistic regression with constrained coefficients?
I am trying to automatically construct a distance function from a training set in order to use it to cluster another data set. The variables are nominal. One variable is a class variable having two values; it is kept separate from the others. I have a method which constructs a distance matrix for the levels of a nominal variable in the context of the other variables. I want to construct a linear combination of these which gives me a distance between whole cases that is well associated with the class variable, in that a large combined distance between two cases means they most likely belong to different classes. So from my training set I construct a set of (d1(x1,y1), ..., dn(xn,yn), x_class != y_class) rows bound together as a data frame (actually I construct it by columns), and then the obvious thing to try was

glm(different.class ~ ., family = binomial(), data = distance.frame)

The thing is that this gives me both positive and negative coefficients, whereas the linear combination is only guaranteed to be a metric if the coefficients are all non-negative. There are four fairly obvious ways to deal with that:

(1) Just force the negative coefficients to 0 and hope. This turns out to work rather well, but still...
(2) Keep all the coefficients but take max(0, linear combination of distances). This turns out to work rather well, but still...
(3) Drop the variables with negative coefficients from the model, refit, and iterate until no negative coefficients remain. This can hardly be said to work; sometimes nearly all the variables are dropped.
(4) Use a version of glm() that will let me constrain the coefficients to be non-negative.

I *have* searched the R-help archives, and I see that the question about logistic regression with constrained coefficients has come up before, but it didn't really get a satisfactory answer. I've also searched the documentation of more contributed packages than I could possibly understand. There is obviously some way to do this using R's general non-linear optimisation functions. However, I don't know how to formulate logistic regression that way.

This whole thing is heuristic. I am not hell-bent on (ab?)using logistic regression this way. It was just an obvious thing to try. Suggestions for other means to the same end will be welcome.
Re: [R] logistic regression with constrained coefficients?
On Fri, 9 Dec 2005, Richard A. O'Keefe wrote: I am trying to automatically construct a distance function from a training set in order to use it to cluster another data set. ... There is obviously some way to do this using R's general non-linear optimisation functions. However, I don't know how to formulate logistic regression that way.

There is a worked example in MASS (the book) p. 445, including adding constraints.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595
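[Editorial note: along the lines of option (4), logistic regression can be written as a likelihood maximisation and handed to a box-constrained optimiser. A sketch, not the MASS example: X is the matrix of per-variable distances and y the 0/1 different-class indicator, with the intercept left unconstrained:]

negll <- function(beta, X, y) {
  eta <- beta[1] + drop(X %*% beta[-1])
  -sum(y * eta - log1p(exp(eta)))        # negative binomial-logit log-likelihood
}
p <- ncol(X)
fit <- optim(rep(0, p + 1), negll, X = X, y = y,
             method = "L-BFGS-B", lower = c(-Inf, rep(0, p)))
fit$par   # intercept first, then non-negative distance weights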
[R] Logistic Regression using glm
Hello everyone, I am currently teaching an intermediate stats. course at UCSD Extension using R. We are using Venables and Ripley as the primary text for the course, with Freund and Wilson's Statistical Methods as a secondary reference. I recently gave a homework assignment on logistic regression, and I had a question about glm. Let n be the number of trials, p be the estimated sample proportion, and w be the standard binomial weights n*p*(1-p). If you perform

output <- glm(p ~ x, family = binomial, weights = n)

you get a different result than if you perform the logit transformation manually on p and perform

output <- lm(logit(p) ~ x, weights = w)

where logit(p) is either obtained from R with qlogis(p) or from a manual computation of ln(p/(1-p)). The difference seems to me to be too large to be roundoff error. The only thing I can guess is that the application of the weights in glm is different than in a manual computation. Can anyone explain the difference in results? Daniel Pick Principal Daniel Pick Scientific Software Consulting San Diego, CA E-Mail: [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic Regression using glm
You're fitting two different models. The latter is saying: logit(p) = x + e, where e is a normal error, so that logit(p) is normal. lm fits a Linear Model, which uses normal error. The former says that p is Bernoulli; and p ~ Bernoulli does not imply logit(p) is normal. A Generalized Linear Model has different options for specifying the random component. Agresti's Categorical Data Analysis lays out the details very well. -Original Message- From: [EMAIL PROTECTED] On Behalf Of Daniel Pick Sent: Tuesday, October 11, 2005 12:22 PM To: r-help@stat.math.ethz.ch Subject: [R] Logistic Regression using glm [original message quoted in full above] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic Regression using glm
One of these is modelling the mean of the logit of p, the other is modelling the logit of the mean of p. They aren't the same. -thomas On Tue, 11 Oct 2005, Daniel Pick wrote: [original message quoted in full above] Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
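To see the disagreement concretely, a small simulated sketch (all names invented): glm() maximises the binomial likelihood for the counts, while lm() does weighted least squares on the empirical logits, which additionally requires 0 < p < 1 in every group.

set.seed(2)
x <- 1:10
n <- rep(50, 10)                                  # trials per group
p <- rbinom(10, n, plogis(-2 + 0.4 * x)) / n      # observed proportions
w <- n * p * (1 - p)                              # standard binomial weights

coef(glm(p ~ x, family = binomial, weights = n))  # ML on the binomial counts
coef(lm(qlogis(p) ~ x, weights = w))              # weighted LS on logit(p); close, not equal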
Re: [R] logistic regression with nominal predictors
This sounds to me like a great research project that could be answered relatively easily with a Monte Carlo study. An excellent mathematician might be able to produce simple theoretical limits on the error from using ranks or normal scores, but such limits would likely be much wider than one could get in a typical case using Monte Carlo. And Monte Carlo in R is normally fairly easy even for quite complicated situations. spencer graves Ramón Casero Cañas wrote: [original message quoted in full; see the post below] -- Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA [EMAIL PROTECTED] www.pdf.com http://www.pdf.com Tel: 408-938-4420 Fax: 408-280-7915 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] logistic regression with nominal predictors
(Sorry for obvious mistakes, as I am quite a newbie with no Statistics background). My question is going to be what is the gain of logistic regression over odds ratios when none of the input variables is continuous. My experiment: Outcome: ordinal scale, ``quality'' (QUA=1,2,3). Predictors: ``segment'' (SEG) and ``stress'' (STR). SEG is nominal scale with 24 levels, and STR is dichotomous (0,1). Considering the outcome continuous, two-way ANOVA with

aov(as.integer(QUA) ~ SEG * STR)

doesn't find evidence of interaction between SEG and STR, and they are significant on their own. This is the result that we would expect from clinical knowledge. I use

xtabs(~QUA+SEG, data=data2.df, subset=STR==0)
xtabs(~QUA+SEG, data=data2.df, subset=STR==1)

for the contingency tables. There are zero cells, and for some values of SEG, there is only one non-zero cell, i.e. some values of SEG determine the output with certainty. So initially I was thinking of a proportional odds logistic regression model, but following Hosmer and Lemeshow [1], zero cells are problematic. So I take out of the data table the deterministic values of SEG, and I pool QUA=2 and QUA=3, and now I have a dichotomous outcome (QUA = Good/Bad) and no zero cells. The following model doesn't find evidence of interaction

glm(QUA ~ STR * SEG, data=data3.df, family=binomial)

so I go for

glm(QUA ~ STR + SEG, data=data3.df, family=binomial)

(I suppose that what glm does is to create design variables for SEG, where 0 0 ... 0 is for the first value of SEG, 1 0 ... 0 for the second value, 0 1 0 ... 0 for the third, etc).

Coefficients:
              Estimate  Std. Error    z value  Pr(>|z|)
(Intercept) -1.085e+00   1.933e-01     -5.614  1.98e-08 ***
STR.L        2.112e-01   6.373e-02      3.314  0.000921 ***
SEGP2C.MI   -9.869e-01   3.286e-01     -3.004  0.002669 **
SEGP2C.AI   -1.306e+00   3.585e-01     -3.644  0.000269 ***
SEGP2C.AA   -1.743e+00   4.123e-01     -4.227  2.37e-05 ***
[shortened]
SEGP4C.ML   -5.657e-01   2.990e-01     -1.892  0.058485 .
SEGP4C.BL   -2.908e-16   2.734e-01  -1.06e-15  1.00
SEGSAX.MS    1.092e-01   2.700e-01      0.405  0.685772
SEGSAX.MAS  -5.441e-16   2.734e-01  -1.99e-15  1.00
SEGSAX.MA    7.130e-01   2.582e-01      2.761  0.005758 **
SEGSAX.ML    1.199e+00   2.565e-01      4.674  2.96e-06 ***
SEGSAX.MP    1.313e+00   2.570e-01      5.108  3.26e-07 ***
SEGSAX.MI    8.865e-01   2.569e-01      3.451  0.000558 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3462.0 on 3123 degrees of freedom
Residual deviance: 3012.6 on 3101 degrees of freedom
AIC: 3058.6
Number of Fisher Scoring iterations: 6

Even though some coefficients have no evidence of statistical significance, the model requires them from a clinical point of view. At this point, the question would be how to interpret these results, and what advantage they offer over odds ratios. From [1] I can understand that in the case of a dichotomous and a continuous predictor, you can adjust for the continuous variable. But when all predictors are dichotomous (due to the design variables), I don't quite see the effect of adjustment. Wouldn't it be better just to split the data in two groups (STR=0 and STR=1), and instead of using logistic regression, use odds ratios for each value of SEG? Cheers, Ramón. [1] D.W. Hosmer and S. Lemeshow. ``Applied Logistic Regression''. John Wiley. 2000. -- Ramón Casero Cañas web: http://www.robots.ox.ac.uk/~rcasero/ __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
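As a small illustration of the point at issue (invented data): with a dichotomous outcome and one dichotomous predictor, exp() of the glm coefficient is exactly the 2x2-table odds ratio, so the regression only starts to add something once further terms (here SEG) turn it into an adjusted odds ratio.

set.seed(5)
qua <- rbinom(200, 1, 0.4)                 # outcome, Good/Bad
str <- rbinom(200, 1, 0.5)                 # stress, 0/1
tab <- table(str, qua)
(tab[2, 2] * tab[1, 1]) / (tab[1, 2] * tab[2, 1])    # sample odds ratio
exp(coef(glm(qua ~ str, family = binomial))["str"])  # identical to the table OR

With SEG in the model as well, exp(coef) for STR becomes an odds ratio adjusted for segment, which stratified 2x2 tables give only one segment at a time.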
Re: [R] Logistic regression with constrained parameter ??
On Sat, 27 Aug 2005, Benn Fine wrote: I want to fit a logistic regression model under a specified null hypothesis, say Ho: Beta_k = 1. Is there a way to constrain a parameter in glm.fit?? You should be calling glm(). Then you can use offset() in your formula. (You can also use it as an argument, but please don't as it is only partially supported.) [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html PLEASE do (and notice what it says about HTML mail). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
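A hedged sketch of the suggestion, with made-up data: to test H0: beta_2 = 1, put 1 * x2 into an offset so it enters the linear predictor with a fixed coefficient, then compare against the unconstrained fit with a likelihood-ratio test.

set.seed(3)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- rbinom(100, 1, plogis(0.5 + x1 + x2))

fit0 <- glm(y ~ x1 + offset(1 * x2), family = binomial)  # coefficient of x2 fixed at 1
fit1 <- glm(y ~ x1 + x2, family = binomial)              # coefficient of x2 free
anova(fit0, fit1, test = "Chisq")                        # LR test of H0: beta_2 = 1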
[R] logistic regression: categorical value, and multinomial
I have two questions: 1. If I want to do a binomial logit, how to handle the categorical response variable? Data for the response variables are not numerical, but text. 2. What if I want to do a multinomial logit, still with categorical response variable? The variable has 5 non-numerical response levels, I have to do it with a multinomial logit. Any input is highly appreciated! Thanks! Ed __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression: categorical value, and multinomial
d <- data.frame(y=sample(letters[1:2],100,T), x=rnorm(100))
head(d,10)
   y           x
1  b  0.55915620
2  b  0.87575380
3  b -0.13093156
4  b  0.75925729
5  b  0.40233427
6  b  1.34685918
7  a  1.10487752
8  a -2.27456596
9  a  1.65919787
10 b  0.05095611
glm(y~x, data=d, family=binomial)

Call: glm(formula = y ~ x, family = binomial, data = d)

Coefficients:
(Intercept)        x
     0.2771   0.5348

Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 136.1
Residual Deviance: 129.5  AIC: 133.5

=== 2005-07-28 00:22:55 you wrote in your message: === [original message quoted above] = = = = = = = = = = = = = = = = = = = = 2005-07-28 -- Department of Sociology Fudan University Blog:http://sociology.yculblog.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression: categorical value, and multinomial
Dear Ed, See ?glm for fitting binomial logit models, and ?multinom (in the nnet package) for multinomial logit models. Neither function will handle a character (text) variable as the response, but you could easily convert the variable to a factor. John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] On Behalf Of Haibo Huang Sent: Wednesday, July 27, 2005 11:23 AM To: r-help@stat.math.ethz.ch Subject: [R] logistic regression: categorical value, and multinomial [original message quoted in full above] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
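A minimal sketch of that advice with invented data: convert the character response to a factor, after which glm() handles the two-level case and nnet's multinom() the five-level one.

library(nnet)
d <- data.frame(resp = sample(c("yes", "no"), 200, replace = TRUE),
                x = rnorm(200), stringsAsFactors = FALSE)
d$resp <- factor(d$resp)                   # first level ("no") is the reference

bin.fit <- glm(resp ~ x, family = binomial, data = d)   # binomial logit

d$resp5 <- factor(sample(letters[1:5], 200, replace = TRUE))
mul.fit <- multinom(resp5 ~ x, data = d)   # 5-level multinomial logit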
[R] logistic regression asymptote problem
R-helpers, I have a question about logistic regressions. Consider a case where you have binary data that reaches an asymptote that is not 1, maybe it's 0.5. Can I still use a logistic regression to fit a curve to this data? If so, how can I do this in R? As far as I can figure out, using a logit link function assumes that the asymptote is at y = 1. An example. Consider the following data:

tmp <- structure(list(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
  yes = c(0, 0, 0, 2, 1, 14, 24, 15, 23, 18, 22, 20, 14, 17),
  no = c(94, 101, 95, 80, 81, 63, 51, 56, 30, 38, 31, 18, 21, 20)),
  .Names = c("x", "yes", "no"),
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14"),
  class = "data.frame")

where x is the independent variable, and yes and no are counts of events. Plotting the data you can see that the data seem to reach an asymptote at around y=0.5. Using glm to fit a logistic regression it is easily seen that it does not fit well.

tmp.glm <- glm(cbind(yes,no) ~ x, data = tmp, family = binomial(link = "logit"))
plot(tmp.glm$fitted, type = "l", ylim = c(0,1))
par(new=T)
plot(tmp$yes / (tmp$yes + tmp$no), ylim = c(0,1))

Any suggestions would be greatly appreciated. Cheers, Kevin -- Kevin J Emerson Center for Ecology and Evolutionary Biology 1210 University of Oregon University of Oregon Eugene, OR 97403 [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression asymptote problem
I saw a standard overdispersed binomial. In particular, I saw NO evidence of saturation at 0.5 or anything below 1. I did the following:

tmp$N <- tmp$yes + tmp$no
with(tmp, plot(x, yes))
with(tmp, plot(x, yes/N))
tmp.glm <- glm(cbind(yes,no) ~ x, data = tmp, family = binomial(link = "logit"))
tmp.glmq <- glm(cbind(yes,no) ~ x, data = tmp, family = quasibinomial(link = "logit"))
summary(tmp.glm)
summary(tmp.glmq)
plot(tmp.glm)
plot(tmp.glmq)
# Test the statistical significance of the Dispersion parameter
pchisq(summary(tmp.glmq)$dispersion*12, 12, lower=FALSE)

hope this helps. spencer graves Kevin J Emerson wrote: [original message quoted in full above] -- Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA [EMAIL PROTECTED] www.pdf.com http://www.pdf.com Tel: 408-938-4420 Fax: 408-280-7915 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
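If an estimated upper asymptote is wanted anyway, one possibility (a sketch, not something endorsed in this thread) is to model p(x) = c * plogis(a + b*x) and fit all three parameters by maximum likelihood with optim(), reusing the tmp data frame from the original post:

negll <- function(par) {
  c. <- plogis(par[1])                  # asymptote kept inside (0, 1)
  p  <- c. * plogis(par[2] + par[3] * tmp$x)
  -sum(dbinom(tmp$yes, tmp$yes + tmp$no, p, log = TRUE))
}
fit <- optim(c(0, -5, 1), negll)        # starting values chosen by eye
plogis(fit$par[1])                      # estimated asymptote; near 0.5 for these data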
Re: [R] logistic regression - using polys and products of features
On Fri, 17 Jun 2005 07:39:30 +1000 Stephen Choularton [EMAIL PROTECTED] wrote: [question quoted in full; see the original post below] What do the warnings mean when I do this:

logistic.model = glm(similarity ~ . + . ^ 2, family=binomial, data = cData[3001:3800,])
Warning messages:
1: Algorithm did not converge in: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,
2: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,

.^2 means all the 2-order and 1-order terms, so .+.^2 is meaningless.

-- Department of Sociology Fudan University, Shanghai Blog:http://sociology.yculblog.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] logistic regression - using polys and products of features
Hi I can get all my features by doing this:

logistic.model = glm(similarity ~ ., family=binomial, data = cData[3001:3800,])

I can get the product of all my features by this:

logistic.model = glm(similarity ~ . ^ 2, family=binomial, data = cData[3001:3800,])

I don't seem to be able to get polys by doing this:

logistic.model = glm(similarity ~ poly(.,2), family=binomial, data = cData[3001:3800,])
Error in poly(., 2) : Object . not found

How can I get polys? What do the warnings mean when I do this:

logistic.model = glm(similarity ~ . + . ^ 2, family=binomial, data = cData[3001:3800,])
Warning messages:
1: Algorithm did not converge in: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,
2: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,

How can I do this?

logistic.model = glm(similarity ~ . + . ^ 2 + poly(.,2), family=binomial, data = cData[3001:3800,])

Thanks Stephen __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
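Since poly() cannot expand the '.' shorthand, one workaround is to paste the formula together over the predictor names. A sketch assuming every column of cData other than similarity is numeric:

preds <- setdiff(names(cData), "similarity")
rhs   <- paste(sprintf("poly(%s, 2)", preds), collapse = " + ")
f     <- as.formula(paste("similarity ~", rhs))   # similarity ~ poly(v1,2) + poly(v2,2) + ...
logistic.model <- glm(f, family = binomial, data = cData[3001:3800, ])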
Re: [R] Logistic regression with more than two choices
Dear all R-users, I am a new user of R and I am trying to build a discrete choice model (with more than two alternatives A, B, C and D) using logistic regression. I have data that describes the observed choice probabilities and some background information. An example below describes the data:

Sex  Age  pr(A)  pr(B)  pr(C)  pr(D)
  1   11    0.5    0.5      0      0
  1   40      1      0      0      0
  0   34      0      0      0      1
  0   64    0.1    0.5    0.2    0.2
...

You can use multinom(). Here is an example. For example, let this matrix be analyzed:

male  female  aborted  factor
  10      12        1     1.2
  14      14        4     1.3
  15      12        3     1.4

The data are to be entered in a text file like this:

output  factor   n
m          1.2  10
f          1.2  12
a          1.2   1
m          1.3  14
f          1.3  14
a          1.3   4
m          1.4  15
f          1.4  12
a          1.4   3

library(nnet)
dt.plr <- multinom(output ~ factor, data=dt, weights=n, maxit=1000)
dt.pr1 <- predict(dt.plr, type="probs")
dt.pr1

I have been able to model a case with only two alternatives A and not A by using glm(). I do not know what functions are available to estimate such a model with more than two alternatives. multinom() is one possibility, but it only allows the use of binary 0/1-data instead of observed probabilities. Did I understand this correctly? Additionally, I am willing to use different independent variables for the different alternatives in the model. Formally, I mean that:

Pr(A) = exp(uA)/(exp(uA)+exp(uB)+exp(uC)+exp(uD))
Pr(B) = exp(uB)/(exp(uA)+exp(uB)+exp(uC)+exp(uD))
...

where uA, uB, uC and uD are linear functions with different independent variables, e.g. uA=alpha_A1*Age, uB=alpha_B1*Sex. Do you know how to estimate this type of models in R? I don't think it is possible... (at least simply, without writing all the script!) Note that I don't understand where the residual deviance from multinom() comes from. I can't find the logic. Marc -- __ Marc Girondot, Pr Laboratoire Ecologie, Systématique et Evolution Equipe de Conservation des Populations et des Communautés CNRS, ENGREF et Université Paris-Sud 11, UMR 8079 Bâtiment 362 91405 Orsay Cedex, France Tel: 33 1 (0)1.69.15.72.30 Fax: 33 1 (0)1 69 15 56 96 e-mail: [EMAIL PROTECTED] Web: http://www.ese.u-psud.fr/epc/conservation/Marc.html Skype: girondot Fax in US: 1-425-732-6934 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
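One hedged way to feed observed choice probabilities to multinom() is to expand each case into one row per alternative, weighted by its probability (multiplied by a case weight, if one is available). The data below are invented to mirror the table in the question:

library(nnet)
probs <- data.frame(Sex = c(1, 1, 0, 0), Age = c(11, 40, 34, 64),
                    A = c(0.5, 1, 0, 0.1), B = c(0.5, 0, 0, 0.5),
                    C = c(0, 0, 0, 0.2),   D = c(0, 0, 1, 0.2))
long <- reshape(probs, varying = c("A", "B", "C", "D"), v.names = "w",
                timevar = "choice", times = c("A", "B", "C", "D"),
                direction = "long")          # one row per case x alternative
long$choice <- factor(long$choice)
fit <- multinom(choice ~ Sex + Age, weights = w, data = long, subset = w > 0)

The alternative-specific utilities uA, uB, ... with different regressors per alternative are a different (conditional logit) model, which is indeed beyond what multinom() fits.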
[R] logistic regression (glm binary)
Hi I am looking for a couple of pointers using glm (family = binary). 1. I want to add all the products of my predictive features as additional features (and I have 23 of them). Is there some easy way to add them? 2. I want to drop each feature in turn and get the most significant, then drop two and get the next most significant, etc. Is there some function that allows me to do this? Thanks Stephen __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
RE: [R] logistic regression (glm binary)
-Original Message- From: [EMAIL PROTECTED] On Behalf Of Stephen Choularton Sent: Tuesday, June 07, 2005 11:49 PM To: R Help Subject: [R] logistic regression (glm binary) Hi I am looking for a couple of pointers using glm (family = binary).

Do you mean binomial instead of binary?

1. I want to add all the products of my predictive features as additional features (and I have 23 of them). Is there some easy way to add them?

Probably something along the line:

dat <- data.frame(y=sample(0:1, 100, replace=TRUE), matrix(runif(300), ncol=3))
fm <- glm(y ~ .^2, family=binomial, data=dat)
summary(fm)

Call:
glm(formula = y ~ .^2, family = binomial, data = dat)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.654  -1.175   0.608   1.116   1.651

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.264      1.536   2.125   0.0336 *
X1            -3.379      2.026  -1.668   0.0953 .
X2            -4.659      2.244  -2.077   0.0378 *
X3            -3.531      2.060  -1.714   0.0865 .
X1:X2          4.535      2.775   1.634   0.1022
X1:X3          2.123      2.639   0.804   0.4212
X2:X3          4.315      2.746   1.571   0.1161
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.99 on 99 degrees of freedom
Residual deviance: 131.84 on 93 degrees of freedom
AIC: 145.84
Number of Fisher Scoring iterations: 3

2. I want to drop each feature in turn and get the most significant, then drop two and get the next most significant, etc. Is there some function that allows me to do this?

Not that I know of, and most likely for a very, very good reason... Andy

Thanks Stephen __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
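For question 2, the closest off-the-shelf tool (with the same caveat about stepwise selection) is drop1(), which refits the model dropping one term at a time; using the fm from the example above:

drop1(fm, test = "Chisq")   # single-term deletions with likelihood-ratio tests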
[R] logistic regression
Hi I am working on corpora of automatically recognized utterances, looking for features that predict error in the hypothesis the recognizer is proposing. I am using the glm functions to do logistic regression. I do this type of thing:

logistic.model = glm(formula = similarity ~ ., family = binomial, data = data)

and end up with a model:

summary(logistic.model)

Call: glm(formula = similarity ~ ., family = binomial, data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1599   0.2334   0.3307   0.4486   1.2471

Coefficients:
                        Estimate  Std. Error z value Pr(>|z|)
(Intercept)           11.1923783   4.6536898   2.405  0.01617 *
length                -0.3529775   0.2416538  -1.461  0.14410
meanPitch             -0.0203590   0.0064752  -3.144  0.00167 **
minimumPitch           0.0257213   0.0053092   4.845 1.27e-06 ***
maximumPitch          -0.0003454   0.0030008  -0.115  0.90838
meanF1                 0.0137880   0.0047035   2.931  0.00337 **
meanF2                 0.0040238   0.0041684   0.965  0.33439
meanF3                -0.0075497   0.0026751  -2.822  0.00477 **
meanF4                -0.0005362   0.0007443  -0.720  0.47123
meanF5                -0.0001560   0.0003936  -0.396  0.69187
ratioF2ToF1            0.2668678   2.8926149   0.092  0.92649
ratioF3ToF1            1.7339087   1.7655757   0.982  0.32607
jitter                -5.2571384  10.8043359  -0.487  0.62656
shimmer               -2.3040826   3.0581950  -0.753  0.45120
percentUnvoicedFrames  0.1959342   1.3041689   0.150  0.88058
numberOfVoiceBreaks   -0.1022074   0.0823266  -1.241  0.21443
percentOfVoiceBreaks  -0.0590097   1.2580202  -0.047  0.96259
meanIntensity         -0.0765124   0.0612008  -1.250  0.21123
minimumIntensity       0.1037980   0.0331899   3.127  0.00176 **
maximumIntensity      -0.0389995   0.0430368  -0.906  0.36484
ratioIntensity        -2.0329346   1.2420286  -1.637  0.10168
noSyllsIntensity       0.1157678   0.0947699   1.222  0.22187
startSpeech            0.0155578   0.1343117   0.116  0.90778
speakingRate          -0.2583315   0.1648337  -1.567  0.11706
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2462.3 on 4310 degrees of freedom
Residual deviance: 2209.5 on 4287 degrees of freedom
AIC: 2257.5
Number of Fisher Scoring iterations: 6

I have seen models where almost all the features are showing one in a thousand significance but I accept that I could improve my model by normalizing some of the features (some are left skewed and I understand that I will get a better fit by taking their logs, for example). What really worries me is that the logistic function produces predictions that appear to fall well outside 0 to 1. If I make a dataset of the medians of the above features and use my logistic.model on it, it produces a figure of:

x = predict(logistic.model, medians)
x
[1] 2.82959

which is well outside the range of 0 to 1. The actual distribution of all the predictions is:

summary(pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -1.516   2.121   2.720   2.731   3.341   6.387

I can get the model to give some sort of prediction by doing this:

pred = predict(logistic.model, data)
pred[pred <= 1.5] = 0
pred[pred > 1.5] = 1
t = table(pred, data[,24])
t
pred    0    1
   0  102  253
   1  255 3701

classAgreement(t)
$diag
[1] 0.8821619
$kappa
[1] 0.949
$rand
[1] 0.7920472
$crand
[1] 0.1913888

but as you can see I am using a break point well outside the range 0 to 1 and the kappa is rather low (I think). I am a bit of a novice in this, and the results worry me. Can anyone comment if the results look strange, or if they know I am doing something wrong? Stephen __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression
predict.glm by default produces predictions on the scale of the linear predictors. If, in a logistic regression, you want the predictions to be on the response scale [0,1], use

x <- predict(logistic.model, medians, type="response")

for example. See ?predict.glm for details. Cheers, Simon. [original message quoted in full above] -- Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat. Visiting Fellow School of Botany & Zoology The Australian National University Canberra ACT 0200 Australia T: +61 2 6125 8057 email: [EMAIL PROTECTED] F: +61 2 6125 5573 CRICOS Provider # 00120C __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
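A small self-contained illustration of the two scales (invented data): the response scale is simply the inverse logit of the link scale.

set.seed(4)
x <- rnorm(100); y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

lp <- predict(fit, type = "link")        # linear predictor, unbounded
pr <- predict(fit, type = "response")    # probabilities in (0, 1)
all.equal(pr, plogis(lp))                # TRUE: response = inverse logit of link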
[R] logistic regression: differential importance of regressors
Hi, All. I have a logistic regression model that I have run. The question came up: which of these regressors is more important than another? (I'm using Design)

Logistic Regression Model
lrm(formula = iconicgesture ~ ST + SSP + magnitude + Condition + Expertise, data = d)

            Coef     S.E.    Wald Z  P
Intercept   -3.2688  0.2854  -11.45  0.0000
ST           2.0871  0.2730    7.64  0.0000
SSP          0.7454  0.3031    2.46  0.0139
magnitude   -0.9905  0.6284   -1.58  0.1150
Condition    0.9506  0.2932    3.24  0.0012
Expertise    0.8508  0.2654    3.21  0.0013

The real question is that, since both ST and SSP load significantly into the model, how do I show that ST has a bigger/smaller/similar effect than SSP? thanks in advance! greg __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression: differential importance of regressors
Greg Trafton wrote: [original message quoted in full above] One thing you can do is to compute what proportion of the total likelihood ratio chi-square statistic is due to each variable by removing one at a time and looking at the difference in Model L.R. (assuming both have the same observations missing). Note that you are making heavy linearity assumptions. You can also use the bootstrap to get a confidence interval on the rank of the chi-square statistic a variable has among all competing chi-squares. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
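A hedged sketch of the first suggestion using plain glm() and nested fits; the names mirror the lrm() call above, with d being the poster's data frame:

full <- glm(iconicgesture ~ ST + SSP + magnitude + Condition + Expertise,
            family = binomial, data = d)
no.ST  <- update(full, . ~ . - ST)    # refit without ST
no.SSP <- update(full, . ~ . - SSP)   # refit without SSP

## likelihood-ratio chi-square attributable to each variable (1 d.f. each):
anova(no.ST,  full, test = "Chisq")
anova(no.SSP, full, test = "Chisq")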
[R] logistic regression weights problem
Hi All, I have a problem with weighted logistic regression. I have a number of SNPs and a case/control scenario, but not all genotypes are as guaranteed as others, so I am using weights to downsample the importance of individuals whose genotype has been heavily inferred. My data is quite big, but with a dummy example:

status <- c(1,1,1,0,0)
SNPs <- matrix( c(1,0,1,0,0,0,0,1,0,1,0,1,0,1,1), ncol = 3)
weight <- c(0.2, 0.1, 1, 0.8, 0.7)
glm(status ~ SNPs, weights = weight, family = binomial)

Call: glm(formula = status ~ SNPs, family = binomial, weights = weight)

Coefficients:
(Intercept)    SNPs1    SNPs2    SNPs3
     -2.079   42.282  -18.964       NA

Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 3.867
Residual Deviance: 0.6279  AIC: 6.236
Warning messages:
1: non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)
2: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,

NB I do not get warning (2) for my data so I'll completely disregard it. Warning (1) looks suspiciously like a multiplication of my C/C status by the weights... what exactly is glm doing with the weight vector? In any case, how would I go about weighting my individuals in a logistic regression? Regards, Federico Calboli -- Federico C. F. Calboli Department of Epidemiology and Public Health Imperial College, St Mary's Campus Norfolk Place, London W2 1PG Tel +44 (0)20 7594 1602 Fax (+44) 020 7594 3193 f.calboli [.a.t] imperial.ac.uk f.calboli [.a.t] gmail.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression weights problem
On Wed, 13 Apr 2005, Federico Calboli wrote: [dummy example quoted in full above] what exactly is glm doing with the weight vector?

Using it in the GLM definition. If you specify 0 <= y_i <= 1 and weights a_i, this is how you specify Binomial(a_i, a_i y_i). Look up any book on GLMs and see what it says about the binomial. E.g. MASS4 pp. 184, 190.

In any case, how would I go about weighting my individuals in a logistic regression?

Use the cbind(yes, no) form of specification. Note though that the `weights' in a GLM are case weights and not arbitrary downweighting factors, and aspects of the output (e.g. AIC, anova) depend on this. A different implementation of (differently) weighted GLM is svyglm() in package 'survey'. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] logistic regression weights problem
On Wed, 2005-04-13 at 17:42 +0100, Prof Brian Ripley wrote: Use the cbind(yes, no) form of specification. Note though that the `weights' in a GLM are case weights and not arbitrary downweighting factors and aspects of the output (e.g. AIC, anova) depend on this. A different implementation of (differently) weighted GLM is svyglm() in package 'survey'.

I tried to use cbind() on a slightly modified dummy set to get rid of all warnings and that's what I got:

status <- c(1,1,1,0,0)
SNPs <- matrix( c(1,0,1,0,0,1,0,1,0,1,0,1,0,1,1), ncol = 3)
weight <- c(0.2, 0.1, 1, 0.8, 0.7)

### using cbind()
glm(cbind(status, 1-status) ~ SNPs, weights = weight, family = binomial)

Call: glm(formula = cbind(status, 1 - status) ~ SNPs, family = binomial, weights = weight)

Coefficients:
(Intercept)    SNPs1    SNPs2    SNPs3
     -2.079   43.132  -19.487       NA

Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 3.867
Residual Deviance: 0.6279  AIC: 6.236

### NOT using cbind()
glm(status ~ SNPs, weights = weight, family = binomial)

Call: glm(formula = status ~ SNPs, family = binomial, weights = weight)

Coefficients:
(Intercept)    SNPs1    SNPs2    SNPs3
     -2.079   42.944  -19.366       NA

Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 3.867
Residual Deviance: 0.6279  AIC: 6.236
Warning message: non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

### The anova() call of the cbind() model seems happy:

mod <- glm(cbind(status, 1-status) ~ SNPs, weights = weight, family = binomial)
anova(mod, test = "Chi")

Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(status, 1 - status)
Terms added sequentially (first to last)
      Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      4     3.8673
SNPs   2   3.2394         2     0.6279    0.1980

### The real data model does not show warnings when I use cbind() but it still shows warnings when I call anova() on the model:

anova(glm(cbind(sta, 1-sta) ~ (X1 + X2 + X3 + X4 + X5) * breed, family = binomial, weights = igf$we, igf), test = "Chi")

Analysis of Deviance Table
Model: binomial, link: logit
Response: cbind(sta, 1 - sta)
Terms added sequentially (first to last)
          Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                        330     372.89
X1         1     0.14       329     372.75      0.71
X2         1     0.26       328     372.49      0.61
X3         1 0.001121       327     372.49      0.97
X4         1     2.63       326     369.86      0.10
X5         1     1.87       325     367.99      0.17
breed      3     5.41       322     362.58      0.14
X1:breed   3     2.98       319     359.60      0.39
X2:breed   3     1.21       316     358.39      0.75
X3:breed   2     1.32       314     357.08      0.52
X4:breed   3     1.75       311     355.33      0.63
X5:breed   2     2.38       309     352.95      0.30

Warning messages: "non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)" (repeated 10 times)

Why such inconsistency? Regards, Federico Calboli -- Federico C. F. Calboli Department of Epidemiology and Public Health Imperial College, St Mary's Campus Norfolk Place, London W2 1PG Tel +44 (0)20 7594 1602 Fax (+44) 020 7594 3193 f.calboli [.a.t] imperial.ac.uk f.calboli [.a.t] gmail.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
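For what it is worth, the warning goes away only when the two-column response holds genuine integer counts, i.e. when the weights really are case weights. A sketch with invented integer replicate counts n in place of the fractional certainty weights:

status <- c(1, 1, 1, 0, 0)
SNPs   <- matrix(c(1,0,1,0,0,1,0,1,0,1,0,1,0,1,1), ncol = 3)
n      <- c(2, 1, 10, 8, 7)                          # integer counts, not 0-1 weights
yes    <- status * n                                 # successes per covariate pattern
glm(cbind(yes, n - yes) ~ SNPs, family = binomial)   # fits silently

Fractional "certainty" weights are not case weights, which is why the warning keeps coming back; as noted above, svyglm() in the survey package is one principled alternative.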
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: On Thu, 10 Mar 2005 16:19:41 -0600 Frank E Harrell Jr [EMAIL PROTECTED] wrote: The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). Perhaps I'm blind, but I can't find any reference to an alpha parameter on the help page for stepAIC. It runs fine however when I set the parameter and produces a model with the same right hand variables as without. Can you tell me what it is? T What I mean is the effective significance level for keeping a variable in the model. Using AIC for one degree of freedom variables is effectively using an alpha of 0.16 if I recall properly. But I hope you got the point that resid(fit,'gof') as with most goodness of fit tests assumes prespecified models. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
On Fri, 11 Mar 2005 07:32:30 -0500 Frank E Harrell Jr [EMAIL PROTECTED] wrote: What I mean is the effective significance level for keeping a variable in the model. Using AIC for one degree of freedom variables is effectively using an alpha of 0.16 if I recall properly. But I hope you got the point that resid(fit,'gof') as with most goodness of fit tests assumes prespecified models. Yes. I understand. Late last night someone replied similarly off the list but I didn't realize they hadn't posted to the list until this morning, otherwise I would have posted a note that my question had been resolved. Sorry for taking your time. T -- Trevor Wiens [EMAIL PROTECTED] The significant problems that we face cannot be solved at the same level of thinking we were at when we created them. (Albert Einstein) __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Logistic regression goodness of fit tests
I was unsure of what suitable goodness-of-fit tests existed in R for logistic regression. After searching the R-help archive I found that the Design package's models and resid() could be used to calculate this as follows:

d <- datadist(mydataframe)
options(datadist = 'd')
fit <- lrm(response ~ predictor1 + predictor2..., data=mydataframe, x=T, y=T)
resid(fit, 'gof')

I set up a script to first use glm to create models and then use stepAIC to determine the optimal model. I used this instead of fastbw because I found the AIC values to be completely different and the final models didn't always match. Then my script takes the reduced model formula and recreates it using lrm as above. Now the problem is that for some models I run into an error to which I can find no reference whatsoever on the mailing list or on the web. It is as follows:

test.lrm <- lrm(cclo ~ elev + aspect + cti_var + planar + feat_div + loamy + sands + sandy + wet + slr_mean, data=datamatrix, x = T, y = T)
singular information matrix in lrm.fit (rank= 10 ). Offending variable(s): slr_mean
Error in j:(j + params[i] - 1) : NA/NaN argument

Now if I add the singularity criterion and make the value smaller than the default of 1E-7 (to 1E-9, or to 1E-12, which is the default in calibrate), it works. Why is that? Not being a statistician but a biogeographer using regression as a tool, I don't really understand what is happening here. Does changing the tol variable change how I should interpret goodness-of-fit results or other evaluations of the models created? I've included a summary of the data below (in case it might be helpful) with all variables in the data frame as it was easier than selecting out the ones used in the model. Thanks in advance. T -- Trevor Wiens [EMAIL PROTECTED] The significant problems that we face cannot be solved at the same level of thinking we were at when we created them. (Albert Einstein)

summary(datamatrix)
     siteid          block          recordyear        cclo
 564-125:   5   Min.   :1.000   Min.   :2000   Min.   :0.0000
 564-130:   5   1st Qu.:2.000   1st Qu.:2001   1st Qu.:1.0000
 564-135:   5   Median :3.000   Median :2002   Median :1.0000
 564-140:   5   Mean   :3.042   Mean   :2002   Mean   :0.7509
 564-145:   5   3rd Qu.:4.000   3rd Qu.:2003   3rd Qu.:1.0000
 564-150:   5   Max.   :5.000   Max.   :2004   Max.   :1.0000
 (Other) :1098

      elev             slope            aspect          slr_mean
 Min.   :0.0000   Min.   :0.1499   Min.   :0.0000   Min.   :7681
 1st Qu.:0.0000   1st Qu.:0.5876   1st Qu.:0.0000   1st Qu.:7852
 Median :1.0000   Median :0.9195   Median :0.0000   Median :7877
 Mean   :0.6259   Mean   :1.2523   Mean   :0.2482   Mean   :7871
 3rd Qu.:1.0000   3rd Qu.:1.6694   3rd Qu.:0.0000   3rd Qu.:7892
 Max.   :1.0000   Max.   :5.3366   Max.   :1.0000   Max.   :7981

      cti            cti_var           planar          feat_div
 Min.   :7.157   Min.   :0.4497   Min.   :0.0000   Min.   :1.000
 1st Qu.:7.651   1st Qu.:0.6187   1st Qu.:1.0000   1st Qu.:2.000
 Median :7.720   Median :0.8495   Median :1.0000   Median :3.000
 Mean   :7.763   Mean   :0.9542   Mean   :0.8254   Mean   :3.379
 3rd Qu.:7.822   3rd Qu.:1.1918   3rd Qu.:1.0000   3rd Qu.:4.000
 Max.   :8.769   Max.   :2.5615   Max.   :1.0000   Max.   :6.000

    chop_san           loamy            sands            sandy
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000
 Mean   :0.05762   Mean   :0.3094   Mean   :0.3236   Mean   :0.1099
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

      wet           timesinceburn         ndvi             evi
 Min.   :0.00000   Min.   :  1.00   Min.   :0.1140   Min.   :0.1041
 1st Qu.:0.00000   1st Qu.:100.00   1st Qu.:0.2973   1st Qu.:0.1667
 Median :0.00000   Median :100.00   Median :0.3342   Median :0.2027
 Mean   :0.01950   Mean   : 87.84   Mean   :0.3629   Mean   :0.2184
 3rd Qu.:0.00000   3rd Qu.:100.00   3rd Qu.:0.4463   3rd Qu.:0.2711
 Max.   :1.00000   Max.   :100.00   Max.   :0.5932   Max.   :0.4788

     msavi2              fc              gdd             precip
 Min.   :0.09156   Min.   :0.1552   Min.   :380.6   Min.   : 50.04
 1st Qu.:0.14936   1st Qu.:0.3246   1st Qu.:492.8   1st Qu.: 76.17
 Median :0.18257   Median :0.4082   Median :500.8   Median : 85.50
 Mean   :0.19653   Mean   :0.4398   Mean   :476.4   Mean   : 94.35
 3rd Qu.:0.24626   3rd Qu.:0.5630   3rd Qu.:501.6   3rd Qu.: 95.16
 Max.   :0.33258   Max.   :0.6996   Max.   [output truncated in the archive]
Re: [R] Logistic regression goodness of fit tests
On Thu, 10 Mar 2005, Trevor Wiens wrote: [original message quoted in full above] From one geographer to another, and being prepared to bow to better-founded explanations, you seem to have included a variable - the offending variable slr_mean - that is very highly correlated with another. Making the tolerance tighter says that you are prepared to take the risk of confounding your results. You've already been fishing for right hand side variables anyway, so your results are somewhat prejudiced, aren't they? I think you may also like to review which of the right hand side variables should be treated as factors rather than numeric (looking at the summary suggests that many are factors), and perhaps the dependent variable too, although lrm() seems to take care of this if you haven't. [remainder of the original message quoted] -- Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93 e-mail: [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: [original message quoted in full above] The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
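For completeness, the binning-free test this thread relies on (le Cessie-van Houwelingen, via resid(fit, 'gof')) on a prespecified model, sketched with invented data; Design is the package named here, continued in later years as rms:

library(Design)   # library(rms) in current R
set.seed(6)
x <- rnorm(200); y <- rbinom(200, 1, plogis(x))
fit <- lrm(y ~ x, x = TRUE, y = TRUE)   # x=TRUE, y=TRUE needed for residuals
resid(fit, "gof")                       # single-d.f. goodness-of-fit test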
Re: [R] Logistic regression goodness of fit tests
On Thu, 10 Mar 2005 22:36:09 +0100 (CET) Roger Bivand [EMAIL PROTECTED]
wrote:

> From one geographer to another, and being prepared to bow to
> better-founded explanations: you seem to have included a variable - the
> offending variable slr_mean - that is very highly correlated with
> another. Making the tolerance tighter says that you are prepared to take
> the risk of confounding your results. You've already been fishing for
> right-hand-side variables anyway, so your results are somewhat
> prejudiced, aren't they? I think you may also like to review which of
> the right-hand-side variables should be treated as factors rather than
> numeric (the summary suggests that many are factors), and perhaps the
> dependent variable too, although lrm() seems to take care of this if you
> haven't.

Thanks for your informative reply. The nature of the research is habitat
selection for 15 species of grassland birds (my masters project). The
response here is presence of Chestnut-collared Longspur (cclo). I very
carefully reviewed the variables for collinearity, and none of them showed
any difficulty except in a few cases, which led me to split what would
otherwise have been one reasonable model into two. Just out of interest,
however, I did run the global model, and this problem didn't occur. Based
on your comments, that seems to indicate I'm seeing an interaction effect,
not a result of two closely correlated variables.

I don't think I've been fishing. I selected variables for inclusion in
competing models based on ecologically reasonable criteria. I did examine
relationships between species occurrence and static variables such as
DEM-derived variables, to see whether the data supported including all the
variables that, on ecological grounds, should explain the birds'
distribution. I have included some variables in spite of weak statistical
relationships, based on a paper by Anderson, Burnham and Thompson in the
Journal of Wildlife Management, which discusses how ecologically
significant factors can contribute to a model through interaction effects,
helping explain the response variable even though they are not
statistically significant individually. So I've tried to avoid fishing,
and instead simply to select the most parsimonious models from the set of
candidate models for each species.

Based on your advice, once I've selected my top candidate models I'll
re-run them at lower tolerances and only keep models that pass at that
level. Alternatively, I could simply reject models that don't pass at lower
tolerances. I do find it curious, however, that they run fine using glm
(family = binomial) without complaint.

Thanks again.

T
-- 
Trevor Wiens
[EMAIL PROTECTED]

The significant problems that we face cannot be solved at the same level of
thinking we were at when we created them. (Albert Einstein)
Re: [R] Logistic regression goodness of fit tests
On Thu, 10 Mar 2005 16:19:41 -0600 Frank E Harrell Jr [EMAIL PROTECTED]
wrote:

> The goodness of fit test only works on prespecified models. It is not
> valid when stepwise variable selection is used (unless perhaps you use
> alpha = 0.5).

Perhaps I'm blind, but I can't find any reference to an alpha parameter on
the help page for stepAIC. It runs fine, however, when I set the parameter,
and produces a model with the same right-hand-side variables as without.
Can you tell me what it is?

T
-- 
Trevor Wiens
[EMAIL PROTECTED]

The significant problems that we face cannot be solved at the same level of
thinking we were at when we created them. (Albert Einstein)
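For what it's worth, stepAIC() indeed has no alpha argument (extra named
arguments are silently absorbed, which is why the call still runs); the
knob that plays that role is the penalty multiplier k passed through to
extractAIC(). For a single-degree-of-freedom term, AIC-style selection
keeps the term when its chi-square exceeds k, so a retention level of
alpha = 0.5 corresponds roughly to k = qchisq(0.5, 1, lower.tail = FALSE),
about 0.455, while the default k = 2 is roughly alpha = 0.157. A hedged
sketch, assuming a previously fitted glm called full.glm:

    library(MASS)
    ## Express an alpha = 0.5 retention level as an AIC penalty k:
    ## a 1-df term is kept when chisq > k, i.e. when
    ## p < pchisq(k, 1, lower.tail = FALSE).
    k50 <- qchisq(0.5, df = 1, lower.tail = FALSE)   # ~ 0.455
    step50 <- stepAIC(full.glm, k = k50)             # full.glm: hypothetical fit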
RE: [R] logistic regression and 3PL model
Dear Mike,

You appear to have misinterpreted Brian Ripley's advice. The logitreg()
function that you've included in your message hasn't been modified to
specify a lower bound (higher than 0) for the probability of success. The
lower argument, which gets passed through to nlminb(), specifies lower
bounds for the parameters, not for the fitted probabilities.

You also apparently have sent the S-PLUS version of the function. To get
this to run in R, you should replace the call to nlminb() with one to
optim(), and you'll have to modify the local function gmin() slightly.

I hope this helps,
 John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140 x23604
http://socserv.mcmaster.ca/jfox

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
Michael Y. Lau
Sent: Saturday, February 19, 2005 9:46 PM
To: r-help@stat.math.ethz.ch
Subject: [R] logistic regression and 3PL model

Hello colleagues,

This is a follow-up to a question I posed in November regarding an analysis
I was working on. Thank you to Dr. Brian Ripley and Dr. John Fox for
helping me out at that time.

I am conducting a logistic regression on a data set of psi (ESP) ganzfeld
trials. The response variable is binary (correct/incorrect), with a 25%
guessing base rate. Dr. Ripley suggested that I modify the maximum-
likelihood fitting of a binomial logistic regression presented in MASS
(p. 445):

    logitreg <- function(x, y, wt = rep(1, length(y)), intercept = T,
                         start = rep(0, p), ...)
    {
      fmin <- function(beta, X, y, w) {
        p <- plogis(X %*% beta)
        -sum(2 * w * ifelse(y, log(p), log(1 - p)))
      }
      gmin <- function(beta, X, y, w) {
        eta <- X %*% beta; p <- plogis(eta)
        -2 * (w * dlogis(eta) * ifelse(y, 1/p, -1/(1 - p))) %*% X
      }
      if (is.null(dim(x))) dim(x) <- c(length(x), 1)
      dn <- dimnames(x)[[2]]
      if (!length(dn)) dn <- paste("Var", 1:ncol(x), sep = "")
      p <- ncol(x) + intercept
      if (intercept) { x <- cbind(1, x); dn <- c("(Intercept)", dn) }
      if (is.factor(y)) y <- (unclass(y) != 1)
      fit <- nlminb(start, fmin, gmin, X = x, y = y, w = wt,
                    method = "BFGS", ...)
      names(fit$par) <- dn
      cat("\nCoefficients:\n"); print(fit$par)
      # R: use fit$value and fit$convergence
      cat("\nResidual Deviance:", format(fit$objective), "\n")
      cat("\nConvergence message:", fit$message, "\n")
      invisible(fit)
    }

I have been using lower = .25 in the call to logitreg:

    options(contrasts = c("contr.treatment", "contr.poly"))
    X <- model.matrix(longhit ~ age*gender, data = data)[, -1]
    logitreg(X, data$longhit, lower = .25)

However, this is giving me .25 as the value for the intercept and all the
regression weights. Am I correcting for the guessing base rate correctly?
Any suggestions or pointers would be greatly appreciated. Thank you in
advance.

Mike Lau
-- 
Michael Y. Lau, M.A.
118 Haggar Hall
Department of Psychology
University of Notre Dame
Notre Dame, IN 46556
(574) 631-1904
[EMAIL PROTECTED]
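The substantive fix John's reply points at is to change the likelihood
itself so that fitted probabilities are floored at the guessing rate,
rather than bounding the coefficients. A minimal R sketch of that idea
follows; it is an editor's illustration under stated assumptions, not code
from the thread, and logitreg3pl is a hypothetical name:

    ## Logistic regression with a fixed guessing floor (3PL-style lower
    ## asymptote). Fitted success probability is guess + (1-guess)*plogis(eta),
    ## so it can never fall below the base rate. Uses optim's numeric gradient.
    logitreg3pl <- function(x, y, wt = rep(1, length(y)),
                            guess = 0.25, intercept = TRUE, ...) {
      fmin <- function(beta, X, y, w) {
        p <- guess + (1 - guess) * plogis(X %*% beta)  # floored probability
        -sum(2 * w * ifelse(y, log(p), log(1 - p)))    # -2 * log-likelihood
      }
      if (is.null(dim(x))) dim(x) <- c(length(x), 1)
      if (is.null(colnames(x))) colnames(x) <- paste0("Var", seq_len(ncol(x)))
      if (intercept) x <- cbind("(Intercept)" = 1, x)
      if (is.factor(y)) y <- (unclass(y) != 1)
      fit <- optim(rep(0, ncol(x)), fmin, X = x, y = y, w = wt,
                   method = "BFGS", ...)
      names(fit$par) <- colnames(x)
      fit
    }

With the poster's design matrix, logitreg3pl(X, data$longhit) would then
estimate slopes above a fixed 25% floor instead of clamping the
coefficients themselves at 0.25.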
[R] logistic regression and 3PL model
Hello colleagues,

This is a follow-up to a question I posed in November regarding an analysis
I was working on. Thank you to Dr. Brian Ripley and Dr. John Fox for
helping me out at that time.

I am conducting a logistic regression on a data set of psi (ESP) ganzfeld
trials. The response variable is binary (correct/incorrect), with a 25%
guessing base rate. Dr. Ripley suggested that I modify the maximum-
likelihood fitting of a binomial logistic regression presented in MASS
(p. 445):

    logitreg <- function(x, y, wt = rep(1, length(y)), intercept = T,
                         start = rep(0, p), ...)
    {
      fmin <- function(beta, X, y, w) {
        p <- plogis(X %*% beta)
        -sum(2 * w * ifelse(y, log(p), log(1 - p)))
      }
      gmin <- function(beta, X, y, w) {
        eta <- X %*% beta; p <- plogis(eta)
        -2 * (w * dlogis(eta) * ifelse(y, 1/p, -1/(1 - p))) %*% X
      }
      if (is.null(dim(x))) dim(x) <- c(length(x), 1)
      dn <- dimnames(x)[[2]]
      if (!length(dn)) dn <- paste("Var", 1:ncol(x), sep = "")
      p <- ncol(x) + intercept
      if (intercept) { x <- cbind(1, x); dn <- c("(Intercept)", dn) }
      if (is.factor(y)) y <- (unclass(y) != 1)
      fit <- nlminb(start, fmin, gmin, X = x, y = y, w = wt,
                    method = "BFGS", ...)
      names(fit$par) <- dn
      cat("\nCoefficients:\n"); print(fit$par)
      # R: use fit$value and fit$convergence
      cat("\nResidual Deviance:", format(fit$objective), "\n")
      cat("\nConvergence message:", fit$message, "\n")
      invisible(fit)
    }

I have been using lower = .25 in the call to logitreg:

    options(contrasts = c("contr.treatment", "contr.poly"))
    X <- model.matrix(longhit ~ age*gender, data = data)[, -1]
    logitreg(X, data$longhit, lower = .25)

However, this is giving me .25 as the value for the intercept and all the
regression weights. Am I correcting for the guessing base rate correctly?
Any suggestions or pointers would be greatly appreciated. Thank you in
advance.

Mike Lau
-- 
Michael Y. Lau, M.A.
118 Haggar Hall
Department of Psychology
University of Notre Dame
Notre Dame, IN 46556
(574) 631-1904
[EMAIL PROTECTED]
[R] logistic regression
Hi,

I'm using the glm function to do logistic regression, and I want to know
whether there is some kind of R-squared for this function in order to check
the model. Thank you.
Re: [R] logistic regression
Terry Elrod posted functions for (pseudo-)R^2 on S-news back in 1998:

http://www.biostat.wustl.edu/archives/html/s-news/1998-02/msg00267.html

I am not sure that it performs well as a check of the model, so you might
want to investigate that further.

[EMAIL PROTECTED] wrote:
> I'm using the glm function to do logistic regression, and I want to know
> whether there is some kind of R-squared for this function in order to
> check the model.

-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 452-1424 (M, W, F)
fax: (917) 438-0894
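A common quick alternative is McFadden's pseudo-R^2, which can be computed
directly from any binomial glm; a minimal sketch with hypothetical names
(df, y, x1, x2):

    ## McFadden's pseudo-R^2: 1 - (residual deviance / null deviance)
    fit <- glm(y ~ x1 + x2, family = binomial, data = df)
    1 - fit$deviance / fit$null.deviance

Note that this is only one of several definitions; lrm() in the Design
package, for instance, reports a Nagelkerke R^2 in its printout.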
[R] logistic regression
Helene,

You should read up on AIC, deviance, and deviance residuals, then look at
the summary() of your model.

Dan

Dr. Daniel P. Bebber
Department of Plant Sciences
University of Oxford
OX1 3RB, UK

On Tue, 8 Feb 2005, [EMAIL PROTECTED] wrote:
> Hi, I'm using the glm function to do logistic regression, and I want to
> know whether there is some kind of R-squared for this function in order
> to check the model. Thank you.
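Concretely, the quantities Dan mentions are each one call away for a fitted
binomial glm; a sketch, where fit is a hypothetical fitted model object:

    summary(fit)                        # coefficients, null vs residual deviance
    AIC(fit)                            # Akaike information criterion
    residuals(fit, type = "deviance")   # deviance residuals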