RE: [R] What is the most useful way to detect nonlinearity in logistic regression
On 05-Dec-04 Patrick Foley wrote:
> It is easy to spot response nonlinearity in normal linear models using
> plot(something.lm). However plot(something.glm) produces artifactual
> peculiarities since the diagnostic residuals are constrained by the
> fact that y can only take values 0 or 1. What do R users find most
> useful in checking the linearity assumption of logistic regression
> (i.e. log-odds = a + b*x)?
> Patrick Foley
> [EMAIL PROTECTED]

The most useful way to detect nonlinearity in logistic regression is:

a) have an awful lot of data;
b) have the x (covariate) values judiciously placed.

Don't be optimistic about this problem. The amount of information, especially about non-linearity, in the binary responses is often a lot less than people intuitively expect. This is an area where R can be especially useful for self-education, by exploring possibilities and simulation.

For example, define the function (for quadratic nonlinearity):

  testlin2 <- function(a, b, N){
    x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
    lp <- a*x + b*x^2
    p  <- exp(lp)/(1 + exp(lp))
    n  <- N*c(1,1,1,1,1)
    r  <- c(rbinom(1, n[1], p[1]), rbinom(1, n[2], p[2]),
            rbinom(1, n[3], p[3]), rbinom(1, n[4], p[4]),
            rbinom(1, n[5], p[5]))
    resp <- cbind(r, n - r)
    X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
    summary(glm(formula = resp ~ X - 1, family = binomial), correlation = TRUE)
  }

This places N observations at each of (-1.0, -0.5, 0.0, 0.5, 1.0), generates the N binary responses with probability p(x) where log(p/(1-p)) = a*x + b*x^2, fits a logistic regression forcing the intercept term to be 0 (so that you're not diluting the information by estimating a parameter you know to be 0), and returns the summary(glm(...)) from which the p-values can be extracted. The p-value for x^2 is

  testlin2(a, b, N)$coefficients[2, 4]

You can run this function as a one-off for various values of a, b, N to get a feel for what happens. You can run a simulation on the lines of

  pvals <- numeric(1000)
  for(i in (1:1000)){
    pvals[i] <- testlin2(1, 0.1, 500)$coefficients[2, 4]
  }

so that you can test how often you get a significant result. For example, adopting the ritual "significant == P < 0.05, power = 80%", you can see a histogram of the p-values over the conventional significance breaks with

  hist(pvals, breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1), freq=TRUE)

and you can see your probability of getting a significant result as e.g.

  sum(pvals < 0.05)/1000

I found that, with testlin2(1, 0.1, N), i.e. a = 1.0, b = 0.1, corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly interesting degree of nonlinearity), I had to go up to N = 2000 before I was getting more than 80% of the p-values < 0.05. That corresponds to 2000 observations at each value of x, or 10,000 observations in all. Compare this with a similar test for non-linearity with normally-distributed responses [exercise for the reader].

You can write functions similar to testlin2 for higher-order nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for a*x + b*x^2 + c*x^3, etc. (the modifications required are obvious), and see how you get on. As I say, don't be optimistic! In particular, run testlin3 a few times and see the sort of mess that can come out -- in particular gruesome correlations, which is why correlation=TRUE is set in the call to summary(glm(...), correlation=TRUE).

Best wishes,
Ted.
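For readers who would rather not work out the "obvious" modifications themselves, here is a sketch of what a cubic version along Ted's lines might look like. This is not from his post: the name testlin3 and the vectorised rbinom call are my own choices, following the same pattern as testlin2 above.

  testlin3 <- function(a, b, N){
    ## same design points and per-point sample sizes as testlin2
    x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
    lp <- a*x + b*x^3                   # cubic instead of quadratic nonlinearity
    p  <- exp(lp)/(1 + exp(lp))
    n  <- rep(N, 5)
    r  <- rbinom(5, n, p)               # one binomial draw per design point
    resp <- cbind(r, n - r)
    X <- cbind(x, x^3); colnames(X) <- c("x", "x3")
    summary(glm(resp ~ X - 1, family = binomial), correlation = TRUE)
  }

  ## p-value for the x^3 term, extracted as before:
  ## testlin3(1, 0.1, 500)$coefficients[2, 4]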
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
(Ted Harding) wrote:
> [...]
  library(Design)   # also requires Hmisc

  f <- lrm(sick ~ sex + rcs(age,4) + rcs(blood.pressure,5))
                    # restricted cubic spline in age with 4 knots,
                    # blood.pressure with 5
  anova(f)          # automatic tests of linearity of all predictors
  latex(f)          # see algebraic form of fit
  summary(f)        # get odds ratios for meaningful changes in predictors

But beware of using the tests of linearity. If non-significant results cause you to reduce an effect to linear, confidence levels and type I errors are no longer preserved. I use tests of linearity mainly to demonstrate that effects are more often nonlinear than linear, given sufficient sample size. I.e., good analysts are needed. I usually leave non-significant nonlinearities in the model.

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
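For readers who want something they can paste and run, here is a minimal self-contained sketch along the lines of Frank's suggestion, using simulated data. The data set, variable values, and effect sizes below are illustrative assumptions, not part of his post. Note that summary() on a Design/lrm fit needs a datadist object so it knows which predictor ranges to use; anova() does not.

  library(Hmisc)
  library(Design)      # the successor package "rms" uses essentially the same calls

  set.seed(1)
  n   <- 1000
  age <- rnorm(n, 50, 10)
  blood.pressure <- rnorm(n, 120, 15)
  sex <- factor(sample(c("female", "male"), n, replace = TRUE))

  ## simulate a response that is deliberately nonlinear in age
  lp   <- -1 + 0.03*(age - 50) + 0.002*(age - 50)^2 +
          0.01*(blood.pressure - 120) + 0.3*(sex == "male")
  sick <- rbinom(n, 1, plogis(lp))

  dd <- datadist(age, blood.pressure, sex)
  options(datadist = "dd")

  f <- lrm(sick ~ sex + rcs(age, 4) + rcs(blood.pressure, 5))
  anova(f)      # includes "Nonlinear" test lines for age and blood.pressure
  summary(f)    # odds ratios for interquartile-range changes in the predictors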
RE: [R] What is the most useful way to detect nonlinearity in logistic regression
On 05-Dec-04 Ted Harding wrote:
> [...]
> For example, adopting the ritual "significant == P < 0.05, power = 80%",
> you can see a histogram of the p-values over the conventional
> significance breaks with
>
>   hist(pvals, breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1), freq=TRUE)

Sorry for the typo! That should of course (if you were looking closely) have been

  hist(pvals, breaks=c(0,0.01,0.05,0.1,0.5,0.9,0.95,0.99,1), freq=TRUE)

Apologies to anyone who was misled by pasting in the bad version.

> and you can see your probability of getting a significant result
> as e.g.
>
>   sum(pvals < 0.05)/1000
> [...]

Best wishes to all,
Ted.
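A small follow-on, not from the thread: with the corrected breaks, the first two histogram bins together count the p-values at or below 0.05, so the power estimate can also be read off the histogram object itself. A sketch:

  h <- hist(pvals, breaks=c(0,0.01,0.05,0.1,0.5,0.9,0.95,0.99,1), plot=FALSE)
  h$counts                             # counts per significance bin
  sum(h$counts[1:2])/length(pvals)     # essentially mean(pvals <= 0.05)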
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
On 05-Dec-04 Peter Dalgaard wrote:
> Peter Dalgaard [EMAIL PROTECTED] writes:
>
> > Re. the smoothed residuals, you do need to be careful about the
> > smoother. Some of the robust ones will do precisely the wrong thing
> > in this context: you really are interested in the mean, not some
> > trimmed mean (which can easily amount to throwing away all your
> > cases...). Here's an idea:
> >
> >   x  <- runif(500)
> >   y  <- rbinom(500, size=1, p=plogis(x))
> >   xx <- predict(loess(resid(glm(y~x,binomial)) ~ x), se=TRUE)
> >   matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch=20)
> >
> > Not sure my money isn't still on the splines, though.
>
> Doh. You might also want to make sure that the residuals are of a type
> that can be _expected_ to have mean zero. Apparently, the default
> deviance residuals do not have that property, whereas response
> residuals do. I did check that loess (as opposed to lowess!) does a
> plain least-squares based fit by default, but I didn't think to check
> what kind of residuals I was looking at. Serves me right for posting
> way beyond my bedtime...

Hi Peter,

Yes, the above is certainly misleading (try it with 2000 instead of 500)! But what would you suggest instead?

Thanks,
Ted.
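Peter's point about the residual type can be checked numerically; a small sketch (my illustration, continuing the simulated x and y above, so the numbers will vary with the random draw):

  fit <- glm(y ~ x, family = binomial)
  mean(residuals(fit, type = "deviance"))   # typically not zero
  mean(residuals(fit, type = "response"))   # essentially zero: sum(y - fitted) = 0
                                            # when the model includes an intercept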
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
(Ted Harding) [EMAIL PROTECTED] writes:

> > >   x  <- runif(500)
> > >   y  <- rbinom(500, size=1, p=plogis(x))
> > >   xx <- predict(loess(resid(glm(y~x,binomial)) ~ x), se=TRUE)
> > >   matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch=20)
> > >
> > > Not sure my money isn't still on the splines, though.
> > [...]
> > Serves me right for posting way beyond my bedtime...
>
> Hi Peter,
>
> Yes, the above is certainly misleading (try it with 2000 instead of
> 500)! But what would you suggest instead?

(I did, and this little computer came tumbling down...)

Basically, I'd reconsider the type= option to residuals.glm. As I said, at least type="response" should have the right mean. Ideally, you'd want to take advantage of the fact that the variance of the residuals is known too, rather than have the smoother estimate it.

The more I think, the more I like the splines...

--
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark       Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                      FAX: (+45) 35327907
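A sketch of what that change looks like in practice (my illustration, not code from the thread; the sample size and seed are arbitrary): smooth the response-scale residuals, which have mean zero under a correctly specified model, and look for systematic departures from zero.

  set.seed(1)
  x   <- runif(2000)
  y   <- rbinom(2000, size = 1, prob = plogis(x))
  fit <- glm(y ~ x, family = binomial)
  r   <- residuals(fit, type = "response")     # y - fitted(fit)
  sm  <- predict(loess(r ~ x), se = TRUE)
  o   <- order(x)
  matplot(x[o], cbind(sm$fit, 2*sm$se.fit, -2*sm$se.fit)[o, ],
          type = "l", lty = c(1, 2, 2),
          xlab = "x", ylab = "smoothed response residual")
  abline(h = 0, lty = 3)                       # the curve should hug zero if the
                                               # log-odds really are linear in x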
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
On 05-Dec-04 Peter Dalgaard wrote:
> (Ted Harding) [EMAIL PROTECTED] writes:
>
> > >   x  <- runif(500)
> > >   y  <- rbinom(500, size=1, p=plogis(x))
> > >   xx <- predict(loess(resid(glm(y~x,binomial)) ~ x), se=TRUE)
> > >   matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch=20)
> > >
> > > Not sure my money isn't still on the splines, though.
> > [...]
> >
> > Hi Peter,
> > Yes, the above is certainly misleading (try it with 2000 instead
> > of 500)! But what would you suggest instead?
>
> (I did, and this little computer came tumbling down...)

So did mine -- but at 5000 (which is the value I first tried): lots of disk grinding, and then it went prprprprp and wrote words to the effect "Calloc cannot allocate (18790050 times 4)", i.e. it needed 72MB, which bankrupted my 192MB baby. 2000 was OK, however, but I had plenty of time for a meal etc. before it finished. Which brings up the point that predict(loess()) seems to be very memory-hungry.

> Basically, I'd reconsider the type= option to residuals.glm. As I
> said, at least type="response" should have the right mean. Ideally,
> you'd want to take advantage of the fact that the variance of the
> residuals is known too, rather than have the smoother estimate it.
>
> The more I think, the more I like the splines...

I'll have a go at your suggestions (if I can get the syntax right ...)

Thanks,
Ted.
RE: [R] What is the most useful way to detect nonlinearity in logistic regression
> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Sunday, December 05, 2004 7:14 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [R] What is the most useful way to detect nonlinearity
>   in logistic regression
>
> [...]
> 2000 was OK, however, but I had plenty of time for a meal etc. before
> it finished. Which brings up the point that predict(loess()) seems to
> be very memory-hungry.

locfit to the rescue, perhaps?

  library(locfit)
  n <- 5000
  x <- sort(runif(n))
  y <- rbinom(n, size=1, p=plogis(x))
  system.time(xx <- predict(locfit(resid(glm(y~x, binomial)) ~ x),
                            where="data", se=TRUE),
              gcFirst=TRUE)
  [1] 0.79 0.00 0.84 NA NA
  matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch=20)

[The plot looks strange...]

This is on my mobile Pentium 1.6GHz w/512MB laptop. Using loess it also ran out of memory. At n=2000, the loess route took just under 3 seconds.

Cheers,
Andy
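Since both Peter and Frank keep coming back to splines, here is a minimal sketch (my illustration, not from the thread) of the plain base-R spline route: model the log-odds as a natural spline in x and test it against the linear fit with a likelihood-ratio test. The data below are simulated with a genuinely linear log-odds, so the test should usually be non-significant.

  library(splines)

  set.seed(1)
  n <- 5000
  x <- runif(n)
  y <- rbinom(n, size = 1, prob = plogis(x))   # true log-odds linear in x

  fit.lin <- glm(y ~ x, family = binomial)
  fit.ns  <- glm(y ~ ns(x, df = 4), family = binomial)   # natural cubic spline, 4 df

  anova(fit.lin, fit.ns, test = "Chisq")   # LR test of nonlinearity (3 extra df)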