Re: [R] testing non-linear component in mgcv:gam
In this case the models being compared are really identical, and the P-value is meaningless numerical noise. If your main focus is hypothesis testing and you really need near exact p-values, then do this sort of testing using unpenalized models. i.e. don't have mgcv::gam estimate the EDF of the smooth, just use `s(...,fx=TRUE)' to estimate it unpenalized. This gives horrible point estimates and excellent p-values. best, Simon Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 # testing non-linear contribution sed.lin - gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 anova.gam(sed.lin, sed.gam, test=F) Analysis of Deviance Table Model 1: wm.sed ~ Temp Model 2: wm.sed ~ s(Temp) Resid. Df Resid. Dev Df Deviance F Pr(F) 1 3.5000e+01 3.279 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] testing non-linear component in mgcv:gam
Looks like a bug in mgcv::summary.gam when the model is strictly parametric... I'll take a look and fix it. thanks, Simon In my original message I mentioned a gam fit that turns out to be a linear fit. By curiosity I analysed it with a linear predictor only with mgcv package, and then as a linear model. The output was identical in both, but the r-sq (adj) was 0.55 in mgcv and 0.26 in lm. In lm I hope that my interpretation that 26% of the variance in y is explained by the linear relationship with x is valid. Then what does r2 mean in mgcv? Denis summary.gam(lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 summary(sed.true.lin) Call: lm(formula = wm.sed ~ Temp, weights = N.sets) Residuals: Min 1Q Median 3Q Max -0.6138 -0.1312 -0.0325 0.1089 1.1449 Coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.3061 on 35 degrees of freedom Multiple R-Squared: 0.285,Adjusted R-squared: 0.2646 F-statistic: 13.95 on 1 and 35 DF, p-value: 0.000666 Le 05-10-05 à 09:45, John Fox a écrit : Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms:
Re: [R] testing non-linear component in mgcv:gam
I think you probably should state more clearly the goal of your analysis. In such situation, estimation and hypothesis testing are quite different. The procedure that gives you the `best' estimate is not necessarily the `best' for testing linearity of components. If your goal is estimation/prediction, why test linearity of components? If the goal is testing linearity, then you probably need to find the procedure that gives you a good test, rather than relying on what gam() gives you. Just my $0.02... Andy From: Denis Chabot Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 # testing non-linear contribution sed.lin - gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 anova.gam(sed.lin, sed.gam, test=F) Analysis of Deviance Table Model 1: wm.sed ~ Temp Model 2: wm.sed ~ s(Temp) Resid. Df Resid. Dev Df Deviance F Pr(F) 1 3.5000e+01 3.279 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] testing non-linear component in mgcv:gam
Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 # testing non-linear contribution sed.lin - gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 anova.gam(sed.lin, sed.gam, test=F) Analysis of Deviance Table Model 1: wm.sed ~ Temp Model 2: wm.sed ~ s(Temp) Resid. Df Resid. Dev Df Deviance F Pr(F) 1 3.5000e+01 3.279 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] testing non-linear component in mgcv:gam
Fair enough, Andy. I thought I was getting both predictive ability and confirmation that the phenomenon I was studying was not linear. I have two projects, in one prediction is the goal and I don't really need to test linearity. In the second I needed to confirm a cycle was taking place and I thought my procedure did this. There was no theoretical reason to anticipate the shape of the cycle, so GAM was an appealing methodology. Denis Le 05-10-05 à 09:38, Liaw, Andy a écrit : I think you probably should state more clearly the goal of your analysis. In such situation, estimation and hypothesis testing are quite different. The procedure that gives you the `best' estimate is not necessarily the `best' for testing linearity of components. If your goal is estimation/prediction, why test linearity of components? If the goal is testing linearity, then you probably need to find the procedure that gives you a good test, rather than relying on what gam() gives you. Just my $0.02... Andy From: Denis Chabot Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 # testing non-linear contribution sed.lin - gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 anova.gam(sed.lin, sed.gam, test=F) Analysis of Deviance Table Model 1: wm.sed ~ Temp Model 2: wm.sed ~ s(Temp) Resid. Df Resid. Dev Df Deviance F Pr(F) 1 3.5000e+01 3.279 2 3.5000e+01 3.279 5.5554e-10 2.353e-11 0.4521 6.18e-09 *** Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Notice: This
Re: [R] testing non-linear component in mgcv:gam
Hi John, Le 05-10-05 à 09:45, John Fox a écrit : Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John This is one of my difficulties. In some examples I found on the web, the difference in deviance is compared directly against the chi- squared distribution. But my y variable has a very small range (between 0 and 0.5, most of the time) so the difference in deviance is always very small and if I compared it against the chi-squared distribution as I have seen done in examples, the non-linear component would always be not significant. Yet it is (with one exception), tested with both mgcv:gam and gam:gam. I think the examples I have read were wrong in this regard, the scale factor seen in mgcv output seems to intervene. But exactly how is still mysterious to me and I hesitate to judge the size of the deviance difference myself. I agree it is near zero in my example. I guess I need to have more experience with these models to better interpret the output... Denis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 # testing non-linear contribution sed.lin - gam(wm.sed~Temp,weight=N.sets) summary.gam(sed.lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 anova.gam(sed.lin, sed.gam, test=F) Analysis of Deviance Table Model 1: wm.sed ~ Temp Model 2: wm.sed ~ s(Temp) Resid. Df Resid. Dev Df Deviance F Pr(F) 1 3.5000e+01 3.279 2 3.5000e+01 3.279
Re: [R] testing non-linear component in mgcv:gam
Dear Denis, The chi-square test is formed in analogy to what's done for a GLM: The difference in residual deviance for the nested models is divided by the estimated scale parameter -- i.e., the estimated error variance for a model with normal errors. Otherwise, as you point out, the test would be dependent upon the scale of the response. John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 9:04 AM To: John Fox Cc: R list Subject: Re: [R] testing non-linear component in mgcv:gam Hi John, Le 05-10-05 à 09:45, John Fox a écrit : Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John This is one of my difficulties. In some examples I found on the web, the difference in deviance is compared directly against the chi- squared distribution. But my y variable has a very small range (between 0 and 0.5, most of the time) so the difference in deviance is always very small and if I compared it against the chi-squared distribution as I have seen done in examples, the non-linear component would always be not significant. Yet it is (with one exception), tested with both mgcv:gam and gam:gam. I think the examples I have read were wrong in this regard, the scale factor seen in mgcv output seems to intervene. But exactly how is still mysterious to me and I hesitate to judge the size of the deviance difference myself. I agree it is near zero in my example. I guess I need to have more experience with these models to better interpret the output... Denis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value s(Temp) 11 13.95 0.000666
Re: [R] testing non-linear component in mgcv:gam
Thank you everyone for your help, but my introduction to GAM is turning my brain to mush. I thought the one part of the output I understood the best was r-sq (adj), but now even this is becoming foggy. In my original message I mentioned a gam fit that turns out to be a linear fit. By curiosity I analysed it with a linear predictor only with mgcv package, and then as a linear model. The output was identical in both, but the r-sq (adj) was 0.55 in mgcv and 0.26 in lm. In lm I hope that my interpretation that 26% of the variance in y is explained by the linear relationship with x is valid. Then what does r2 mean in mgcv? Denis summary.gam(lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 summary(sed.true.lin) Call: lm(formula = wm.sed ~ Temp, weights = N.sets) Residuals: Min 1Q Median 3Q Max -0.6138 -0.1312 -0.0325 0.1089 1.1449 Coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3061 on 35 degrees of freedom Multiple R-Squared: 0.285,Adjusted R-squared: 0.2646 F-statistic: 13.95 on 1 and 35 DF, p-value: 0.000666 Le 05-10-05 à 09:45, John Fox a écrit : Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949, 0.054757925, 0.142810440, 0.168005168, 0.180804428, 0.111439628, 0.128799505, 0.193707937, 0.105921610, 0.103497845, 0.028591837, 0.217894389, 0.020535469, 0.080389068, 0.105234450, 0.070213450, 0.050771363, 0.042074434, 0.102348837, 0.049748344, 0.019100478, 0.005203125, 0.101711864, 0.0, 0.0, 0.014808824, 0.0, 0.22200, 0.16700, 0.0, 0.0, 0.0) sed.gam - gam(wm.sed~s(Temp),weight=N.sets) summary.gam(sed.gam) Family: gaussian Link function: identity Formula: wm.sed ~ s(Temp) Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.084030.01347 6.241 3.73e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Est.rank F p-value
Re: [R] testing non-linear component in mgcv:gam
Dear Denis, You got me: I would have thought from ?summary.gam that this would be the same as the adjusted R^2 for a linear model. Note, however, that the percentage of deviance explained checks with the R^2 from the linear model, as expected. Maybe you should address this question to the package author. Regards, John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: Denis Chabot [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 05, 2005 3:33 PM To: John Fox Cc: R list Subject: Re: [R] testing non-linear component in mgcv:gam Thank you everyone for your help, but my introduction to GAM is turning my brain to mush. I thought the one part of the output I understood the best was r-sq (adj), but now even this is becoming foggy. In my original message I mentioned a gam fit that turns out to be a linear fit. By curiosity I analysed it with a linear predictor only with mgcv package, and then as a linear model. The output was identical in both, but the r-sq (adj) was 0.55 in mgcv and 0.26 in lm. In lm I hope that my interpretation that 26% of the variance in y is explained by the linear relationship with x is valid. Then what does r2 mean in mgcv? Denis summary.gam(lin) Family: gaussian Link function: identity Formula: wm.sed ~ Temp Parametric coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 R-sq.(adj) = 0.554 Deviance explained = 28.5% GCV score = 0.09904 Scale est. = 0.093686 n = 37 summary(sed.true.lin) Call: lm(formula = wm.sed ~ Temp, weights = N.sets) Residuals: Min 1Q Median 3Q Max -0.6138 -0.1312 -0.0325 0.1089 1.1449 Coefficients: Estimate Std. Error t value Pr(|t|) (Intercept) 0.162879 0.019847 8.207 1.14e-09 *** Temp-0.023792 0.006369 -3.736 0.000666 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.3061 on 35 degrees of freedom Multiple R-Squared: 0.285,Adjusted R-squared: 0.2646 F-statistic: 13.95 on 1 and 35 DF, p-value: 0.000666 Le 05-10-05 à 09:45, John Fox a écrit : Dear Denis, Take a closer look at the anova table: The models provide identical fits to the data. The differences in degrees of freedom and deviance between the two models are essentially zero, 5.5554e-10 and 2.353e-11 respectively. I hope this helps, John John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Denis Chabot Sent: Wednesday, October 05, 2005 8:22 AM To: r-help@stat.math.ethz.ch Subject: [R] testing non-linear component in mgcv:gam Hi, I need further help with my GAMs. Most models I test are very obviously non-linear. Yet, to be on the safe side, I report the significance of the smooth (default output of mgcv's summary.gam) and confirm it deviates significantly from linearity. I do the latter by fitting a second model where the same predictor is entered without the s(), and then use anova.gam to compare the two. I thought this was the equivalent of the default output of anova.gam using package gam instead of mgcv. I wonder if this procedure is correct because one of my models appears to be linear. In fact mgcv estimates df to be exactly 1.0 so I could have stopped there. However I inadvertently repeated the procedure outlined above. I would have thought in this case the anova.gam comparing the smooth and the linear fit would for sure have been not significant. To my surprise, P was 6.18e-09! Am I doing something wrong when I attempt to confirm the non- parametric part a smoother is significant? Here is my example case where the relationship does appear to be linear: library(mgcv) This is mgcv 1.3-7 Temp - c(-1.38, -1.12, -0.88, -0.62, -0.38, -0.12, 0.12, 0.38, 0.62, 0.88, 1.12, 1.38, 1.62, 1.88, 2.12, 2.38, 2.62, 2.88, 3.12, 3.38, 3.62, 3.88, 4.12, 4.38, 4.62, 4.88, 5.12, 5.38, 5.62, 5.88, 6.12, 6.38, 6.62, 6.88, 7.12, 8.38, 13.62) N.sets - c(2, 6, 3, 9, 26, 15, 34, 21, 30, 18, 28, 27, 27, 29, 31, 22, 26, 24, 23, 15, 25, 24, 27, 19, 26, 24, 22, 13, 10, 2, 5, 3, 1, 1, 1, 1, 1) wm.sed - c(0.0, 0.016129032, 0.0, 0.062046512, 0.396459596, 0.189082949