Re: [R] drop1() seems to give unexpected results compare to anova()

2008-08-03 Thread Thomas P C Chu
Thanks to Mr Dalgaard for his advice and everyone else who has 
contributed. Inclusion of an error term at the end of sim.set$y = ... 
line did cure my problems with drop1() and step().


I suppose it is my own inexperience in carrying out simulations caused 
such gaffe.


Thomas

AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] drop1() seems to give unexpected results compare to anova()

2008-08-02 Thread Peter Dalgaard

Thomas P C Chu wrote:

Dear all,

I have been trying to investigate the behaviour of different weights 
in weighted regression for a dataset with lots of missing data. As a 
start I simulated some data using the following:


library(MASS)
N - 200
sigma - matrix(c(1, .5, .5, 1), nrow = 2)
sim.set - as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) - c('x1', 'x2') # x1  x2 are correlated
sim.set$x3 - rnorm(N, 0, 1) # x3 is independent
sim.set$x4 - rnorm(N, 0, 1) # x4 is red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 # y is 
outcome
This is your problem: There is no noise term in there. As a result, F 
and t tests are basically 0/0 except for errors introduced by roundoff. 
These errors can affect numerators and denominators differently when 
different numerical methods are used, and it is really a toss-up whether 
this results in a significant test or not.




I then checked the correlation of my simulated data and fitted a 
linear regression to check if y = 4 * x1 + 5 * x2 + 6 * x3 indeed.


round(cor(sim.set), 2)
summary(model - lm(y ~ ., data = sim.set))

Coefficients:
Estimate Std. Error t value Pr(|t|)
(Intercept) -5.423e-17 1.256e-16 -4.320e-01 0.666
x1 4.000e+00 1.388e-16 2.881e+16 2e-16 ***
x2 5.000e+00 1.441e-16 3.470e+16 2e-16 ***
x3 6.000e+00 1.188e-16 5.051e+16 2e-16 ***
x4 -8.150e-18 1.165e-16 -7.000e-02 0.944

anova(model)

Df Sum Sq Mean Sq F value Pr(F)
x1 1 8686.1 8686.1 2.8218e+33 2e-16 ***
x2 1 5568.7 5568.7 1.8091e+33 2e-16 ***
x3 1 7852.1 7852.1 2.5509e+33 2e-16 ***
x4 1 1.507e-32 1.507e-32 4.9000e-03 0.9443
Residuals 195 6.002e-28 3.078e-30

All was well so far, as x4 was identified as not significant and its 
coeff was almost 0 (because I made it so in the first place). Now I 
expected it to be dropped in stepwise:


step(model, direction = 'both', test = 'F')
drop1(model, test = 'F')
dropterm(model, test = 'F')

Df Sum of Sq RSS AIC F value Pr(F)
none 6.002e-28 -13585.7
x1 1 2555.1 2555.1 517.5 8.3006e+32  2.2e-16 ***
x2 1 3707.0 3707.0 591.9 1.2043e+33  2.2e-16 ***
x3 1 7851.9 7851.9 742.0 2.5508e+33  2.2e-16 ***
x4 1 2.118e-27 2.718e-27 -13285.6 6.8806e+02  2.2e-16 ***

Neither of those 3 lines of commands managed to drop x4 and its P 
value magically decreased from 0.94 to almost 0! I am also baffled by 
how R calculated those RSS. However, if I fitted the smaller model and 
compared it with the original one by hand, I got the expected answer:


summary(model2 - lm(y ~ x1 + x2 + x3, data = sim.set))
anova(model, model2, test = 'F')

Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(F)
1 195 6.0025e-28
2 196 6.0026e-28 -1 -2.e-32 0.0049 0.9443

Interestingly, if I had started from a null model I ended up with y = 
4 * x1 + 5 * x2 + 6 * x3 and R did not add x4 into the model as expected.


summary(model1 - lm(y ~ 1, data = sim.set))
step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test 
= 'F')

add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')

Df Sum of Sq RSS AIC F value Pr(F)
none 22107.0 943.1
x1 1 8686.1 13420.8 845.2 128.1478 2e-16 ***
x2 1 11658.7 10448.3 795.2 220.9377 2e-16 ***
x3 1 11045.4 11061.6 806.6 197.7096 2e-16 ***
x4 1 13.4 22093.6 944.9 0.1199 0.7295

I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux, 
with all components up to date. Thank you in advance for all your 
thoughts and replies.


Yours sincerely,
Thomas P C Chu

University of Leicester
LE1 7RH
UK


AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.




**
See what's new at http://www.aol.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.



--
  O__   Peter Dalgaard Ă˜ster Farimagsgade 5, Entr.B
 c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] drop1() seems to give unexpected results compare to anova()

2008-08-02 Thread Thomas P C Chu
I am not sure why my messages are not threaded together. Thank you to 
the author of this post:

https://stat.ethz.ch/pipermail/r-help/2008-August/169691.html
I have tried the suggestions, but I got the same results as in my 
original query:

https://stat.ethz.ch/pipermail/r-help/2008-August/169647.html

I have considered the issue of partial and sequential sum of squares. 
Given that the variable x4 (the red herring) entered the model last in 
the sequence, I thought partial and sequential SS ought to be 
numerically the same.


However, I later found out that using glm() instead of lm() gave the 
expected results. See:

https://stat.ethz.ch/pipermail/r-help/2008-August/169673.html
I have read that glm() uses iterative re-weighted least square (which I 
think is related to maximum likelihood) for fitting whereas lm() uses 
matrix decomposition. I do expect slightly different answers from these 
two functions, but not ones that were so far apart!


Now I can use glm() as a workaround, but I just want to make sure there 
are no bugs in drop1(). Hopefully more people can give their opinions 
whether there is a bug.


Thomas P C Chu

AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] drop1() seems to give unexpected results compare to anova()

2008-08-01 Thread Thomas P C Chu

Dear all,

I have been trying to investigate the behaviour of different weights in 
weighted regression for a dataset with lots of missing data. As a start 
I simulated some data using the following:


library(MASS)
N - 200
sigma - matrix(c(1, .5, .5, 1), nrow = 2)
sim.set - as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) - c('x1', 'x2') # x1  x2 are correlated
sim.set$x3 - rnorm(N, 0, 1) # x3 is independent
sim.set$x4 - rnorm(N, 0, 1) # x4 is red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 # y is 
outcome


I then checked the correlation of my simulated data and fitted a linear 
regression to check if y = 4 * x1 + 5 * x2 + 6 * x3 indeed.


round(cor(sim.set), 2)
summary(model - lm(y ~ ., data = sim.set))

Coefficients:
Estimate Std. Error t value Pr(|t|)
(Intercept) -5.423e-17 1.256e-16 -4.320e-01 0.666
x1 4.000e+00 1.388e-16 2.881e+16 2e-16 ***
x2 5.000e+00 1.441e-16 3.470e+16 2e-16 ***
x3 6.000e+00 1.188e-16 5.051e+16 2e-16 ***
x4 -8.150e-18 1.165e-16 -7.000e-02 0.944

anova(model)

Df Sum Sq Mean Sq F value Pr(F)
x1 1 8686.1 8686.1 2.8218e+33 2e-16 ***
x2 1 5568.7 5568.7 1.8091e+33 2e-16 ***
x3 1 7852.1 7852.1 2.5509e+33 2e-16 ***
x4 1 1.507e-32 1.507e-32 4.9000e-03 0.9443
Residuals 195 6.002e-28 3.078e-30

All was well so far, as x4 was identified as not significant and its 
coeff was almost 0 (because I made it so in the first place). Now I 
expected it to be dropped in stepwise:


step(model, direction = 'both', test = 'F')
drop1(model, test = 'F')
dropterm(model, test = 'F')

Df Sum of Sq RSS AIC F value Pr(F)
none 6.002e-28 -13585.7
x1 1 2555.1 2555.1 517.5 8.3006e+32  2.2e-16 ***
x2 1 3707.0 3707.0 591.9 1.2043e+33  2.2e-16 ***
x3 1 7851.9 7851.9 742.0 2.5508e+33  2.2e-16 ***
x4 1 2.118e-27 2.718e-27 -13285.6 6.8806e+02  2.2e-16 ***

Neither of those 3 lines of commands managed to drop x4 and its P value 
magically decreased from 0.94 to almost 0! I am also baffled by how R 
calculated those RSS. However, if I fitted the smaller model and 
compared it with the original one by hand, I got the expected answer:


summary(model2 - lm(y ~ x1 + x2 + x3, data = sim.set))
anova(model, model2, test = 'F')

Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(F)
1 195 6.0025e-28
2 196 6.0026e-28 -1 -2.e-32 0.0049 0.9443

Interestingly, if I had started from a null model I ended up with y = 4 
* x1 + 5 * x2 + 6 * x3 and R did not add x4 into the model as expected.


summary(model1 - lm(y ~ 1, data = sim.set))
step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test 
= 'F')

add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')

Df Sum of Sq RSS AIC F value Pr(F)
none 22107.0 943.1
x1 1 8686.1 13420.8 845.2 128.1478 2e-16 ***
x2 1 11658.7 10448.3 795.2 220.9377 2e-16 ***
x3 1 11045.4 11061.6 806.6 197.7096 2e-16 ***
x4 1 13.4 22093.6 944.9 0.1199 0.7295

I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux, 
with all components up to date. Thank you in advance for all your 
thoughts and replies.


Yours sincerely,
Thomas P C Chu

University of Leicester
LE1 7RH
UK


AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.




**
See what's new at http://www.aol.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] drop1() seems to give unexpected results compare to anova()

2008-08-01 Thread Thomas P C Chu
Interestingly, if I fitted the model using glm() rather than lm(), 
drop1() would behave as expected:


summary(model.glm - glm(y ~ ., data = sim.set, family = 'gaussian'))
summary(model.lm - lm(y ~ ., data = sim.set))
drop1(model.glm, test = 'F')
drop1(model.lm, test = 'F')
model.glm - step(model.glm, direction = 'both', test = 'F')
model.lm - step(model.lm, direction = 'both', test = 'F')
summary(model.glm)
summary(model.lm)

Although it is debatable whether one should use glm(family = 
'gaussian') rather than lm() for fitting models with normal 
distribution residuals. This raises the suspicion that there could be a 
bug in drop1() and step(), which I think uses add1() and drop1() 
repeatedly.



Thomas P C Chu

AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.




**
See what's new at http://www.aol.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] drop1() seems to give unexpected results compare to anova()

2008-08-01 Thread Jeroen Ooms


Thomas Chu wrote:
 
 Neither of those 3 lines of commands managed to drop x4 and its P value 
 magically decreased from 0.94 to almost 0! I am also baffled by how R 
 calculated those RSS. 
 
Maybe it is using a different type of SS. If i have a lm() model, and i do:
options(contrasts=c(contr.sum, contr.poly));
anovafit - drop1(model,attributes(model$terms)$term.labels,test=F);
then i get identical SS, F and p values as in SPSS. Maybe 
http://www.nabble.com/set-type-of-SS-in-anova()-to18287076.html#a18287076
this post  is helpfull. Also check out the post on 
http://myowelt.blogspot.com/ this blog  from 2008-05-24: Obtaining the same
ANOVA results in R as in SPSS - the difficulties with Type II and Type III
sums of squares .

-- 
View this message in context: 
http://www.nabble.com/drop1%28%29-seems-to-give-unexpected-results-compare-to-anova%28%29-tp18770635p1813.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.