[R] summary vs anova

2011-12-19 Thread Brent Pedersen
Hi, I'm sure this is simple, but I haven't been able to find it in TFM.
Say I have some data in R like this (pasted here:
http://pastebin.com/raw.php?i=sjS9Zkup):

   head(df)
    gender age smokes disease    Y
  1 female  65   ever control 0.18
  2 female  77  never control 0.12
  3   male  40         state1 0.11
  4 female  67   ever control 0.20
  5   male  63   ever  state1 0.16
  6 female  26  never  state1 0.13

where unique(disease) == c("control", "state1", "state2")
and unique(smokes) == c("ever", "never", "", "current")

I then fit a linear model like:

 model = lm(Y ~ smokes + disease + age + gender, data=df)

And I want to understand the difference between:

 print(summary(model))
Call:
lm(formula = Y ~ smokes + disease + age + gender, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.22311 -0.08108 -0.03483  0.05604  0.46507

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.1206825  0.0521368   2.315   0.0211 *
smokescurrent  0.0150641  0.066       0.339   0.7348
smokesever     0.0498764  0.0326254   1.529   0.1271
smokesnever    0.0394109  0.0349142   1.129   0.2597
diseasestate1  0.0018739  0.0176817   0.106   0.9157
diseasestate2 -0.0009858  0.0178651  -0.055   0.9560
age            0.0002841  0.0006290   0.452   0.6518
gendermale     0.1164889  0.0128748   9.048   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1257 on 397 degrees of freedom
Multiple R-squared: 0.1933, Adjusted R-squared: 0.1791
F-statistic: 13.59 on 7 and 397 DF,  p-value: 8.975e-16


and:

   anova(model)
  Analysis of Variance Table

  Response: Y
             Df Sum Sq Mean Sq F value  Pr(>F)
  smokes      3 0.1536 0.05120  3.2397 0.02215 *
  disease     2 0.0129 0.00647  0.4096 0.66420
  age         1 0.0431 0.04310  2.7270 0.09946 .
  gender      1 1.2937 1.29373 81.8634  <2e-16 ***
  Residuals 397 6.2740 0.01580
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I understand (hopefully correctly) that anova() tests each covariate by adding
it to the model in the order it is specified in the formula.
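
To make that reading concrete, here is a minimal sketch on simulated data
(the data frame `sim`, its values, and the object names are invented for
illustration; they are not the data above). If the sequential interpretation
is right, a row of anova() should equal the drop in residual sum of squares
between two nested fits, and only the last term should agree with a test
that adjusts for everything else:

```r
# Simulated stand-in data; `sim` is hypothetical, not the poster's df.
set.seed(1)
n <- 200
sim <- data.frame(
  smokes  = factor(sample(c("current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  gender  = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$Y <- 0.1 + 0.1 * (sim$gender == "male") + rnorm(n, sd = 0.1)

full    <- lm(Y ~ smokes + disease + age + gender, data = sim)
seq_tab <- anova(full)            # sequential table, in formula order
print(seq_tab)

# The "disease" row: drop in residual sum of squares when disease is added
# to a model that already holds smokes (but not yet age or gender).
m1 <- lm(Y ~ smokes, data = sim)
m2 <- lm(Y ~ smokes + disease, data = sim)
ss_disease <- deviance(m1) - deviance(m2)
print(all.equal(ss_disease, seq_tab["disease", "Sum Sq"]))

# Reordering the formula gives a different sequential table ...
reord_tab <- anova(lm(Y ~ gender + age + disease + smokes, data = sim))
print(reord_tab)

# ... while drop1() tests each term given all of the others.
marg_tab <- drop1(full, test = "F")
print(marg_tab)
```

In this sketch gender is last in the formula, so its row in the sequential
table and its row in the drop1() table carry the same F value; the earlier
terms generally differ between the two.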

More specific questions are:

1) How do the p-values for smokes* in summary(model) relate to the
   Pr(>F) for smokes in anova()?
2) What do the p-values for each of those smokes* terms mean exactly?
3) The summary above shows the values for diseasestate1 and diseasestate2.
   How can I get the p-value for diseasecontrol (or, e.g., genderfemale)?
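
On question 3, a hedged sketch of how the reference level behaves under R's
default treatment contrasts: the first level of a factor gets no coefficient
of its own because it is absorbed into the intercept, and the remaining
coefficients are differences from it. The data frame `sim` and the column
`disease2` are made up for illustration:

```r
# Simulated stand-in data; `sim` and `disease2` are hypothetical names.
set.seed(2)
n <- 150
sim <- data.frame(
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE)),
  gender  = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$Y <- 0.15 + 0.1 * (sim$gender == "male") + rnorm(n, sd = 0.1)

fit <- lm(Y ~ disease + gender, data = sim)
print(levels(sim$disease)[1])        # reference level of disease
print(rownames(coef(summary(fit))))  # no row for the reference level

# Same model with state1 as the reference level instead.
sim$disease2 <- relevel(sim$disease, ref = "state1")
fit2 <- lm(Y ~ disease2 + gender, data = sim)
print(rownames(coef(summary(fit2))))

# The fit is unchanged; only the parameterisation differs.
print(all.equal(fitted(fit), fitted(fit2)))

# state1-vs-control is the same contrast with the sign flipped,
# so its p-value is identical in both fits.
p1 <- coef(summary(fit))["diseasestate1", "Pr(>|t|)"]
p2 <- coef(summary(fit2))["disease2control", "Pr(>|t|)"]
print(c(p1, p2))
```

Under this parameterisation there is no separate test for the reference
level itself; each listed coefficient is a comparison against it.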

thanks.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] repeated measures setup

2011-03-28 Thread Brent Pedersen
Hi, I have some data, a subset of which is pasted at the end of this message.
I am trying to understand how to do repeated measures, as our study design
consists of a subject and up to 2 siblings.

Thus far, my model looks like this, with family_id indicating a
sibling relationship:

 formula = y ~ concordant + age.proband + age.other + sex.proband + sex.other +
           Error(family_id)

I have seen a lot of resources where this is specified as
Error(family_id/(all_other_variables)).
Should that be the case here, or is the above formulation sufficient
to capture the repeated measures by family?
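
For reference, a minimal sketch of how a formula with Error(family_id) could
be passed to aov(), using simulated data (the data frame `sibs`, its values,
and the reduced set of covariates are assumptions for illustration, not the
study data). It only shows the mechanics of the simpler form and does not
settle which error structure is right for this design:

```r
# Simulated stand-in: 30 families, 2 proband-sibling rows each.
# `sibs` and all values in it are hypothetical.
set.seed(3)
n_fam <- 30
sibs <- data.frame(
  family_id  = factor(rep(seq_len(n_fam), each = 2)),
  concordant = factor(sample(c("T", "F"), 2 * n_fam, replace = TRUE)),
  age.other  = round(runif(2 * n_fam, 5, 20))
)
# Shared family effect plus row-level noise.
fam_effect <- rnorm(n_fam, sd = 1)
sibs$y <- fam_effect[as.integer(sibs$family_id)] + rnorm(2 * n_fam, sd = 0.5)

# Error(family_id) asks aov() to split the analysis into a
# between-family stratum and a within-family stratum.
fit <- aov(y ~ concordant + age.other + Error(family_id), data = sibs)
print(names(fit))   # one component per error stratum
print(summary(fit))
```

Note that family_id is a factor here; a numeric id would need converting
before being used in the Error() term.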

Also, there seem to be a number of resources for doing this type of
analysis. Is there a particular package that has more traction that I should
look into?

thanks,
-brent



concordant  family_id  external_ref.proband  external_ref.other  sex.proband  sex.other  age.proband  age.other  y(fake)
T           58         80015550              80015543            M            F          15           19         1
F           58         80015550              80016946            M            F          15           8          2
T           54         80015499              80015338            F            F          5            7          3
F           54         80015499              80013112            F            M          5            13         4
F           22         80012269              80012252            F            F          12           10         5
F           22         80012269              80018691            F            M          12           8          5
