I don’t know if this question properly belongs on this list, but I’ll ask it
here because I’ve been using R to run linear regression models, and it is only
in using R (after switching from using SPSS) that I have discovered the process
of fitting a linear model. However, after reading Crawley (2002), Fox (2002),
Verzani (2004), Dalgaard (2002) and of course searching the R-help archives, I
cannot find an answer to my question.
I have 5 explanatory variables (NR, NS, PA, KINDERWR, WM) and one
response variable (G1L1WR). A simple main effects model finds that only PA is
statistically significant, and an anova comparison between the 5-variable main
effects model and the 1-variable (PA only) model finds no significant difference
between them. So it is possible to simplify the model to just G1L1WR ~ PA. This
leaves me with a residual standard error of 0.3026 on 35 degrees of freedom and
an adjusted R-squared of 0.552.
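
The comparison described above was along the lines of the sketch below. The data frame `sim` is a simulated placeholder (its name, the seed, and the generating coefficients are made up for illustration and are not the real lafrance.NoNA values), so its output will not reproduce the figures I quote.

```r
# Simulated stand-in for the real data: 37 rows, same variable names.
# The coefficients used to generate G1L1WR are arbitrary assumptions.
set.seed(1)
n <- 37
sim <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
sim$G1L1WR <- 1 + 0.5 * sim$PA + rnorm(n, sd = 0.3)

# Five-variable main effects model and the PA-only model
full.main <- lm(G1L1WR ~ NR + NS + PA + KINDERWR + WM, data = sim)
pa.only   <- lm(G1L1WR ~ PA, data = sim)

summary(full.main)                 # t-tests for each main effect

# F-test: do the four extra terms improve on the PA-only model?
cmp <- anova(pa.only, full.main)
print(cmp)

summary(pa.only)$sigma             # residual standard error
summary(pa.only)$adj.r.squared     # adjusted R-squared
```

With 37 rows the PA-only model has 35 residual degrees of freedom, which is the only figure here chosen to match my real data.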
I also decided, following Crawley’s (2002) advice, to create a maximal
model, G1L1WR ~ NR*NS*PA*KINDERWR*WM. This full model is not a good fit, but
running stepAIC on it produced the following model, which I will call the
maximal fit model:
maximal.fit = lm(formula = G1L1WR ~ NR + KINDERWR + NS + WM + PA + NR:KINDERWR +
    NR:NS + KINDERWR:NS + NR:WM + KINDERWR:WM + NS:WM + NR:PA + KINDERWR:PA +
    NS:PA + WM:PA + NR:KINDERWR:NS + NR:KINDERWR:WM + NR:NS:WM + KINDERWR:NS:WM +
    NR:NS:PA + KINDERWR:NS:PA + KINDERWR:WM:PA + NR:KINDERWR:NS:WM,
    data = lafrance.NoNA)
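
For anyone who wants to see the procedure rather than just its result, a minimal sketch of the selection step follows. It again uses simulated placeholder data (the name `sim` and the generating coefficients are assumptions, not the real lafrance.NoNA), and it assumes plain backward selection from the full model, so the terms it keeps will differ from the formula above.

```r
library(MASS)  # provides stepAIC()

# Simulated stand-in data (assumption; not the real lafrance.NoNA)
set.seed(2)
n <- 37
sim <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
sim$G1L1WR <- 1 + 0.5 * sim$PA + 0.3 * sim$NS * sim$KINDERWR +
  rnorm(n, sd = 0.3)

# Maximal model: all main effects and interactions up to the 5-way term
maximal <- lm(G1L1WR ~ NR * NS * PA * KINDERWR * WM, data = sim)

# Backward selection by AIC, starting from the maximal model
reduced <- stepAIC(maximal, direction = "backward", trace = FALSE)

formula(reduced)        # terms that survived the selection
AIC(maximal, reduced)   # AIC before and after
```

Note that the maximal model estimates 32 coefficients from 37 observations, which leaves very few residual degrees of freedom.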
All of the terms of this model have statistically significant t-tests, the
residual standard error has gone down to 0.2102, and the adjusted R-squared has
increased to 0.7839. An
anova shows a clear difference between the simplified model and the maximal fit
model. My question is, should I pick the maximal fit over the simple
model when it is so much harder to understand? I suspect there is no
easy answer to that, but if so, then my question is: would there be
anything wrong with saying that sometimes you might value parsimony and ease
of understanding over best fit? I don’t really know what the maximal
fit model buys you. It seems unintelligible to me. All of the terms are
involved in interactions to some extent, but there are 4-way, 3-way and
2-way interactions, and I’m not even sure how to interpret them. A nice
tree model showed that at higher levels of PA, KINDERWR and NS affected
scores. That I can understand, but it is not reflected in this model.
An auxiliary question, probably easier to answer: how can I do a
hierarchical linear regression in R? The authors knew from previous research
that PA would be the largest contributor to the response variable, and their
research question was whether PA would contribute anything AFTER the other 4
variables had already eaten their piece of the response variable pie. I know
how to do a hierarchical regression in SPSS, and want to show in parallel how
to do this in R. I searched the R-help archives and did not find anything
that plainly explains how to do hierarchical linear regression.
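
My best guess at what this would look like in R is the sketch below: fit the four other variables first, add PA as a second step, and compare the two nested models. I am not sure this is the accepted way to do it, and the data frame `sim` is again a simulated placeholder with made-up coefficients, not the real lafrance.NoNA.

```r
# Simulated stand-in data (assumption; not the real lafrance.NoNA)
set.seed(3)
n <- 37
sim <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
sim$G1L1WR <- 1 + 0.5 * sim$PA + 0.2 * sim$KINDERWR + rnorm(n, sd = 0.3)

# Step 1: the four other predictors entered first
step1 <- lm(G1L1WR ~ NR + NS + KINDERWR + WM, data = sim)

# Step 2: add PA to the step 1 model
step2 <- update(step1, . ~ . + PA)

# Change in R-squared when PA is added
r2.change <- summary(step2)$r.squared - summary(step1)$r.squared
r2.change

# F-test: does PA add anything beyond the other four variables?
anova(step1, step2)
```

If this is right, the R-squared change and the F-test from anova() would correspond to what SPSS reports for the second block.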
Thanks in advance for any help.
Dr. Jenifer Larson-Hall
Assistant Professor of Linguistics
University of North Texas