[R] Variable selection in R

Yves Moisan Tue, 02 Oct 2007 11:30:55 -0700

Disclaimer: lacking local statistical expertise, I'm posting to this list
because I use R for variable selection in multiple linear regression, even
though my questions relate more to basic statistics than to R per se.  Please
redirect me to a more appropriate list if one exists.



I have the very common problem of identifying which variables (or which
subset of them) matter in a multiple linear regression.  After googling and
browsing around, I ended up with four methods that are easy to access through
R packages.  My aim was to apply them to continuous variables to find out
which ones affect my dependent variable and, more importantly, to identify
variables that could usefully be categorized.  Example: population is a
continuous variable, so I figured that if a variable selection algorithm
flagged population as "significant", I could then use a priori knowledge to
define population classes that I suspect behave differently, i.e. show
statistically different means in an ANOVA.  Coming back to my original
variable selection need, here is what I did.


First : regular lm
Second : step
Third : all-subsets (regsubsets, package leaps)
Fourth : lasso (l1ce, package lasso2)
Fifth : (I meant to use the lars package, but it does not allow for
formulas; I know I could cast my dataset as matrices, but I didn't find an
easy way of doing this and I figured I had enough options already)
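For the matrix conversion mentioned above, base R's model.matrix() can expand
a formula into a numeric design matrix.  The sketch below is only an
illustration: the built-in mtcars data and the formula are stand-ins (not the
dataset from this post), and the lars() call is guarded because the package
may not be installed.

```r
# Stand-in data and formula (assumptions, not the data discussed in this post)
form <- mpg ~ wt + hp + factor(cyl)

# model.frame() applies the formula and drops incomplete rows consistently
mf <- model.frame(form, data = mtcars)

# Design matrix without the intercept column; factors become dummy columns
X <- model.matrix(form, data = mf)[, -1, drop = FALSE]
y <- model.response(mf)

dim(X)
colnames(X)

# Only attempted if the lars package happens to be available
if (requireNamespace("lars", quietly = TRUE)) {
  fit <- lars::lars(X, y, type = "lasso")
}
```

Taking X and y from the same model frame keeps the rows of the two aligned
when some observations have missing values.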


I'm trying to make sense of the information returned by the summary calls.
I'm looking for which variables each method identifies, hoping to find
comparable or complementary results.  As an occasional user of statistics, I
find that many of the "Details" sections of the help files lack specific
instructions on exactly how to interpret the results.  Here we go.


*lm*

> anova(tonlm)
Analysis of Variance Table

Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
Tonnage               1    9720    9720  1.3497 0.2470437    
Popul                 1  112361  112361 15.6014 0.0001164 ***
DensiteOcc            1  173350  173350 24.0699 2.245e-06 ***
NbmRues.hab           1     280     280  0.0389 0.8438903    
RFU.hab               1     183     183  0.0254 0.8734816    
UO.MAMR.Precis        1   67161   67161  9.3254 0.0026428 ** 
t.CS.t.déchets.MAMR   1   24188   24188  3.3586 0.0686925 .  
Pct.Pot.CS.hab        1   78764   78764 10.9365 0.0011614 ** 
NbCentresTriDs100km   1  218725  218725 30.3702 1.380e-07 ***
DistMarche            1   54114   54114  7.5137 0.0068110 ** 
Residuals           162 1166717    7202


*ANOVA applied to step(lm)*

Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
Popul                 1    2917    2917  0.4038  0.525990    
DensiteOcc            1  163741  163741 22.6730 4.179e-06 ***
RFU.hab               1      83      83  0.0115  0.914899    
UO.MAMR.Precis        1  168407  168407 23.3191 3.112e-06 ***
Pct.Pot.CS.hab        1  122685  122685 16.9880 5.943e-05 ***
NbCentresTriDs100km   1  190659  190659 26.4002 7.761e-07 ***
DistMarche            1   65467   65467  9.0652  0.003015 ** 
Residuals           165 1191606    7222


Questions, compared with lm above: what tells me which variable was selected
first in the stepwise process?  Do I sort by Pr(>F), with the lowest value
corresponding to the first variable?  Popul has a three-star rating in lm and
none in step.  How do I interpret that?
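One point that may help when reading the two tables: anova() on an lm fit
reports sequential sums of squares, so each row tests a term given only the
terms listed above it, and the rows change when terms are dropped or
reordered.  The sketch below illustrates this on the built-in mtcars data (a
stand-in, not the data from this post) and shows that step() keeps a record of
its path in the $anova component of the returned model.

```r
# Same three predictors, entered in two different orders (stand-in data)
fit_a <- lm(mpg ~ wt + hp + disp, data = mtcars)
fit_b <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Sequential sum of squares for disp: entered last vs. entered first
ss_last  <- anova(fit_a)["disp", "Sum Sq"]
ss_first <- anova(fit_b)["disp", "Sum Sq"]
c(last = ss_last, first = ss_first)

# drop1() tests each term given all the others, so order does not matter
drop1(fit_a, test = "F")

# step() stores the sequence of additions/removals it made
sel  <- step(fit_a, direction = "backward", trace = 0)
path <- sel$anova
path
```

The order of rows in anova(step_model) follows the order of terms in the
formula, not the order in which step() considered them; the path is in
sel$anova.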

*all-subsets regression*

> summary(tonall)$cp

 [1] 37.338138 31.452375 25.300272 17.965950 13.751043 10.021910  8.455827
 [8]  8.810498  9.273599 11.000000
> summary(tonall)$adjr2

 [1] 0.2155976 0.2411378 0.2680048 0.2997661 0.3197652 0.3381030 0.3481410
 [8] 0.3506880 0.3528339 0.3499369

> summary(tonall)$which[1,][summary(tonall)$which[1,]]

(Intercept), DistMarche
> summary(tonall)$which[2,][summary(tonall)$which[2,]]

(Intercept), NbCentresTriDs100km, DistMarche 
> summary(tonall)$which[3,][summary(tonall)$which[3,]]

(Intercept), Tonnage,  UO.MAMR.Precis, NbCentresTriDs100km  
> summary(tonall)$which[4,][summary(tonall)$which[4,]]

(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km  
> summary(tonall)$which[5,][summary(tonall)$which[5,]]

(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km,
DistMarche


(The remaining rows, up to which[10,], are omitted.)


Questions with respect to lm and step: all-subsets says DistMarche is the
single most important variable.  That makes some sense, as that variable had a
two-star rating in both lm and step.  But shouldn't the variables with
three-star ratings in step also appear early in all-subsets?  For example, the
best 3-variable model shows Tonnage popping up, yet Tonnage has no rating in
lm and doesn't even appear in step.  Is it a matter of step being initialized
with a variable such that Tonnage will never be considered, whereas it is
considered in an exhaustive all-subsets regression?  I'm puzzled.
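To make the comparison concrete: an all-subsets search picks the best model of
each size independently, so the best k-variable model need not contain the
best (k-1)-variable model, whereas a stepwise search follows a single path.
The brute-force sketch below mimics the all-subsets idea in base R on the
built-in mtcars data with five arbitrarily chosen predictors (both are
stand-ins; this is not how leaps implements its search).

```r
# Candidate predictors (an arbitrary choice for illustration)
predictors <- c("wt", "hp", "disp", "qsec", "drat")

# For each model size k, fit every k-variable model and keep the one
# with the smallest residual sum of squares
best_by_size <- lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  rss <- vapply(combos, function(vars) {
    deviance(lm(reformulate(vars, response = "mpg"), data = mtcars))
  }, numeric(1))
  combos[[which.min(rss)]]
})

# One character vector of variable names per model size
best_by_size
```

Comparing consecutive elements of best_by_size shows whether the best models
happen to be nested for a given dataset; nothing forces them to be.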

*lasso*

> summary(tonlasso)

...

Coefficients:
                    Value    Std. Error Z score  Pr(>|Z|)
(Intercept)         234.5910  36.0120     6.5142   0.0000
Tonnage              -0.0155   0.0065    -2.3736   0.0176
Popul                -0.0001   0.0008    -0.1639   0.8698
DensiteOcc           -0.0719   0.0201    -3.5732   0.0004
NbmRues.hab          -0.0130   0.0110    -1.1820   0.2372
RFU.hab               0.0005   0.0002     2.6586   0.0078
UO.MAMR.Precis        0.0025   0.0013     1.9025   0.0571
t.CS.t.déchets.MAMR -25.9682  51.2912    -0.5063   0.6127
Pct.Pot.CS.hab      -69.0914  35.9212    -1.9234   0.0544
NbCentresTriDs100km  -5.9783   2.3811    -2.5108   0.0120
DistMarche            0.2210   0.0805     2.7461   0.0060


I read that the LASSO effectively performs variable selection by setting some
coefficients to 0.  With the figures above, can I treat coefficients (the
Value column) whose absolute value is below 0.1 or 0.01 as 0, which would
eliminate a number of variables?  Does sorting the coefficients by decreasing
absolute value give the order in which the LASSO selects the variables?
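One caution about thresholding raw coefficients: their size depends on the
units of each variable.  In the table above, RFU.hab has a Value of 0.0005 but
a Z score of 2.6586.  The sketch below illustrates the unit dependence with
ordinary least squares on the built-in mtcars data (a stand-in; it does not
reproduce l1ce or the data from this post): rescaling a predictor changes its
coefficient by the same factor and leaves its t statistic unchanged.

```r
# Fit with hp in its original units (stand-in data)
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Same model with hp expressed in thousands
mtcars_scaled <- transform(mtcars, hp = hp / 1000)
fit_scaled <- lm(mpg ~ wt + hp, data = mtcars_scaled)

# The coefficient changes by a factor of 1000 ...
coef_raw    <- unname(coef(fit_raw)["hp"])
coef_scaled <- unname(coef(fit_scaled)["hp"])
c(raw = coef_raw, scaled = coef_scaled)

# ... while the t statistic is the same in both fits
t_raw    <- summary(fit_raw)$coefficients["hp", "t value"]
t_scaled <- summary(fit_scaled)$coefficients["hp", "t value"]
c(raw = t_raw, scaled = t_scaled)
```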


Thank you for your patience and for pointers.


Yves Moisan
