Disclaimer: Lacking local statistical expertise, I'm posting to this list because I use R for variable selection in multiple linear regression, but my questions relate more to basic statistics than to R per se. Please redirect me to a more appropriate list if one exists.

I have the very common problem of identifying which variables (or subset of variables) are important in a multiple linear regression. After googling and browsing around, I ended up with four methods I could easily access through R packages. My aim was to apply those methods to continuous variables to find out which ones have an effect on my dependent variable, and beyond that to identify variables amenable to categorization. Example: population is a continuous variable, so I figured that if a variable selection algorithm flagged population as "significant", I could then use a priori knowledge to define classes of population ranges that I suspect behave differently, and check with ANOVA whether their means differ statistically.

Coming back to my original variable selection need, here's what I did:

First: regular lm
Second: step
Third: all-subsets (regsubsets, package leaps)
Fourth: lasso (l1ce, package lasso2)
Fifth: (I meant to use the lars package, but it does not accept formulas; I know I could convert my dataset to matrices, but I didn't find an easy way of doing this and I figured I had enough options already)

I'm trying to make sense of the information returned by the summary calls. I'm looking at which variables each method identifies, hoping to find comparable or complementary results. Being only an occasional user of statistics, I find that many of the "Details" sections of the help files lack specific instructions on exactly how to interpret the results. Here we go.

*lm*

> anova(tonlm)
Analysis of Variance Table

Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)
Tonnage               1    9720    9720  1.3497 0.2470437
Popul                 1  112361  112361 15.6014 0.0001164 ***
DensiteOcc            1  173350  173350 24.0699 2.245e-06 ***
NbmRues.hab           1     280     280  0.0389 0.8438903
RFU.hab               1     183     183  0.0254 0.8734816
UO.MAMR.Precis        1   67161   67161  9.3254 0.0026428 **
t.CS.t.déchets.MAMR   1   24188   24188  3.3586 0.0686925 .
Pct.Pot.CS.hab        1   78764   78764 10.9365 0.0011614 **
NbCentresTriDs100km   1  218725  218725 30.3702 1.380e-07 ***
DistMarche            1   54114   54114  7.5137 0.0068110 **
Residuals           162 1166717    7202

*ANOVA applied to step(lm)*

Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)
Popul                 1    2917    2917  0.4038  0.525990
DensiteOcc            1  163741  163741 22.6730 4.179e-06 ***
RFU.hab               1      83      83  0.0115  0.914899
UO.MAMR.Precis        1  168407  168407 23.3191 3.112e-06 ***
Pct.Pot.CS.hab        1  122685  122685 16.9880 5.943e-05 ***
NbCentresTriDs100km   1  190659  190659 26.4002 7.761e-07 ***
DistMarche            1   65467   65467  9.0652  0.003015 **
Residuals           165 1191606    7222

Questions, compared to lm above:

What tells me which variable was selected first in the stepwise process? Do I sort by Pr(>F), with the lowest value corresponding to the first variable selected?

Popul has a 3-star rating in lm and nothing in step. How do I interpret that?

*all-subsets regression*

> summary(tonall)$cp
[1] 37.338138 31.452375 25.300272 17.965950 13.751043 10.021910 8.455827 8.810498 9.273599 11.000000
> summary(tonall)$adjr2
[1] 0.2155976 0.2411378 0.2680048 0.2997661 0.3197652 0.3381030 0.3481410 0.3506880 0.3528339 0.3499369
> summary(tonall)$which[1,][summary(tonall)$which[1,]]
(Intercept), DistMarche
> summary(tonall)$which[2,][summary(tonall)$which[2,]]
(Intercept), NbCentresTriDs100km, DistMarche
> summary(tonall)$which[3,][summary(tonall)$which[3,]]
(Intercept), Tonnage, UO.MAMR.Precis, NbCentresTriDs100km
> summary(tonall)$which[4,][summary(tonall)$which[4,]]
(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km
> summary(tonall)$which[5,][summary(tonall)$which[5,]]
(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km, DistMarche

(omitting the remaining rows up to which[10,])

Questions, with respect to lm and step:

All-subsets says DistMarche is the single most important variable. That makes some sense, as that variable had a two-star rating in both lm and step. But shouldn't the variables with 3-star ratings in step be close to those picked by all-subsets? For example, the best 3-variable model shows Tonnage popping up. Tonnage has no rating in lm and doesn't even show up in step. Is it a matter of step being initialized with a variable such that Tonnage will never be considered, whereas it is considered in an exhaustive all-subsets regression? I'm puzzled.

*lasso*

> summary(tonlasso)
...
Coefficients:
                         Value Std. Error Z score Pr(>|Z|)
(Intercept)           234.5910    36.0120  6.5142   0.0000
Tonnage                -0.0155     0.0065 -2.3736   0.0176
Popul                  -0.0001     0.0008 -0.1639   0.8698
DensiteOcc             -0.0719     0.0201 -3.5732   0.0004
NbmRues.hab            -0.0130     0.0110 -1.1820   0.2372
RFU.hab                 0.0005     0.0002  2.6586   0.0078
UO.MAMR.Precis          0.0025     0.0013  1.9025   0.0571
t.CS.t.déchets.MAMR   -25.9682    51.2912 -0.5063   0.6127
Pct.Pot.CS.hab        -69.0914    35.9212 -1.9234   0.0544
NbCentresTriDs100km    -5.9783     2.3811 -2.5108   0.0120
DistMarche              0.2210     0.0805  2.7461   0.0060

I read that LASSO effectively performs variable selection by setting some coefficients to 0. With the figures above, can I treat Values below 0.1 or 0.01 as 0, which would eliminate a number of variates? Does sorting the coefficients by decreasing (absolute) value give the order in which LASSO selects the variables?

Thank you for your patience and for any pointers.

Yves Moisan

--
View this message in context: http://www.nabble.com/Variable-selection-in-R-tf4556775.html#a13004728
Sent from the R help mailing list archive at Nabble.com.

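
Note on the fifth option above (lars and formulas): one possible way to get from a formula to matrix input is to let base R build the design matrix with model.frame and model.matrix. The sketch below is only an illustration under assumptions: `dat` is a hypothetical data frame filled with simulated numbers (only the column names are taken from the output above), and the lars call is shown as a comment and not run.

```r
# Simulated stand-in data (ASSUMPTION): the real data set is not available here,
# so these values are random and carry no meaning.
set.seed(1)
dat <- data.frame(
  Cout.ton   = rnorm(20, mean = 200, sd = 50),
  Tonnage    = runif(20, 100, 5000),
  Popul      = runif(20, 500, 50000),
  DistMarche = runif(20, 5, 300)
)

# Same formula style as for lm(); "." stands for all other columns of dat.
f <- Cout.ton ~ .

# Build the model frame once, then extract response and predictors from it.
mf <- model.frame(f, data = dat)            # evaluates the formula, drops NA rows
y  <- model.response(mf)                    # numeric response vector
x  <- model.matrix(attr(mf, "terms"), mf)   # design matrix, includes an intercept column

# Matrix-based fitting functions usually add their own intercept, so drop ours.
x <- x[, colnames(x) != "(Intercept)", drop = FALSE]

dim(x)        # observations by predictors
colnames(x)   # predictor names, in formula order

# x and y could then be handed to a matrix interface, for example
# (package lars, not run here):
# fit <- lars::lars(x, y, type = "lasso")
```

Factors in the data frame would be expanded into dummy columns by model.matrix, which is one reason to go through the formula rather than calling as.matrix on the data frame directly.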
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.