Re: [Scikit-learn-general] To standardize is the question ...
On 06/01/2013 11:43 PM, o m wrote:
> Andy, on reading your tip, and reflecting on what I do, I'm tempted to
> claim that standardization is very important, regardless ...
>
> Assume x0 is very important but has a tiny range (-1/100, 1/100)

I think the claim that something with a tiny range can be more "important" than other variables with a larger range is a modelling assumption. If that is what you believe, you should standardize. If all variables "have the same scale", this shouldn't happen, I think.

--
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
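One way to spell out that modelling assumption is the short derivation below. It is only a sketch: it assumes centered data and the usual lasso objective, and the symbols $s_j$ (standard deviation of variable $x_j$), $z_j$, $\beta_j$, $\gamma_j$ and $\alpha$ are introduced here for illustration and do not come from the thread.

```latex
\begin{aligned}
z_j &= \frac{x_j - \bar{x}_j}{s_j}, \qquad \gamma_j = \frac{\beta_j}{s_j}, \\
\min_{\beta}\; \frac{1}{2n}\Bigl\| y - \sum_j z_j \beta_j \Bigr\|_2^2 + \alpha \sum_j |\beta_j|
&\;=\; \min_{\gamma}\; \frac{1}{2n}\Bigl\| y - \sum_j (x_j - \bar{x}_j)\,\gamma_j \Bigr\|_2^2 + \alpha \sum_j s_j\,|\gamma_j| .
\end{aligned}
```

Read this way, fitting the lasso on standardized variables charges the raw coefficient of $x_j$ a penalty proportional to $s_j$, so a variable with a tiny range is penalized lightly. Without standardization every raw coefficient pays the same price, whatever the range of its variable.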
Re: [Scikit-learn-general] To standardize is the question ...
Andy, on reading your tip, and reflecting on what I do, I'm tempted to claim that standardization is very important, regardless ...

Assume x0 is very important but has a tiny range (-1/100, 1/100) - all other variables being significantly larger in range. Lars/Lasso will leave x0 out until the end, because the parameter estimate it needs is large and is therefore heavily penalized. I'd therefore conclude that x0 must not be very important. Moreover, that conclusion would be reinforced if the combined effect of ten other useless variables masked the contribution of x0.

If I standardize everything, Lars/Lasso would put x0 in its place right from the start.

Is there a flaw?

Gael mentioned randomized-sparsity, which I'm unfamiliar with but would like to investigate further.

Thanks.
Best Regards.

On 06/01/2013 07:51 PM, o m wrote:
> > > The main question is, what is your definition of an "important" variable?
> > >
> > > Gilles
> >
> > That's a good question;-) Seriously.
> >
> > I would define it - with many closely related variables - as a member of a
> > set that gives you the best predictability.
> > LARS and LASSO with cross validation provide a good story along these lines.
> > But performing standardization can influence that.
> >
> > What do people typically do in these situations?
>
> The way I think about it is: do you believe that a priori all variables
> have the same importance? Then standardize.
> Do you believe that all variables share the same scale? Then don't
> standardize.
> This is basically true for all machine learning algorithms.
> For example, if your units are meters (or feet), does a change in the
> first variable by 1m have the same meaning
> as a change by 1m in the second? If so, you shouldn't standardize: if
> one variable only has small changes, standardizing will blow these
> up compared to the others.
>
> Hth, Andy
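The x0 scenario can be checked on synthetic data. The sketch below relies on one property of Lars/Lasso: the first variable to enter the path is the one whose centered column has the largest absolute inner product with the centered response. Everything else is an assumption made for illustration: the data, the coefficient 500, the noise level, and the helper names `centered`, `standardized` and `first_to_enter`. It uses only the standard library and does not call scikit-learn.

```python
import random
import statistics


def centered(values):
    """Subtract the mean from every entry."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def standardized(values):
    """Center, then divide by the (population) standard deviation."""
    std = statistics.pstdev(values)
    return [v / std for v in centered(values)]


def first_to_enter(columns, y):
    """Index of the column with the largest |x_j . y|, the first Lars/Lasso pick."""
    y_c = centered(y)
    scores = [abs(sum(a * b for a, b in zip(col, y_c))) for col in columns]
    return scores.index(max(scores))


rng = random.Random(0)
n = 200
# x0 drives the response but lives in (-1/100, 1/100).
x0 = [rng.uniform(-0.01, 0.01) for _ in range(n)]
# Ten useless variables with a much larger range.
others = [[rng.uniform(-10, 10) for _ in range(n)] for _ in range(10)]
y = [500 * v + rng.gauss(0, 0.5) for v in x0]

raw = [centered(c) for c in [x0] + others]
scaled = [standardized(c) for c in [x0] + others]

first_raw = first_to_enter(raw, y)
first_scaled = first_to_enter(scaled, y)
print("first variable without standardization:", first_raw)
print("first variable with standardization:", first_scaled)
```

On data like this, a large-range noise variable is picked first when nothing is standardized, while x0 (index 0) is picked first after standardization. This shows only how scale drives the entry order; whether x0 really is "important" remains the modelling question discussed in the thread.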
Re: [Scikit-learn-general] To standardize is the question ...
On 06/01/2013 07:51 PM, o m wrote:
> > The main question is, what is your definition of an "important" variable?
> >
> > Gilles
>
> That's a good question;-) Seriously.
>
> I would define it - with many closely related variables - as a member of a
> set that gives you the best predictability.
> LARS and LASSO with cross validation provide a good story along these lines.
> But performing standardization can influence that.
>
> What do people typically do in these situations?

The way I think about it is: do you believe that a priori all variables have the same importance? Then standardize.
Do you believe that all variables share the same scale? Then don't standardize.
This is basically true for all machine learning algorithms.

For example, if your units are meters (or feet), does a change in the first variable by 1m have the same meaning as a change by 1m in the second? If so, you shouldn't standardize: if one variable only has small changes, standardizing will blow these up compared to the others.

Hth,
Andy
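To make the "blown up" remark concrete, here is a minimal sketch with two made-up variables, both measured in meters. The numbers and the helper `standardized_step` are hypothetical. It computes how large a 1 m change looks once each variable is divided by its own standard deviation.

```python
import statistics

# Hypothetical measurements, both in meters.
first = [10.00, 10.01, 9.99, 10.02, 9.98]  # varies by centimeters
second = [0.0, 5.0, 10.0, 15.0, 20.0]      # varies by meters


def standardized_step(values, step=1.0):
    """Size of a `step` change in raw units after dividing by the standard deviation."""
    return step / statistics.pstdev(values)


step_first = standardized_step(first)
step_second = standardized_step(second)
print(f"1 m in the first variable  -> {step_first:.1f} standard deviations")
print(f"1 m in the second variable -> {step_second:.2f} standard deviations")
```

After standardization, the same physical change of 1 m counts several hundred times more in the first variable than in the second. That is the right behaviour only if you do not believe the two variables share a meaningful common scale.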
Re: [Scikit-learn-general] To standardize is the question ...
Hi,

Unfortunately, statistics is not magic, and there are many situations in which l1 recovery is not guaranteed to work. I cannot give magic answers, and I suggest that you think a lot about how you can validate any findings using external sources.

That said, I would suggest, in general, to standardize your variables, and to use randomized-sparsity, rather than simple l1, for feature selection:
http://scikit-learn.org/stable/modules/feature_selection.html#randomized-sparse-models

G

On Sat, Jun 01, 2013 at 08:22:41AM -0400, o m wrote:
> I've been playing around with Lasso and Lars, but there's something that
> bothers me about standardization.
> If I don't standardize to N(0, 1), these procedures indicate that a certain
> set of variables are the most important. Yet, if I standardize, I get a
> completely different set of variables. As expected, the lars or lasso plots
> from varying alpha look very different. I know there's a good reason for
> this, but then what's the right way to identify the important variables
> from a large set?
> I could take prediction quality on testing data, but there's a conflict if the
> important variables are so different under standardization.
> Any help or pointers are appreciated.
> Best Regards.

--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux
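The idea behind randomized sparsity can be sketched as follows: refit the lasso many times on random subsamples of the rows, with a randomly perturbed penalty per variable, and keep the variables that are selected in most of the fits. The code below is a hand-rolled, standard-library illustration of that idea, not scikit-learn's implementation. The functions `lasso_cd`, `standardize` and `selection_frequencies`, the synthetic data, and every parameter value (`alpha=0.8`, `scaling=0.5`, 40 rounds) are assumptions chosen for the example.

```python
import random
import statistics


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(columns, y, alpha, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + alpha * sum_j |b_j|."""
    n = len(y)
    coef = [0.0] * len(columns)
    residual = list(y)
    col_sq = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual that excludes its own fit.
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq[j] * coef[j]
            new = soft_threshold(rho, alpha) / col_sq[j]
            delta = new - coef[j]
            if delta:
                residual = [r - delta * c for r, c in zip(residual, col)]
                coef[j] = new
    return coef


def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    mean = statistics.fmean(col)
    std = statistics.pstdev(col)
    return [(v - mean) / std for v in col]


def selection_frequencies(columns, y, alpha, n_rounds=40, scaling=0.5, seed=0):
    """Fraction of randomized lasso fits in which each variable is selected."""
    rng = random.Random(seed)
    n = len(y)
    counts = [0] * len(columns)
    for _ in range(n_rounds):
        rows = rng.sample(range(n), n // 2)  # subsample half of the rows
        y_sub = [y[i] for i in rows]
        y_mean = statistics.fmean(y_sub)
        y_sub = [v - y_mean for v in y_sub]
        cols_sub = []
        for col in columns:
            z = standardize([col[i] for i in rows])
            # Shrinking a column by a random weight raises its effective penalty.
            weight = rng.uniform(scaling, 1.0)
            cols_sub.append([weight * v for v in z])
        coef = lasso_cd(cols_sub, y_sub, alpha)
        for j, b in enumerate(coef):
            if b != 0.0:
                counts[j] += 1
    return [c / n_rounds for c in counts]


rng = random.Random(42)
n = 200
x0 = [rng.uniform(-0.01, 0.01) for _ in range(n)]  # relevant, tiny range
x1 = [rng.uniform(-10, 10) for _ in range(n)]      # relevant, large range
noise_cols = [[rng.uniform(-10, 10) for _ in range(n)] for _ in range(6)]
y = [520 * a + 0.52 * b + rng.gauss(0, 0.5) for a, b in zip(x0, x1)]

freqs = selection_frequencies([x0, x1] + noise_cols, y, alpha=0.8)
print([round(f, 2) for f in freqs])
```

Variables with a selection frequency close to 1 are the stable candidates, and those near 0 are rarely picked. Each subsample is standardized before fitting, in line with the advice above. The frequency threshold and the penalty still have to be chosen and validated on your own data.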
Re: [Scikit-learn-general] To standardize is the question ...
Hi,

The main question is, what is your definition of an "important" variable?

Gilles

On 1 June 2013 14:22, o m wrote:
> I've been playing around with Lasso and Lars, but there's something that
> bothers me about standardization.
>
> If I don't standardize to N(0, 1), these procedures indicate that a certain
> set of variables are the most important. Yet, if I standardize, I get a
> completely different set of variables. As expected, the lars or lasso plots
> from varying alpha look very different. I know there's a good reason for
> this, but then what's the right way to identify the important variables from
> a large set?
>
> I could take prediction quality on testing data, but there's a conflict if
> the important variables are so different under standardization.
>
> Any help or pointers are appreciated.
>
> Best Regards.