Re: [Scikit-learn-general] To standardize is the question ...

2013-06-02 Thread Andreas Mueller
On 06/01/2013 11:43 PM, o m wrote:
> Andy, on reading your tip, and reflecting on what I do, I'm tempted to claim
> that standardization is very important, regardless ...
>
> Assume x0 is very important but has a tiny range (-1/100, 1/100)
I think the idea that something with a tiny range can be more "important"
than other variables with a larger range is itself a modelling assumption.
If you make that assumption, you should standardize. If all variables
"have the same scale", this shouldn't happen, I think.

--
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
Andy, on reading your tip, and reflecting on what I do, I'm tempted to claim
that standardization is very important, regardless ...

Assume x0 is very important but has a tiny range (-1/100, 1/100), all other
variables being significantly larger in range.
Lars/Lasso will leave x0 out until the end, because the associated parameter
estimate has to be large and is therefore heavily penalized. I'd therefore
conclude that x0 must not be very important.
Moreover, that conclusion would be reinforced if the combined effect of ten
other useless variables masked the effect/contribution of x0. If I
standardize everything, Lars/Lasso would put x0 in its place right from the
start.
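
To make the scenario concrete, here is a minimal sketch on synthetic data.
Everything in it is an assumption made for illustration (the sample size, the
coefficient 500, the noise level, and alpha=0.05); it only shows how the same
Lasso penalty treats x0 before and after scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples = 500

# x0 lives in (-1/100, 1/100); the ten other columns have a much larger range
x0 = rng.uniform(-0.01, 0.01, size=n_samples)
others = rng.uniform(-10.0, 10.0, size=(n_samples, 10))
X = np.column_stack([x0, others])

# Assumed ground truth: only x0 drives y, the ten other variables are useless
y = 500.0 * x0 + 0.1 * rng.normal(size=n_samples)

alpha = 0.05  # arbitrary penalty strength, chosen only for this illustration

# Same penalty, fitted once on the raw columns and once on standardized ones
raw_coef = Lasso(alpha=alpha).fit(X, y).coef_
X_std = StandardScaler().fit_transform(X)
std_coef = Lasso(alpha=alpha).fit(X_std, y).coef_

print("x0 coefficient, raw data:         ", raw_coef[0])
print("x0 coefficient, standardized data:", std_coef[0])
print("largest other coefficient (std):  ", np.abs(std_coef[1:]).max())
```

On the raw columns the penalty on the large coefficient that x0 would need
outweighs its contribution to the fit, so it is kept out of the model; after
standardization it enters with the largest coefficient.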

Is there a flaw?

Gael mentioned randomized-sparsity which I'm unfamiliar with, but would like
to investigate further.

Thanks.

Best Regards.


On 06/01/2013 07:51 PM, o m wrote:
> > > The main question is, what is your definition of an "important" variable?
> > >
> > > Gilles
> > That's a good question;-) Seriously.
> >
> > I would define it - with many closely related variables - as a member of a
> > set that gives you the best predictability.
> > LARS and LASSO with cross validation provide a good story along these lines.
> > But performing standardization can influence that.
> >
> > What do people typically do in these situations?
>
> The way I think about it is: do you believe that a priori all variables
> have the same importance? Then standardize.
> Do you believe that all variables share the same scale? Then don't
> standardize.
> This is basically true for all machine learning algorithms.
> For example, if your units are meters (or feet) does a change in the
> first variable by 1m have the same meaning
> as a change by 1m in the second? If so, you shouldn't standardize. If
> one variable only has small changes, these
> will be blown up compared to the others.

Hth,
Andy


Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Andreas Mueller
On 06/01/2013 07:51 PM, o m wrote:
> > The main question is, what is your definition of an "important" variable?
> >
> > Gilles
> That's a good question;-) Seriously.
>
> I would define it - with many closely related variables - as a member of a 
> set that gives you the best predictability.
> LARS and LASSO with cross validation provide a good story along these lines. 
> But performing  standardization can influence that.
>
> What do people typically do in these situations?
The way I think about it is: do you believe that a priori all variables
have the same importance? Then standardize.
Do you believe that all variables share the same scale? Then don't
standardize.
This is basically true for all machine learning algorithms.
For example, if your units are meters (or feet), does a change in the
first variable by 1m have the same meaning
as a change by 1m in the second? If so, you shouldn't standardize. If
one variable only has small changes, standardizing will blow these up
relative to the others.
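
A small sketch of that "blowing up" effect, assuming two made-up measurements
that are both in meters, one varying by about 10 m and the other by about
1 cm (the names a and b and all the numbers are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Two hypothetical measurements, both in meters
a = rng.normal(loc=50.0, scale=10.0, size=1000)  # varies by about 10 m
b = rng.normal(loc=50.0, scale=0.01, size=1000)  # varies by about 1 cm
X = np.column_stack([a, b])

X_std = StandardScaler().fit_transform(X)

print("std before:", X.std(axis=0))
print("std after: ", X_std.std(axis=0))

# Factor by which each column was multiplied: after scaling, a change of
# roughly 1 cm in b weighs about as much as a change of roughly 10 m in a
scale_factors = 1.0 / X.std(axis=0)
print("scale factors:", scale_factors)
```

Whether that rescaling is desirable is exactly the modelling question above:
it is fine if both variables are a priori equally important, and harmful if
1 m really means the same thing in both.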

Hth,
Andy



Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gael Varoquaux
Hi,

Unfortunately, statistics is not magic, and there are many situations in
which l1 recovery is not guaranteed to work.

I cannot give magic answers, and I suggest that you think a lot about how
you can validate any findings using external sources. That said, I would
suggest, in general, to standardize your variables, and to use
randomized-sparsity, rather than simple l1, for feature selection:
http://scikit-learn.org/stable/modules/feature_selection.html#randomized-sparse-models
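
For readers who want to see the idea rather than the estimator, here is a
rough hand-rolled sketch. It does not call scikit-learn's randomized sparse
model classes (which may not be available in the installed version); it only
approximates the principle: standardize, refit a Lasso many times on random
subsamples with randomly down-weighted features, and count how often each
feature is selected. The function name selection_frequencies, its parameters,
and the toy data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def selection_frequencies(X, y, alpha=0.1, n_resamples=100, scaling=0.5,
                          sample_fraction=0.75, seed=0):
    """Fraction of perturbed Lasso fits in which each feature is selected."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    X_std = StandardScaler().fit_transform(X)
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resamples):
        # Draw a random subsample of the rows, without replacement
        rows = rng.choice(n_samples, size=subsample_size, replace=False)
        # Randomly shrink about half of the columns; shrinking a column
        # amounts to penalizing that feature more strongly
        weights = np.where(rng.rand(n_features) < 0.5, scaling, 1.0)
        model = Lasso(alpha=alpha).fit(X_std[rows] * weights, y[rows])
        counts += (model.coef_ != 0)
    return counts / n_resamples


# Toy data (assumed): only the first two of ten features matter
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=200)

freq = selection_frequencies(X, y)
print(np.round(freq, 2))
```

Features that are selected in nearly every perturbed fit are the more
trustworthy candidates; features that come and go with the perturbation are
the ones a single l1 fit might have picked by accident.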

G

On Sat, Jun 01, 2013 at 08:22:41AM -0400, o m wrote:
> I've been playing around with Lasso and Lars, but there's something that
> bothers me about standardization.

> If I don't standardize to N(0, 1), these procedures indicate that a certain
> set of variables are the most important. Yet, if I standardize, I get a
> completely different set of variables. As expected, the lars or lasso plots
> from varying alpha look very different. I know there's a good reason for
> this, but then what's the right way to identify the important variables from
> a large set?

> I could take prediction quality on testing data, but there's a conflict if the
> important variables are so different under standardization.

> Any help or pointers is appreciated.

> Best Regards.




-- 
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone:  ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux



Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gilles Louppe
Hi,

The main question is, what is your definition of an "important" variable?

Gilles

On 1 June 2013 14:22, o m  wrote:
> I've been playing around with Lasso and Lars, but there's something that
> bothers me about standardization.
>
> If I don't standardize to N(0, 1), these procedures indicate that a certain
> set of variables are the most important. Yet, if I standardize, I get a
> completely different set of variables. As expected, the lars or lasso plots
> from varying alpha look very different. I know there's  a good reason for
> this, but then what's the right way to identify the important variables from
> a large set?
>
> I could take prediction quality on testing data, but there's a conflict if
> the important variables are so different under standardization.
>
> Any help or pointers is appreciated.
>
> Best Regards.
