Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-19 Thread Lars Buitinck
2014-03-19 21:40 GMT+01:00 Satrajit Ghosh :
> this would mean that any tree-based model could generate differences based
> on preprocessing differences right?

Yes. I'm not sure why the threshold is there, but it's probably to
prevent generating too many splits in the face of noisy input. A
cleaner solution would generate the threshold from the range of the
feature, I guess (but I don't have the code in front of me right now).
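
For illustration only, a range-relative variant of that check could look like
this in plain numpy (the helper name, the 1e-7 constant and the relative
tolerance are all assumptions for the sketch, not the actual splitter code):

    import numpy as np

    def n_usable_splits(x, rel_tol=1e-7):
        # Count gaps between consecutive sorted values that exceed a
        # tolerance scaled to the feature's own range (hypothetical helper,
        # not what the scikit-learn splitter actually does).
        x = np.sort(np.asarray(x, dtype=float))
        span = x[-1] - x[0]
        if span == 0.0:
            return 0
        return int(np.sum(np.diff(x) >= rel_tol * span))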



Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-19 Thread Satrajit Ghosh
thanks lars.

this would mean that any tree-based model could generate differences based
on preprocessing differences right?

cheers,

satra

On Sun, Mar 16, 2014 at 3:37 PM, Olivier Grisel wrote:

> 2014-03-16 0:23 GMT+01:00 Lars Buitinck :
> > 2014-03-15 21:53 GMT+01:00 Satrajit Ghosh :
> >> in many cases with fat data (small samples<50 x many features>10) i
> have
> >> found that standardizing helps quite a bit in case of extra trees. i
> still
> >> don't have a good understanding as to why this is the case. it could
> simply
> >> be small sample bias that i am seeing. but extra trees are also
> supposed to
> >> be resilient to overfitting.
> >>
> >> any thoughts?
> >
> > IIRC the scikit-learn tree learner discards a candidate split when the
> > difference between samples along the feature under consideration is
> > less than 1e-7. When you standardize, the learner might see a
> > different set of potential splits, and in particular, features with
> > extremely small variance wouldn't yield a split at all without
> > standardizing.
> >
> > I think you should be able to check the number of candidate splits
> > for a feature j with
> >
> > np.sum(np.diff(np.sort(X[:, j])) >= 1e-7)
>
> That might be a better explanation, as standardizing is a linear
> transform (actually affine, but the intercept disappears in the
> subtraction) and so should not impact the uniform sampling outcome.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>


Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-16 Thread Olivier Grisel
2014-03-16 0:23 GMT+01:00 Lars Buitinck :
> 2014-03-15 21:53 GMT+01:00 Satrajit Ghosh :
>> in many cases with fat data (small samples<50 x many features>10) i have
>> found that standardizing helps quite a bit in case of extra trees. i still
>> don't have a good understanding as to why this is the case. it could simply
>> be small sample bias that i am seeing. but extra trees are also supposed to
>> be resilient to overfitting.
>>
>> any thoughts?
>
> IIRC the scikit-learn tree learner discards a candidate split when the
> difference between samples along the feature under consideration is
> less than 1e-7. When you standardize, the learner might see a
> different set of potential splits, and in particular, features with
> extremely small variance wouldn't yield a split at all without
> standardizing.
>
> I think you should be able to check the number of candidate splits
> for a feature j with
>
> np.sum(np.diff(np.sort(X[:, j])) >= 1e-7)

That might be a better explanation, as standardizing is a linear
transform (actually affine, but the intercept disappears in the
subtraction) and so should not impact the uniform sampling outcome.
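
A quick numpy sanity check of that point (illustrative only, made-up numbers):
a positive affine rescaling maps a threshold drawn uniformly in [min, max] to a
threshold that induces exactly the same partition of the samples.

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.rand(20) * 5.0            # one feature column
    a, b = 3.0, -2.0                  # positive-scale affine map, e.g. standardization
    x_scaled = a * x + b

    # draw a split threshold uniformly in [min, max], as Extra-Trees does,
    # and push it through the same affine transform
    t = rng.uniform(x.min(), x.max())
    t_scaled = a * t + b

    # both thresholds split the samples identically
    assert np.array_equal(x <= t, x_scaled <= t_scaled)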

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel



Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Lars Buitinck
2014-03-15 21:53 GMT+01:00 Satrajit Ghosh :
> in many cases with fat data (small samples<50 x many features>10) i have
> found that standardizing helps quite a bit in case of extra trees. i still
> don't have a good understanding as to why this is the case. it could simply
> be small sample bias that i am seeing. but extra trees are also supposed to
> be resilient to overfitting.
>
> any thoughts?

IIRC the scikit-learn tree learner discards a candidate split when the
difference between samples along the feature under consideration is
less than 1e-7. When you standardize, the learner might see a
different set of potential splits, and in particular, features with
extremely small variance wouldn't yield a split at all without
standardizing.

I think you should be able to check the number of candidate splits
for a feature j with

np.sum(np.diff(np.sort(X[:, j])) >= 1e-7)
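
For example, wrapped up as a self-contained check (X, the feature index and the
1e-7 constant are just placeholders mirroring the snippet above, and the helper
name is made up):

    import numpy as np

    def n_candidate_splits(X, j, tol=1e-7):
        # number of adjacent gaps along feature j that exceed `tol`,
        # i.e. a rough proxy for how many split points the tree can see
        col = np.sort(X[:, j])
        return int(np.sum(np.diff(col) >= tol))

    X = np.random.RandomState(0).rand(50, 3)
    X[:, 2] *= 1e-9                   # a nearly constant feature
    print([n_candidate_splits(X, j) for j in range(X.shape[1])])
    # the third feature yields no usable splits before standardizing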



Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Satrajit Ghosh
thanks gilles,

that makes sense. i haven't checked random forest classification on these
data. i'll check that as well.

cheers,

satra


On Sat, Mar 15, 2014 at 5:51 PM, Gilles Louppe  wrote:

> Hi Satra,
>
> In case of Extra-Trees, changing the scale of features might change
> the result when the transform you apply distorts the original feature
> space. Drawing a threshold uniformly at random in the original
> [min;max] interval won't be equivalent to drawing a threshold in
> [f(min);f(max)] if f is non-linear. In the case of Random Forests
> though, this won't change anything.
>
> Hope this helps,
> Gilles
>
> On 15 March 2014 21:53, Satrajit Ghosh  wrote:
> > hi olivier,
> >
> > just a question on this statement:
> >
> >> Random Forests (and decision tree-based models in general) are scale-independent.
> >
> >
> > in many cases with fat data (small samples<50 x many features>10) i
> have
> > found that standardizing helps quite a bit in case of extra trees. i
> still
> > don't have a good understanding as to why this is the case. it could
> simply
> > be small sample bias that i am seeing. but extra trees are also supposed
> to
> > be resilient to overfitting.
> >
> > any thoughts?
> >
> > cheers,
> >
> > satra
> >
> >
> >
> >


Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Gilles Louppe
Hi Satra,

In case of Extra-Trees, changing the scale of features might change
the result when the transform you apply distorts the original feature
space. Drawing a threshold uniformly at random in the original
[min;max] interval won't be equivalent to drawing a threshold in
[f(min);f(max)] if f is non-linear. In the case of Random Forests
though, this won't change anything.
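
A tiny numpy illustration of that point (log stands in for an arbitrary
monotonic, non-linear f; all the numbers are made up):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(1.0, 100.0, size=1000)        # one feature column
    f = np.log                                     # non-linear, monotonic rescaling

    # thresholds drawn uniformly in the original [min, max] ...
    t_orig = rng.uniform(x.min(), x.max(), size=10000)
    # ... versus thresholds drawn uniformly in [f(min), f(max)]
    t_trans = rng.uniform(f(x).min(), f(x).max(), size=10000)

    # average fraction of samples falling below the threshold: the two split
    # distributions clearly differ, unlike with a purely affine rescaling
    print(np.mean(x[None, :] <= t_orig[:, None]))       # close to 0.5
    print(np.mean(f(x)[None, :] <= t_trans[:, None]))   # noticeably smaller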

Hope this helps,
Gilles

On 15 March 2014 21:53, Satrajit Ghosh  wrote:
> hi olivier,
>
> just a question on this statement:
>
>> Random Forests (and decision tree-based models in general) are scale-independent.
>
>
> in many cases with fat data (small samples<50 x many features>10) i have
> found that standardizing helps quite a bit in case of extra trees. i still
> don't have a good understanding as to why this is the case. it could simply
> be small sample bias that i am seeing. but extra trees are also supposed to
> be resilient to overfitting.
>
> any thoughts?
>
> cheers,
>
> satra
>
>
>


Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Satrajit Ghosh
hi olivier,

just a question on this statement:

> Random Forests (and decision tree-based models in general) are scale-independent.
>

in many cases with fat data (small samples<50 x many features>10) i
have found that standardizing helps quite a bit in case of extra trees. i
still don't have a good understanding as to why this is the case. it could
simply be small sample bias that i am seeing. but extra trees are also
supposed to be resilient to overfitting.

any thoughts?

cheers,

satra


Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-15 Thread Kevin Keraudren
Thanks a lot for this detailed answer!
Kind regards,
Kevin


On 14/03/2014 16:37, Olivier Grisel wrote:
> 2014-03-14 15:34 GMT+01:00 Kevin Keraudren :
>> Hi,
>>
>> I have a question related to the range of my input data for SVM or
>> Random Forests for classification:
>> I normalise my input vectors so that their Euclidean norm is one, for
>> instance to limit the influence of the image size or intensity contrast.
>> I got into the habit of then scaling them, multiplying them by a factor of
>> 1000 so that I have values between 0 and 1000 instead of 0 and 1, and thus
>> fewer values "close to zero". I guess it does not hurt to do so, but
>> would you know if it is useful? Do the SVM and Random Forests already do
>> some normalisation before starting to learn the data?
> Random Forests (and decision tree-based models in general) are scale-independent.
>
> SVMs are very sensitive to scaling in the sense that all features
> should vary in the same ranges. The actual width of the ranges should
> not matter much as long as it does not cause numerical stability
> issues (both the 0-1 and 0-1000 ranges should work) and you grid-search
> hyperparameters such as C and gamma for their optimal
> values:
>
> http://scikit-learn.org/stable/model_selection.html
>
> You can use sklearn.preprocessing.StandardScaler to center the data
> (each feature's mean becomes 0) and scale each feature to a standard
> deviation of 1. Scaling between 0 and 1 works well too. This is
> implemented by MinMaxScaler. More discussion in the doc:
>
> http://scikit-learn.org/stable/modules/preprocessing.html
>
> I don't see any reason why the 0-1000 range would work better than the
> 0-1 range.
>
>> I have a similar question for Random Forests for regression: how is
>> the minimal MSE required for a split defined? Here again, if I scale my
>> input by a factor of 1000, should I expect the resulting trees to be
>> different (excluding the random aspect of Random Forests)?
> The decision to stop splitting in a tree is controlled by:
>
> - max_depth
> - min_samples_leaf
> - min_samples_split
>
> Otherwise, the regression trees are fully developed.
>




Re: [Scikit-learn-general] normalising/scaling input for SVM or Random Forests

2014-03-14 Thread Olivier Grisel
2014-03-14 15:34 GMT+01:00 Kevin Keraudren :
> Hi,
>
> I have a question related to the range of my input data for SVM or
> Random Forests for classification:
> I normalise my input vectors so that their Euclidean norm is one, for
> instance to limit the influence of the image size or intensity contrast.
> I got into the habit of then scaling them, multiplying them by a factor of
> 1000 so that I have values between 0 and 1000 instead of 0 and 1, and thus
> fewer values "close to zero". I guess it does not hurt to do so, but
> would you know if it is useful? Do the SVM and Random Forests already do
> some normalisation before starting to learn the data?

Random Forests (and decision tree-based models in general) are scale-independent.

SVMs are very sensitive to scaling in the sense that all features
should vary in the same ranges. The actual width of the ranges should
not matter much as long as it does not cause numerical stability
issues (both the 0-1 and 0-1000 ranges should work) and you grid-search
hyperparameters such as C and gamma for their optimal
values:

http://scikit-learn.org/stable/model_selection.html

You can use sklearn.preprocessing.StandardScaler to center the data
(each feature's mean becomes 0) and scale each feature to a standard
deviation of 1. Scaling between 0 and 1 works well too. This is
implemented by MinMaxScaler. More discussion in the doc:

http://scikit-learn.org/stable/modules/preprocessing.html

I don't see any reason why the 0-1000 range would work better than the
0-1 range.
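
For example, a minimal sketch of that scale-then-tune workflow (the dataset and
grid values are arbitrary placeholders; in current releases GridSearchCV lives
in sklearn.model_selection, in the releases of that era in sklearn.grid_search):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    iris = load_iris()
    X, y = iris.data, iris.target

    # scale the features, then grid-search C and gamma on the scaled data
    pipe = Pipeline([("scale", StandardScaler()),
                     ("svm", SVC(kernel="rbf"))])
    param_grid = {"svm__C": [0.1, 1, 10, 100],
                  "svm__gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)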

> I have a similar question for Random Forests for regression: how is
> the minimal MSE required for a split defined? Here again, if I scale my
> input by a factor of 1000, should I expect the resulting trees to be
> different (excluding the random aspect of Random Forests)?

The decision to stop splitting in a tree is controlled by:

- max_depth
- min_samples_leaf
- min_samples_split

Otherwise, the regression trees are fully developed.
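
In other words, the stopping criteria are expressed in depth and sample counts
rather than as an MSE threshold, so rescaling the inputs by 1000 should not
affect them. For instance (the values below are hypothetical, just to show
where those knobs live):

    from sklearn.ensemble import RandomForestRegressor

    # growth stops only when these limits are hit; with these settings the
    # regression trees are otherwise grown out fully
    reg = RandomForestRegressor(n_estimators=100,
                                max_depth=None,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                random_state=0)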

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
