Re: [scikit-learn] scikit-learn Digest, Vol 19, Issue 37

2017-10-17 Thread Brown J.B. via scikit-learn
2017-10-18 12:18 GMT+09:00 Ismael Lemhadri:

> How about editing the various chunks of code concerned to add an option
> to scale the parameters, set by default to NOT scale? This would make
> the behavior clear without the redundancy Andreas mentioned, and would
> be more convenient for users should they want to scale their data.
>

From my perspective:

That's a very nice, rational idea.
For end users, it preserves compatibility of existing codebases, yet allows
near-effortless updating of code for those who want to use scikit-learn's
scaling, as well as easy adoption for new users and tools.

One point of caution is where the scaling would occur: globally, before
any cross-validation, or per split, with the fitted transformation stored
and applied to the prediction data in each CV fold.
One more keyword argument would be needed to let the user specify this,
and a state variable would have to be stored on, and accessible from, the
methods of the parent estimator.
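
For reference, the per-split variant is what one already gets by composing
the scaler into a Pipeline (a minimal sketch on a toy dataset; the choice
of SVC here is only for illustration):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # The scaler is re-fit on each training split, and its fitted state
    # is applied to the held-out split, so no statistics leak across folds.
    pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
    print(cross_val_score(pipe, X, y, cv=5))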

J.B.



Re: [scikit-learn] scikit-learn Digest, Vol 19, Issue 37

2017-10-17 Thread Ismael Lemhadri
How about editing the various chunks of code concerned to add an option to
scale the parameters, set by default to NOT scale? This would make the
behavior clear without the redundancy Andreas mentioned, and would be more
convenient for users should they want to scale their data.
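
A sketch of what such an opt-in keyword might look like (hypothetical; no
"scale" parameter exists on scikit-learn's PCA today):

    from sklearn.decomposition import PCA

    # Current behavior: PCA centers the data; it never scales it.
    pca = PCA(n_components=2)

    # Hypothetical proposal: explicit opt-in scaling, defaulting to False
    # so that existing code keeps its current behavior.
    # pca = PCA(n_components=2, scale=True)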

Re: [scikit-learn] Unclear help file about sklearn.decomposition.pca

2017-10-17 Thread Raphael C
How about including the scaling that people might want to use in the
User Guide examples?
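
For instance, a snippet along these lines (a minimal sketch with made-up
data, illustrating that PCA centers internally but does not scale):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    # Features on very different scales: without standardization, the
    # last column dominates the first principal component.
    X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

    pca_raw = PCA(n_components=2).fit(X)
    X_std = StandardScaler().fit_transform(X)
    pca_std = PCA(n_components=2).fit(X_std)

    print(pca_raw.explained_variance_ratio_)  # dominated by one feature
    print(pca_std.explained_variance_ratio_)  # far more balanced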

Raphael

On 17 October 2017 at 16:40, Andreas Mueller wrote:
> In general scikit-learn avoids automatic preprocessing.
> That's a convention to give the user more control and decrease surprising
> behavior (ostensibly).
> So scikit-learn will usually do what the algorithm is supposed to do, and
> nothing more.
>
> I'm not sure what the best way to document this is, as this has come up
> with different models.
> For example, the R wrapper of libsvm does automatic scaling, while we
> apply the SVM to the data as given.
>
> We could add "this model does not do any automatic preprocessing" to all
> docstrings, but that seems a bit redundant. We could add it to
> https://github.com/scikit-learn/scikit-learn/pull/9517, but that is
> probably not where you would have looked.
>
> Other suggestions welcome.
>
>
> On 10/16/2017 03:29 PM, Ismael Lemhadri wrote:
>
> Thank you all for your feedback.
> The initial problem I raised wasn't the definition of PCA but what the
> sklearn method does. In practice I would always make sure the data is
> both centered and scaled before performing PCA. This is the recommended
> practice because, without scaling, the feature with the largest scale
> could wrongly appear to explain a huge fraction of the variance.
> So my point was simply to clarify in the help file and the user guide
> what the PCA class does precisely, to leave no ambiguity for the reader.
> Moving forward, I have now submitted a pull request on GitHub, as
> initially suggested by Roman on this thread.
> Best,
> Ismael
>
>> On Mon, 16 Oct 2017 14:44:51 -0400, Andreas Mueller wrote:
>>
>>
>> On 10/16/2017 02:27 PM, Ismael Lemhadri wrote:
>> > @Andreas Mueller:
>> > My references do not assume centering, e.g.
>> > http://ufldl.stanford.edu/wiki/index.php/PCA
>> > any reference?
>> >
>> It kinda does but is not very clear about it:
>>
>> "This data has already been pre-processed so that each of the
>> features x_1 and x_2 have about the same mean (zero) and variance."
>>
>>
>>
>> Wikipedia is much clearer:
>> "Consider a data matrix, X, with column-wise zero empirical mean (the
>> sample mean of each column has been shifted to zero), where each of the
>> n rows represents a different repetition of the experiment, and each of
>> the p columns gives a particular kind of feature (say, the results from
>> a particular sensor)."
>> https://en.wikipedia.org/wiki/Principal_component_analysis#Details
>>
>> I'm a bit surprised to find that ESL says "The SVD of the centered
>> matrix X is another way of expressing the principal components of the
>> variables in X",
>> so they assume centering? They don't really have a great treatment of PCA,
>> though.
>>
>> Bishop and Murphy are pretty clear that they subtract the mean (or
>> assume zero mean) but don't standardize.
>>
>> On Mon, 16 Oct 2017 20:48:29 +0200, Oliver Tomic wrote:
>>
>> Dear Ismael,
>>
>>
>>
>> PCA should always involve at