Re: [scikit-learn] Latent Semantic Analysis (LSA) and TruncatedSVD

2016-08-29 Thread Roman Yurchak
Thank you for all your responses!

For LSA, the two equivalent options, I think, are:
   - to apply an L2 normalization (not the StandardScaler) after the LSA
and then compute the cosine similarity between document vectors simply
as a dot product, or
   - to skip the L2 normalization and call the `cosine_similarity`
function instead (a minimal sketch follows below).
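
A minimal sketch of the equivalence (the toy corpus and n_components
below are made up for illustration):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import Normalizer

    docs = ["the cat sat", "the dog sat", "cats and dogs"]
    X = TfidfVectorizer().fit_transform(docs)
    X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    # Option 1: L2-normalize, then cosine similarity is a plain dot product.
    X_norm = Normalizer(norm="l2").fit_transform(X_lsa)
    sims_dot = np.dot(X_norm, X_norm.T)

    # Option 2: skip the normalization and call cosine_similarity directly.
    sims_cos = cosine_similarity(X_lsa)

    assert np.allclose(sims_dot, sims_cos)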

I have applied this normalization to the previous example, and it
indeed produces equivalent results (i.e. it does not solve the problem).
I am opening an issue on this for further discussion:
   https://github.com/scikit-learn/scikit-learn/issues/7283

Thanks for your feedback!
-- 
Roman

On 28/08/16 18:20, Andy wrote:
> If you do "with_mean=False" it should be the same, right?
> 
> On 08/27/2016 12:20 PM, Olivier Grisel wrote:
>> I am not sure this is exactly the same because we do not center the
>> data in the TruncatedSVD case (as opposed to the real PCA case where
>> whitening is the same as calling StandardScaler).
>>
>> Having an option to normalize the transformed data by sigma seems like
>> a good idea but we should probably not call that whitening.
>>


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Ibrahim Dalal via scikit-learn
Hi,

Is there a way to extract the impurity value of a node in a
DecisionTreeClassifier? I am able to see this value in the graph (using
export_graphviz), but can't figure out how to get it in my code. Is
there an attribute similar to estimator.tree_.children_left?

Thanks

On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu  wrote:

> That should be:
> node_indicator = estimator.tree_.decision_path(X_test)
>
> PR welcome :)
>
> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
>
>> Dear Developers,
>>
>> DecisionTreeClassifier.decision_path() as used here
>> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html
>> is giving the following error:
>>
>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>> 'decision_path'
>>
>> Kindly help.
>>
>> Thanks


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Nelson Liu
Hi,
Yes, it's estimator.tree_.impurity
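
A quick sketch (the iris data and tree parameters are just placeholders,
not from your setup); tree_.impurity holds one value per node, indexed
the same way as tree_.children_left:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

    # One impurity value per node, aligned with tree_.children_left.
    for node_id, impurity in enumerate(clf.tree_.impurity):
        print(node_id, impurity)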

Nelson

On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn <
scikit-learn@python.org> wrote:

> Hi,
>
> Is there a way to extract the impurity value of a node in a
> DecisionTreeClassifier? I am able to see this value in the graph (using
> export_graphviz), but can't figure out how to get it in my code. Is
> there an attribute similar to estimator.tree_.children_left?
>
> Thanks
>
> On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu  wrote:
>
>> That should be:
>> node_indicator = estimator.tree_.decision_path(X_test)
>>
>> PR welcome :)
>>
>> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn <
>> scikit-learn@python.org> wrote:
>>
>>> Dear Developers,
>>>
>>> DecisionTreeClassifier.decision_path() as used here
>>> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html
>>> is giving the following error:
>>>
>>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>>> 'decision_path'
>>>
>>> Kindly help.
>>>
>>> Thanks


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Ibrahim Dalal via scikit-learn
Thanks Nelson.

Is there any way to access the number of training samples in a node?
Thanks

On Mon, Aug 29, 2016 at 8:53 PM, Nelson Liu  wrote:

> Hi,
> Yes, it's estimator.tree_.impurity
>
> Nelson
>
> On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
>
>> Hi,
>>
>> Is there a way to extract the impurity value of a node in a
>> DecisionTreeClassifier? I am able to see this value in the graph (using
>> export_graphviz), but can't figure out how to get it in my code. Is
>> there an attribute similar to estimator.tree_.children_left?
>>
>> Thanks
>>
>> On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu  wrote:
>>
>>> That should be:
>>> node_indicator = estimator.tree_.decision_path(X_test)
>>>
>>> PR welcome :)
>>>
>>> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn <
>>> scikit-learn@python.org> wrote:
>>>
>>>> Dear Developers,
>>>>
>>>> DecisionTreeClassifier.decision_path() as used here
>>>> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html
>>>> is giving the following error:
>>>>
>>>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>>>> 'decision_path'
>>>>
>>>> Kindly help.
>>>>
>>>> Thanks


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-29 Thread Andreas Mueller



On 08/28/2016 01:16 PM, Raphael C wrote:
>
> On Sunday, August 28, 2016, Andy wrote:
>
>> On 08/28/2016 12:29 PM, Raphael C wrote:
>>
>> To give a little context from the web, see e.g.
>> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
>> where it explains:
>>
>> "
>> A question might have come to your mind by now: if we find two matrices
>> \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times \mathbf{Q}
>> approximates \mathbf{R}, isn’t that our predictions of all the unseen
>> ratings will all be zeros? In fact, we are not really trying to come up
>> with \mathbf{P} and \mathbf{Q} such that we can reproduce \mathbf{R}
>> exactly. Instead, we will only try to minimise the errors of the observed
>> user-item pairs.
>> "
>>
>> Yes, the sklearn interface is not meant for matrix completion but
>> matrix-factorization.
>> There was a PR for some matrix completion for missing value imputation at
>> some point.
>>
>> In general, scikit-learn doesn't really implement anything for
>> recommendation algorithms as that requires a different interface.
>
> Thanks Andy. I just looked up that PR.
>
> I was thinking simply producing a different factorisation optimised only
> over the observed values wouldn't need a new interface. That in itself
> would be hugely useful.

Depends. Usually you don't want to complete all values, but only compute
a factorization. What do you return? Only the factors?
The PR implements completing everything, and that you can do with the
transformer interface. I'm not sure what the status of the PR is,
but doing that with NMF instead of SVD would certainly also be interesting.



Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Andreas Mueller



On 08/28/2016 03:23 PM, Nelson Liu wrote:

> That should be:
> node_indicator = estimator.tree_.decision_path(X_test)
>
> PR welcome :)

Was there a reason not to make this a "plot" example?
Would it take too long? Not having examples run by CI is a pretty big
maintenance burden.


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-29 Thread Tom DLT
If X is sparse, explicit zeros and missing-value zeros are **both**
treated as zeros in the objective function.

Changing the objective function wouldn't need a new interface, yet I am not
sure the code change would be completely trivial.
The question is: do we want this new objective function in scikit-learn,
since we have no other recommendation-like algorithm?
If we agree that it would be useful, feel free to send a PR.
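
For concreteness, here is a hypothetical sketch of that masked objective
(this is not scikit-learn's current NMF loss, and the names are made up):

    import numpy as np

    def masked_frobenius_loss(X, W, H, observed):
        # 0.5 * sum over observed (i, j) of (X_ij - (W H)_ij) ** 2;
        # `observed` is a boolean mask, so unobserved entries are ignored
        # instead of being treated as zeros.
        residual = (X - np.dot(W, H)) * observed
        return 0.5 * np.sum(residual ** 2)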

Tom

2016-08-29 17:50 GMT+02:00 Andreas Mueller :

>
>
> On 08/28/2016 01:16 PM, Raphael C wrote:
>
>
>
> On Sunday, August 28, 2016, Andy  wrote:
>
>>
>>
>> On 08/28/2016 12:29 PM, Raphael C wrote:
>>
>> To give a little context from the web, see e.g.
>> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
>> where it explains:
>>
>> "
>> A question might have come to your mind by now: if we find two matrices
>> \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times \mathbf{Q}
>> approximates \mathbf{R}, isn’t that our predictions of all the unseen
>> ratings will all be zeros? In fact, we are not really trying to come up
>> with \mathbf{P} and \mathbf{Q} such that we can reproduce \mathbf{R}
>> exactly. Instead, we will only try to minimise the errors of the observed
>> user-item pairs.
>> "
>>
>> Yes, the sklearn interface is not meant for matrix completion but
>> matrix-factorization.
>> There was a PR for some matrix completion for missing value imputation at
>> some point.
>>
>> In general, scikit-learn doesn't really implement anything for
>> recommendation algorithms as that requires a different interface.
>>
>
> Thanks Andy. I just looked up that PR.
>
> I was thinking simply producing a different factorisation optimised only
> over the observed values wouldn't need a new interface. That in itself
> would be hugely useful.
>
> Depends. Usually you don't want to complete all values, but only compute a
> factorization. What do you return? Only the factors?
> The PR implements completing everything, and that you can do with the
> transformer interface. I'm not sure what the status of the PR is,
> but doing that with NMF instead of SVD would certainly also be interesting.


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-29 Thread Raphael C
On Monday, August 29, 2016, Andreas Mueller  wrote:

>
>
> On 08/28/2016 01:16 PM, Raphael C wrote:
>
>
>
> On Sunday, August 28, 2016, Andy wrote:
>
>>
>>
>> On 08/28/2016 12:29 PM, Raphael C wrote:
>>
>> To give a little context from the web, see e.g.
>> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
>> where it explains:
>>
>> "
>> A question might have come to your mind by now: if we find two matrices
>> \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times \mathbf{Q}
>> approximates \mathbf{R}, isn’t that our predictions of all the unseen
>> ratings will all be zeros? In fact, we are not really trying to come up
>> with \mathbf{P} and \mathbf{Q} such that we can reproduce \mathbf{R}
>> exactly. Instead, we will only try to minimise the errors of the observed
>> user-item pairs.
>> "
>>
>> Yes, the sklearn interface is not meant for matrix completion but
>> matrix-factorization.
>> There was a PR for some matrix completion for missing value imputation at
>> some point.
>>
>> In general, scikit-learn doesn't really implement anything for
>> recommendation algorithms as that requires a different interface.
>>
>
> Thanks Andy. I just looked up that PR.
>
> I was thinking simply producing a different factorisation optimised only
> over the observed values wouldn't need a new interface. That in itself
> would be hugely useful.
>
> Depends. Usually you don't want to complete all values, but only compute a
> factorization. What do you return? Only the factors?
>
> The PR implements completing everything, and that you can do with the
> transformer interface. I'm not sure what the status of the PR is,
> but doing that with NMF instead of SVD would certainly also be interesting.
>

I was thinking you would literally return W and H so that WH approx X. The
user can then decide what to do with the factorisation, just like when
doing SVD.
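
For what it's worth, the existing NMF estimator already hands back the
factors in that spirit (the random matrix below is only for illustration);
the change under discussion would be the loss, not the interface:

    import numpy as np
    from sklearn.decomposition import NMF

    X = np.abs(np.random.RandomState(0).randn(6, 5))  # toy non-negative data
    model = NMF(n_components=2, random_state=0)
    W = model.fit_transform(X)   # shape (6, 2)
    H = model.components_        # shape (2, 5)
    print(np.linalg.norm(X - np.dot(W, H)))  # reconstruction error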

Raphael


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Ibrahim Dalal via scikit-learn
Hi,

What does the estimator.tree_.value array represent? I looked up the source
code but was not able to work out what it is. I am interested in the number
of training samples of each class in a given tree node.

Thanks

On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller  wrote:

>
>
> On 08/28/2016 03:23 PM, Nelson Liu wrote:
>
> That should be:
> node_indicator = estimator.tree_.decision_path(X_test)
>
> PR welcome :)
>
> Was there a reason not to make this a "plot" example?
> Would it take too long? Not having examples run by CI is a pretty big
> maintenance burden.


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Nelson Liu
estimator.tree_.value gives the constant prediction of the tree at each
node. Think of it as what the tree would output if that node was a leaf.

I don't think we have a readily available way of checking the number of
training samples of each class in a given tree node. The closest thing
easily accessible is estimator.tree_.n_node_samples. Getting finer-grained
counts of the number of samples in each class would require modifying the
source code, I think.
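
A short sketch of both arrays (iris is just a stand-in dataset):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

    print(clf.tree_.n_node_samples[0])  # training samples reaching the root
    print(clf.tree_.value[0])           # the node's prediction constant,
                                        # shaped (n_outputs, n_classes) here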

On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn <
scikit-learn@python.org> wrote:

> Hi,
>
> What does the estimator.tree_.value array represent? I looked up the
> source code but was not able to work out what it is. I am interested in
> the number of training samples of each class in a given tree node.
>
> Thanks
>
> On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller  wrote:
>
>>
>>
>> On 08/28/2016 03:23 PM, Nelson Liu wrote:
>>
>> That should be:
>> node_indicator = estimator.tree_.decision_path(X_test)
>>
>> PR welcome :)
>>
>> Was there a reason not to make this a "plot" example?
>> Would it take too long? Not having examples run by CI is a pretty big
>> maintenance burden.


Re: [scikit-learn] Issue with DecisionTreeClassifier

2016-08-29 Thread Joel Nothman
Or just running estimator.tree_.apply(X_train) and inferring from there.
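
Something along these lines, I think (untested sketch; iris stands in for
the real training data, and note tree_.apply expects float32 input):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X_train, y_train = iris.data, iris.target
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # tree_.apply returns the id of the leaf each training sample lands in.
    leaf_ids = clf.tree_.apply(X_train.astype(np.float32))
    for leaf in np.unique(leaf_ids):
        counts = np.bincount(y_train[leaf_ids == leaf], minlength=3)
        print(leaf, counts)  # per-class training-sample counts in this leaf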

On 30 August 2016 at 13:22, Nelson Liu  wrote:

> estimator.tree_.value gives the constant prediction of the tree at each
> node. Think of it as what the tree would output if that node was a leaf.
>
> I don't think we have a readily available way of checking the number of
> training samples of each class in a given tree node. The closest thing
> easily accessible is estimator.tree_.n_node_samples. Getting
> finer-grained counts of the number of samples in each class would require
> modifying the source code, I think.
>
> On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn <
> scikit-learn@python.org> wrote:
>
>> Hi,
>>
>> What does the estimator.tree_.value array represent? I looked up the
>> source code but was not able to work out what it is. I am interested in
>> the number of training samples of each class in a given tree node.
>>
>> Thanks
>>
>> On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller 
>> wrote:
>>
>>>
>>>
>>> On 08/28/2016 03:23 PM, Nelson Liu wrote:
>>>
>>> That should be:
>>> node_indicator = estimator.tree_.decision_path(X_test)
>>>
>>> PR welcome :)
>>>
>>> Was there a reason not to make this a "plot" example?
>>> Would it take too long? Not having examples run by CI is a pretty big
>>> maintenance burden.