[Scikit-learn-general] PEP8 alert!

2014-01-16 Thread Olivier Grisel
PEP8 violations are reaching a critical level causing a risk of code
style meltdown:

https://jenkins.shiningpanda-ci.com/scikit-learn/job/python-2.7-numpy-1.6.2-scipy-0.10.1/violations/

We should be more careful about checking PEP8 compliance prior to merging
PRs from now on :)

And remember kids: nobody expects the PEP8 inquisition.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Thanks for your responses.

@Kyle:
At the risk of sounding really naive, I'd like to make the following
comments. I'm referring to the paper Sukru had posted,
http://www.stat.osu.edu/~dmsl/Sarwar_2001.pdf, which covers item-based
collaborative filtering. I don't think there is really any need for masking
the items that are not selected by the target user (i.e. the user for whom
we need to predict the item rating) here. I believe it would work for
dense cases too. Let's look at a sample session:

from sklearn.recsys import item_cf  # Tentative names.
# Arguments such as the similarity criterion and the number of
# recommendations can be given in __init__.
clf = item_cf()
# Let's say there are n users who have already rated.
# X is a 2-D array whose first dimension is n; the second can vary
# according to the number of items each user has rated.
# y holds the ratings they have provided, either binary (+1 / -1)
# or continuous.
clf.fit(X, y)
# After clf.fit(X, y), an attribute clf.items_ would hold the total
# number of items.
clf.predict(x)  # Returns the top n recommendations for x.
# For each item in clf.items_ not already in x, similarity is
# computed from the top k most similar items in x.

For user based CF, yes we need to provide a mask for the item for which we
need to predict the rating, but I suppose that can be provided in the
__init__ (can't it)?
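To make the item-based scheme above concrete, here is a minimal pure-Python sketch of Sarwar-style item-item similarity and prediction. The toy data and all names (item_vector, predict, etc.) are my own inventions, just to illustrate the idea, not a proposed sklearn implementation:

```python
from math import sqrt

# Toy ratings: one {item: rating} dict per user (data is made up).
ratings = [
    {"ham": 2, "spam": 3},
    {"ham": 1, "spam": 4, "ram": 2},
    {"spam": 5, "ram": 1},
]

def item_vector(item):
    # Ratings of `item` indexed by user; 0 where the user did not rate it.
    return [user.get(item, 0) for user in ratings]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user_ratings, target_item):
    # Similarity-weighted average of the user's own ratings: the core
    # of item-based CF.
    sims = [(cosine(item_vector(item), item_vector(target_item)), r)
            for item, r in user_ratings.items()]
    denom = sum(abs(s) for s, _ in sims)
    return sum(s * r for s, r in sims) / denom if denom else 0.0

# How might a user who rated ham=2 and spam=3 rate ram?
print(predict({"ham": 2, "spam": 3}, "ram"))
```

A real implementation would precompute the item-item similarity matrix offline, which is exactly what makes the item-based variant attractive for large data.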

@Alex and Nick: Thanks for your references, I'll have a look right now.

However, one point I don't intuitively understand is what clf.transform() /
clf.fit_transform() would be doing in these cases. Any pointers? As for
the mentor problem, I don't think it would be an issue if the community
is genuinely interested in this project. If I do get a +1, I can start
thinking about the timeline, the algorithms I'd like to implement, etc. I'm
really looking forward to extending my rather minor scikit-learn work so
far as part of GSoC.
--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
So X is the array of existing ratings, would y be a 2D array then? If not,
how do you map the ratings given back to a single user (since y is
typically, to my knowledge, 1D in sklearn)?

I am still a little confused, but your example helped. Could you go
into a little more detail on X, x, and y?

Let's say for an example of 5 users, 11 total items. That would make X a
5x11 matrix, right? What about y and x?




Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Well, y can be 2-D too; there are estimators like MultiTaskElasticNet
meant especially for multi-task y.

I was thinking something along these lines. Let's say
[ham, spam, ram, bam, tam] are the five items.

and if first user gives
ham - 2
spam - 3

the second user gives
ram - 1
bam - -3
tam - 4

then I was thinking X = [[ham, spam], [ram, bam, tam]] and y =
[[2, 3], []]


-- 
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
I'm extremely sorry, that message got sent halfway through. (I pressed
Ctrl + Enter by mistake.)
X = [[ham, spam], [ram, bam, tam]], and y = [[2, 3], [1, -3, 4]]

and we do clf.fit(X, y).
Suppose we would like to predict what we would recommend to the user x who
has already rated ram as 1 and bam as -3: we do clf.predict([ram, bam],
[1, -3]) and it would give the output. (Both parameters are required.)
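As a toy illustration of this ragged X/y convention and the two-argument predict: the class name ItemCF, and the crude agreement score inside it, are my own inventions purely to show the calling pattern, not the actual similarity computation:

```python
class ItemCF:
    """Toy item-based CF following the ragged X/y convention above."""

    def fit(self, X, y):
        # X: list of item lists per user; y: matching list of rating lists.
        self.users_ = [dict(zip(items, rs)) for items, rs in zip(X, y)]
        self.items_ = sorted({item for items in X for item in items})
        return self

    def predict(self, items, ratings):
        # Both parameters are required: the new user's items and ratings.
        seen = dict(zip(items, ratings))
        scores = {}
        for cand in self.items_:
            if cand in seen:
                continue
            score = 0.0
            for user in self.users_:
                if cand not in user:
                    continue
                # Crude "agreement" with the new user on overlapping items:
                # +1 if the rating signs match, -1 if they disagree.
                overlap = set(user) & set(seen)
                agree = sum(1 if (user[i] > 0) == (seen[i] > 0) else -1
                            for i in overlap)
                score += user[cand] * agree
            scores[cand] = score
        return sorted(scores, key=scores.get, reverse=True)

clf = ItemCF().fit([["ham", "spam"], ["ram", "bam", "tam"]],
                   [[2, 3], [1, -3, 4]])
print(clf.predict(["ram", "bam"], [1, -3]))  # "tam" ranks first
```

The point is only the API shape: fit takes the ragged (X, y) pair, items_ is learned, and predict needs both the item list and the rating list.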

I do not know, however, what clf.transform() or clf.fit_transform() would
do (as of now); the meat of the computation would be done in clf.predict().

-- 
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com
--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


[Scikit-learn-general] Suggestion to add author names/emails at the bottom of module documentations

2014-01-16 Thread Issam
Hi scikit-learn editors,

Any documentation can have mistakes, but it's important to address them 
quickly and efficiently. One plausible way is to contact the author of 
an erroneous text and have them make the proper changes. But, as far as I 
know, scikit-learn's documentation lacks author information, which could 
otherwise help users or experts in the field readily request that false 
information be addressed. Granted, they can raise such issues through 
scikit-learn's mailing list, but not many people are willing to go that 
extra mile of looking up the mailing list address, waiting for approval, 
and keeping track of the mailing list posts, especially professors who are 
rather busy with their research work.

How about opening a PR to add the corresponding author's 
information at the bottom of each module's documentation (just like in 
newspapers)?

Thank you! :) :)



--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread nmura...@masonlive.gmu.edu
I agree that sparse matrices need to be supported: one of the main 
properties inherent to the user/item rating matrix in recommender systems is 
its sparsity, and this sparsity is what has given rise to such a large body 
of research in the field. This property would have to be taken advantage of, 
because otherwise, since we have to deal with large matrices, similarity 
calculations would have complexity through the roof (although there are ways 
to overcome this by using item-item CF techniques, where the similarity 
calculation is done offline but is nevertheless still expensive).

Possible solutions, in my opinion:

1. Support dense and sparse matrices, though I am not sure if such an 
implementation can be directly plugged into sklearn (because of the sparse 
matrix support).

2. Distributed recommender systems (just provide the ability for people to 
distribute their similarity calculations). This can be done using MRJob, a 
Hadoop-streaming wrapper for Python. This is also a current field of 
research, and I'm sure if you look into it you will find quite a lot of 
literature on the topic.

3. I am also currently trying to look into a library called scikit-crab, 
which was started based upon a similar plan, but I heard the developers are 
rewriting the library at the moment and it might not be open to the 
community for active development at present (not sure about this though). I 
just mentioned it thinking that if you took a look at the code, you might 
get some more ideas about what improvements could be made. 
https://github.com/muricoca/crab
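A sparsity-exploiting similarity computation can be sketched without any dense matrices at all, e.g. with a dict-of-dicts layout (purely illustrative, with made-up data; a real implementation would presumably use scipy.sparse CSR matrices):

```python
from collections import defaultdict
from math import sqrt

# Sparse user -> item ratings: only observed entries are stored.
user_ratings = {
    "u1": {"ham": 2, "spam": 3},
    "u2": {"ram": 1, "bam": -3, "tam": 4},
    "u3": {"ham": 1, "ram": 2},
}

# Invert to item -> user, so item-item similarities only ever touch
# users who actually rated both items (the offline item-item CF step).
item_ratings = defaultdict(dict)
for user, items in user_ratings.items():
    for item, r in items.items():
        item_ratings[item][user] = r

def item_similarity(a, b):
    # Cosine similarity; the dot product runs over co-rating users only.
    common = set(item_ratings[a]) & set(item_ratings[b])
    if not common:
        return 0.0
    dot = sum(item_ratings[a][u] * item_ratings[b][u] for u in common)
    na = sqrt(sum(v * v for v in item_ratings[a].values()))
    nb = sqrt(sum(v * v for v in item_ratings[b].values()))
    return dot / (na * nb)

print(item_similarity("ham", "ram"))  # prints 0.4
```

With 99%-sparse data, iterating over observed entries instead of full rows is what keeps the cost proportional to the number of ratings rather than users x items.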


From: Kyle Kastner [kastnerk...@gmail.com]
Sent: Wednesday, January 15, 2014 1:42 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Google Summer of Code 2014

I looked into this once upon a time, and one of the key problems (from talking 
to Jake IIRC) is how to handle the missing values in the input array. You 
would either need a mask, or some kind of indexing system for describing which 
value goes where in the input matrix. Either way, this extra argument would be 
a requirement for CF, but not for the existing algorithms in sklearn.

Maybe it would only operate on sparse arrays, and infer that the values which 
are missing are the ones to be imputed (recommended)? But not supporting 
dense arrays would basically be the opposite of other modules in sklearn, where 
dense input is default. Maybe someone can comment on this?

I don't know how well this lines up with the existing API/functionality and the 
future directions there, but how to deal with the missing values is probably 
the primary concern for implementing CF algorithms in sklearn IMO.
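To make the missing-value concern concrete: with dense input, something has to distinguish "rated 0" from "not rated". A minimal sketch using None as the sentinel (purely illustrative data; a real design might use a mask argument or sparse input instead):

```python
# Dense 3-users x 4-items ratings matrix; None marks "not rated",
# which is distinct from a genuine rating of 0.
X = [
    [5,    None, 3,    None],
    [None, 2,    None, 1],
    [4,    None, None, None],
]

def observed_mask(X):
    # The extra "argument" CF needs: which cells are actually observed.
    return [[v is not None for v in row] for row in X]

def item_means(X):
    # Per-item mean over observed ratings only; a 0-filled dense array
    # would silently drag these means toward 0.
    means = []
    for j in range(len(X[0])):
        col = [row[j] for row in X if row[j] is not None]
        means.append(sum(col) / len(col) if col else None)
    return means

print(observed_mask(X)[0])  # [True, False, True, False]
print(item_means(X))        # [4.5, 2.0, 3.0, 1.0]
```

The sparse-input alternative Kyle mentions amounts to treating every stored entry as observed and every absent entry as "to be imputed", which avoids the sentinel but conflicts with sklearn's dense-by-default convention.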


On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar 
manojkumarsivaraj...@gmail.com wrote:
Hello,

First of all, thanks to the scikit-learn community for guiding new developers. 
I'm thankful for all the help that I've got with my Pull Requests till now.

I hope that this is the right place to discuss GSoC-related ideas (I've 
idled in the scikit-learn IRC channel on quite a few occasions, but I could 
not meet any core developer). I was browsing through the threads from last 
year when I found this idea related to collaborative filtering (CF) quite 
interesting, http://sourceforge.net/mailarchive/message.php?msg_id=30725712 
, though it was sadly not accepted.

If the scikit-learn community is still enthusiastic about a recsys module with 
CF algorithms implemented, I would love this to be my GSoC proposal and we 
could discuss more about the algorithms, gelling with the present sklearn API, 
how much we could possibly fit in a 3 month period etc.

Awaiting a reply.

--
Regards,
Manoj Kumar,
Mech Undergrad
http://manojbits.wordpress.com

--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.netmailto:Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Suggestion to add author names/emails at the bottom of module documentations

2014-01-16 Thread Vlad Niculae
I would rather have this sorted out through the GitHub issue tracker.

I don't think it's a good idea to encourage users to e-mail individual 
developers. Someone else could have the expertise and make the change 
confidently.

My 2c,
Vlad



Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
@Manoj
The predict stage taking 2 parameters is what I was talking about - are
there any other estimators that need anything more than a single matrix to
do a prediction? I do not recall any - this would be something particular
to CF. Maybe you could recast it as a matrix with alternating rows of
item, rating, but that is still particular to CF.

Whether that is OK as far as sklearn's API is concerned is not for me to
decide. I would also expect it to be closely tied to DictVectorizer or
something like it, probably more so than most other algorithms (though this
is not a big deal IMO), to get categorical labels.
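The recasting idea could equally be a DictVectorizer-style preprocessing step that turns per-user {item: rating} dicts into one fixed-width matrix. Here is a hand-rolled sketch of that idea (the class RatingVectorizer is invented for illustration; it is not sklearn's actual DictVectorizer):

```python
class RatingVectorizer:
    """Toy DictVectorizer-style encoder: {item: rating} dicts -> dense rows."""

    def fit(self, dicts):
        # Learn a fixed, sorted item vocabulary from the training dicts.
        self.feature_names_ = sorted({k for d in dicts for k in d})
        return self

    def transform(self, dicts):
        # Unrated items become 0 here, which is exactly the ambiguity
        # discussed above (0 vs. missing); a real version would need a mask.
        return [[d.get(f, 0) for f in self.feature_names_] for d in dicts]

vec = RatingVectorizer().fit([{"ham": 2, "spam": 3},
                              {"ram": 1, "bam": -3, "tam": 4}])
print(vec.feature_names_)                      # ['bam', 'ham', 'ram', 'spam', 'tam']
print(vec.transform([{"ram": 1, "bam": -3}]))  # [[-3, 0, 1, 0, 0]]
```

This would turn the two-argument predict(items, ratings) back into the usual single-matrix predict(X), at the cost of re-introducing the 0-vs-missing problem.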

@nmuralid
I agree totally - the last number I saw was that the typical matrix for
something like Amazon is ~99% sparse, though I don't remember where I read
it. Looking at crab, it seems like they are trying to do a sklearn-style
API specifically for collaborative filtering. Not sure where the name crab
comes from, but it is definitely worth looking at.

Kyle



Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Yes indeed, taking two parameters for predict would be specific to CF.
That was the most obvious idea that came to my mind. I would also like to
hear others' opinions regarding the API and the feasibility of such a
project.



Re: [Scikit-learn-general] Suggestion to add author names/emails at the bottom of module documentations

2014-01-16 Thread Robert Layton
I agree with Vlad.
Further, if there is documentation or a module that none of the active
developers can touch (due to complexity or lack of expertise), the
preference has generally been to remove it from scikit-learn.


On 17 January 2014 05:12, Vlad Niculae zephy...@gmail.com wrote:

 I would rather have this sorted out through the github issue tracker.

 I don't think it's a good idea to encourage users to e-mail individual
 developers. Someone else could have the expertise and do the change
 confidently.

 My 2c,
 Vlad

 On Thu Jan 16 18:12:05 2014, Issam wrote:
  Hi scikit-learn editors,
 
  Any documentation can have mistakes, but it's important to address them
  quickly and efficiently. One plausible way is to contact the author of
  an erroneous text and have them make the proper changes. But, as far as I
  know, scikit-learn's documentation lacks author information, which could
  otherwise help users or experts in the field readily request fixes for
  incorrect information. Clearly, they can address such issues through
  scikit's mailing list, but not many people are willing to go that extra
  mile of looking up the mailing list address, waiting for approval, and
  keeping track of the mailing list posts - especially professors who are
  busy with their research work.
 
  How about if one opens a PR to add the corresponding author's
  information at the bottom of each module's documentation (just like in
  newspapers)?
 
  Thank you! :) :)
 
 
 
 
 --
  CenturyLink Cloud: The Leader in Enterprise Cloud Services.
  Learn Why More Businesses Are Choosing CenturyLink Cloud For
  Critical Workloads, Development Environments & Everything In Between.
  Get a Quote or Start a Free Trial Today.
 
 http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
  ___
  Scikit-learn-general mailing list
  Scikit-learn-general@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/scikit-learn-general






-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from 2011-08-19 (key id: 54BA8735)


Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Kyle Kastner
The other thing to keep in mind is that an ideal solution would be
compatible with Pipeline() - it would be nice to be able to use it there,
which is one of the reasons a different signature for the predict() method
is an issue.

Hopefully something can be figured out, as there is a lot of interest in CF
algorithms, and a large majority of the algorithmic work (at least for the
CF algorithm I looked at) is already present in the NMF code.


On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar manojkumarsivaraj...@gmail.com
 wrote:

 Yes indeed, getting two parameters for predict would be specific to CF.
 That was the most obvious idea that came to my mind. I would like to hear
 other's opinions also regarding the API, and the feasibility of such a
 project.


 On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner kastnerk...@gmail.comwrote:

 @Manoj
 The predict stage taking 2 parameters is what I was talking about - are
 there any other estimators that need anything more than a single matrix to
 do a prediction? I do not recall any - this would be something particular
 to CF. Maybe you could recast it as a matrix with alternating rows of
 (item, rating), but that is still particular to CF.

 Whether that is OK as far as sklearn's API is concerned is not for me to
 decide. I would also expect it to be closely tied with DictVectorizer or
 something like it (probably more so than most other algorithms, though this
 is not a big deal IMO) to get categorical labels.

 @nmuralid
 I agree totally - last number I saw was that the typical matrix for
 something like Amazon is 99% sparse? I don't remember where I read it
 though. Looking at crab, it seems like they are trying to do sklearn-style
 API specifically for collaborative filtering. Not sure where the name crab
 comes in, but it is definitely worth looking at.

 Kyle


 On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu 
 nmura...@masonlive.gmu.edu wrote:

  I agree that sparse matrices need to be supported, as one of the main
 properties inherent to the user/item rating matrix in recommender systems
 is its sparsity. This sparsity is what has given rise to such a large scale
 of research in the field. Hence this property would have to be taken
 advantage of; if not, since we have to deal with huge matrices, similarity
 calculations would have complexity through the roof (although there are
 ways to overcome this by using item-item CF techniques, where similarity
 calculation is done offline, but that is nevertheless still expensive).

  Possible solutions, in my opinion:

 1. Support both dense and sparse matrices - but I am not sure if such an
 implementation can be directly plugged into sklearn (because of the sparse
 matrix support).

 2. Distributed recommender systems (just provide the ability for people to
 distribute their similarity calculations). This can be done using MRJob, a
 Hadoop-streaming wrapper for Python. This is also a current field of
 research, and if you look into it you will find quite a lot of literature
 on the topic.

 3. I am also trying to look into a library called scikit-crab, which was
 started with a similar plan, but I hear the developers are currently
 rewriting it and it might not be open to the community for active
 development at present (not sure about this, though). I mention it because
 if you take a look at the code, you may get more ideas about what
 improvements could be made. https://github.com/muricoca/crab

   --
 *From:* Kyle Kastner [kastnerk...@gmail.com]
 *Sent:* Wednesday, January 15, 2014 1:42 PM
 *To:* scikit-learn-general@lists.sourceforge.net
 *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014

I looked into this once upon a time, and one of the key problems
 (from talking to Jake IIRC) is how to handle the missing values in the
 input array. You would either need a mask, or some kind of indexing system
 for describing which value goes where in the input matrix. Either way, this
 extra argument would be a requirement for CF, but not for the existing
 algorithms in sklearn.

  Maybe it would only operate on sparse arrays, and infer that the values
 which are missing are the ones to be imputed (recommended)? But not
 supporting dense arrays would basically be the opposite of other modules in
 sklearn, where dense input is default. Maybe someone can comment on this?

  I don't know how well this lines up with the existing API/functionality
 and the future directions there, but how to deal with the missing values is
 probably the primary concern for implementing CF algorithms in sklearn IMO.
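A minimal sketch of that sparse-only representation (toy data, scipy only; all names here are illustrative): the sparsity pattern of a COO matrix already records which (user, item) pairs were observed, so unrated pairs are never confused with explicit zero ratings, and no separate mask argument is needed.

```python
import numpy as np
from scipy import sparse

# Observed ratings as (user, item, rating) triplets; any (user, item)
# pair that never appears is "missing", which is distinct from a rating of 0.
users = np.array([0, 0, 1, 2])
items = np.array([0, 2, 1, 0])
vals = np.array([4.0, 2.0, 5.0, 1.0])

R = sparse.coo_matrix((vals, (users, items)), shape=(3, 3))

# The sparsity pattern itself doubles as a boolean mask of observed entries.
observed = sparse.coo_matrix(
    (np.ones_like(vals, dtype=bool), (users, items)), shape=(3, 3))
```

An estimator could then infer "cells absent from the pattern" as the ones to impute, without any change to the fit(X) signature.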


 On Wed, Jan 15, 2014 at 12:07 PM, Manoj Kumar 
 manojkumarsivaraj...@gmail.com wrote:

   Hello,

  First of all, thanks to the scikit-learn community for guiding new
 developers. I'm thankful for all the help that I've got with my Pull
 Requests till now.

  I hope that this is the right place to discuss GSoC related ideas
 (I've idled at the 

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Joel Nothman
`y` is by definition hidden at prediction time for supervised learning, so
I don't think your representation makes sense. But I see this as a
completion problem, not a supervised learning problem: the same data is
observed at training and predict time.

Isn't the following:
X = [['ham', 'spam'], ['ram', 'bam', 'tam']], and y = [[2, 3], [1, -3, 4]]

equivalent to [{'ham': 2, 'spam': 3}, {'ram': 1, 'bam': -3, 'tam': 4}]?

Via DictVectorizer, this becomes equivalent to a sparse COO matrix with:
col = [0, 1, 2, 3, 4]
row = [0, 0, 1, 1, 1]
data = [2, 3, 1, -3, 4]

As far as I can tell, this is a plain old sparse matrix, without a need for
an extra `y`. (Please convince me otherwise!)
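For concreteness, here is that equivalence spelled out with the actual DictVectorizer API (a small sketch; note that DictVectorizer orders columns alphabetically, so the row/col indices come out in a different order than the hand-written ones above, and I take y's value of 1 for 'ram'):

```python
from sklearn.feature_extraction import DictVectorizer

# Per-user {item: rating} dicts, as in the example above.
ratings = [{'ham': 2, 'spam': 3}, {'ram': 1, 'bam': -3, 'tam': 4}]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(ratings)  # scipy sparse matrix of shape (2, 5)

# A plain old sparse matrix: row/col/data triplets, no extra `y` needed.
coo = X.tocoo()
```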

There are still issues of whether this is in scikit-learn scope. For
example, does it make sense with sklearn's cross validation? Or will you
want to cross validate on both axes? Given that there is plenty of work to
be done that is well within scikit-learn's scope (prominent alternative
solutions and utilities for problems it already solves), I think this
extension of scope needs to be argued.


On 17 January 2014 09:24, Kyle Kastner kastnerk...@gmail.com wrote:

 The other thing to keep in mind is that an ideal solution would be
 compatible with Pipeline() - it would be nice to be able to use it there,
 which is one of the reasons a different signature for the predict() method
 is an issue.

 Hopefully something can be figured out, as there is a lot of interest in CF
 algorithms, and a large majority of the algorithmic work (at least for the
 CF algorithm I looked at) is already present in the NMF code.


 On Thu, Jan 16, 2014 at 1:09 PM, Manoj Kumar 
 manojkumarsivaraj...@gmail.com wrote:

 Yes indeed, getting two parameters for predict would be specific to CF.
 That was the most obvious idea that came to my mind. I would like to hear
 other's opinions also regarding the API, and the feasibility of such a
 project.


 On Thu, Jan 16, 2014 at 11:47 PM, Kyle Kastner kastnerk...@gmail.comwrote:

 @Manoj
 The predict stage taking 2 parameters is what I was talking about - are
 there any other estimators that need anything more than a single matrix to
 do a prediction? I do not recall any - this would be something particular
 to CF. Maybe you could recast it as a matrix with alternating rows of
 (item, rating), but that is still particular to CF.

 Whether that is OK as far as sklearn's API is concerned is not for me to
 decide. I would also expect it to be closely tied with DictVectorizer or
 something like it (probably more so than most other algorithms, though this
 is not a big deal IMO) to get categorical labels.

 @nmuralid
 I agree totally - last number I saw was that the typical matrix for
 something like Amazon is 99% sparse? I don't remember where I read it
 though. Looking at crab, it seems like they are trying to do sklearn-style
 API specifically for collaborative filtering. Not sure where the name crab
 comes in, but it is definitely worth looking at.

 Kyle


 On Thu, Jan 16, 2014 at 11:17 AM, nmura...@masonlive.gmu.edu 
 nmura...@masonlive.gmu.edu wrote:

  I agree that sparse matrices need to be supported, as one of the main
 properties inherent to the user/item rating matrix in recommender systems
 is its sparsity. This sparsity is what has given rise to such a large scale
 of research in the field. Hence this property would have to be taken
 advantage of; if not, since we have to deal with huge matrices, similarity
 calculations would have complexity through the roof (although there are
 ways to overcome this by using item-item CF techniques, where similarity
 calculation is done offline, but that is nevertheless still expensive).

  Possible solutions, in my opinion:

 1. Support both dense and sparse matrices - but I am not sure if such an
 implementation can be directly plugged into sklearn (because of the sparse
 matrix support).

 2. Distributed recommender systems (just provide the ability for people to
 distribute their similarity calculations). This can be done using MRJob, a
 Hadoop-streaming wrapper for Python. This is also a current field of
 research, and if you look into it you will find quite a lot of literature
 on the topic.

 3. I am also trying to look into a library called scikit-crab, which was
 started with a similar plan, but I hear the developers are currently
 rewriting it and it might not be open to the community for active
 development at present (not sure about this, though). I mention it because
 if you take a look at the code, you may get more ideas about what
 improvements could be made. https://github.com/muricoca/crab

   --
 *From:* Kyle Kastner [kastnerk...@gmail.com]
 *Sent:* Wednesday, January 15, 2014 1:42 PM
 *To:* scikit-learn-general@lists.sourceforge.net
 *Subject:* Re: [Scikit-learn-general] Google Summer of Code 2014

I looked into this once upon a time, and one of the key problems
 (from talking to Jake IIRC) is how to 

Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Olivier Grisel
2014/1/16 Joel Nothman joel.noth...@gmail.com:
 There are still issues of whether this is in scikit-learn scope. For
 example, does it make sense with sklearn's cross validation? Or will you
 want to cross validate on both axes? Given that there is plenty of work to
 be done that is well within scikit-learn's scope (prominent alternative
 solutions and utilities for problems it already solves), I think this
 extension of scope needs to be argued.

+1

I would first focus on generic matrix factorization / completion
estimators as unsupervised estimators (using the standard model.fit(X)
API with a scipy sparse X). Then a real CF system could leverage such
building blocks: third-party libraries could build upon scikit-learn
while providing the domain-specific recsys boilerplate.
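As a rough sketch of that direction, using only the current public NMF API (toy data; with the caveat raised earlier in the thread that NMF treats unobserved entries as literal zeros rather than masking them):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

# Toy user x item rating matrix; zeros stand in for unobserved entries
# (a real CF layer on top would need to treat them as missing, not as ratings).
R = sparse.csr_matrix(np.array([
    [5., 3., 0., 1.],
    [4., 0., 0., 1.],
    [1., 1., 0., 5.],
    [0., 1., 5., 4.],
]))

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(R)  # user factors, shape (4, 2)
H = model.components_       # item factors, shape (2, 4)

# Dense low-rank reconstruction: the unobserved cells now hold "predictions".
R_hat = W @ H
```

This is exactly the unsupervised fit(X)-on-sparse-X pattern; the recsys-specific parts (masking, ranking the completed cells per user) would live in the third-party layer.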

-- 
Olivier



[Scikit-learn-general] Automated scikit-learn dev documentation build

2014-01-16 Thread Olivier Grisel
Hi all,

Jaques and I have recently been working on moving the dev
documentation build job out of Fabian's workstation to a server on the
public Rackspace Cloud.

The deployment of the documentation build server is now fully
automated thanks to this script and configuration:

  https://github.com/scikit-learn/sklearn-docbuilder

It uses Apache Libcloud to contact Rackspace, find the IP address of
the docbuilder server (or start an Ubuntu 12.04 VM if no existing VM can
be found), and then uses SaltStack to configure the server:

- create a non-root user sklearn
- install a pair of ssh keys suitable for sourceforge upload
- install all build dependencies using system packages
- create a venv to install more recent versions of additional Python
packages (such as sphinx)
- clone the scikit-learn git repo
- configure a cron job that updates the git repo, builds sklearn, builds
the doc and rsyncs it to sourceforge every hour.
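For illustration, the hourly job could look roughly like the following crontab entry (purely hypothetical paths, build command and rsync target - the actual entry is defined in the salt states):

```
# hypothetical sketch - the real entry lives in the salt configuration
0 * * * * cd /home/sklearn/scikit-learn && git pull -q && python setup.py build_ext --inplace && cd doc && make html && rsync -az _build/html/ USER@web.sourceforge.net:htdocs/dev/
```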

The README.md has more info. If you are curious, the salt configuration is here:

  https://github.com/scikit-learn/sklearn-docbuilder/tree/master/srv/salt

The provisioning script is here:

  https://github.com/scikit-learn/sklearn-docbuilder/blob/master/docbuilder.py

If you are interested in being a co-admin for this, please feel free
to ask me for the Rackspace credentials and the private SSH key.

Thanks again to Rackspace for the free Rackspace Cloud account. I will
issue a pull request to thank them in the doc.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel



Re: [Scikit-learn-general] Google Summer of Code 2014

2014-01-16 Thread Manoj Kumar
Thanks everyone for your quick responses.

1. Could someone point me to a list of GSoC ideas this year?
2. Is it okay if I take up projects related to ideas that have not yet
been implemented? For example, a quick search tells me "Improving GMM" has
not been implemented.

Thanks.