[jira] Issue Comment Edited: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-26 Thread Jeff Eastman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861194#action_12861194
 ] 

Jeff Eastman edited comment on MAHOUT-297 at 4/26/10 9:15 PM:
--

I don't understand why the constructors for Canopy and KMeans Cluster were 
modified to override the given center vector types, as in:

{noformat}
   public Canopy(Vector point, int canopyId) {
     this.setId(canopyId);
-    this.setCenter(point.clone());
-    this.setPointTotal(point.clone());
+    this.setCenter(new RandomAccessSparseVector(point.clone()));
+    this.setPointTotal(getCenter().clone());
     this.setNumPoints(1);
   }
{noformat}

I can appreciate that it might be a performance fix in some situations, but forcing 
the center and total to a different type than that of the argument strikes me as 
bad practice. With input vectors of arbitrary type, shouldn't the clusters 
honor the contract to do their math over that type?

I'm -1 on this part of the patch.
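For contrast, a type-preserving constructor would simply clone the argument, so the center and total stay in whatever concrete Vector type the caller supplied. The sketch below uses a minimal stand-in Vector interface for illustration only; it is not Mahout's actual API:

```java
// Minimal stand-in types to illustrate type preservation; not Mahout's real Vector API.
interface Vector {
  Vector copy();
}

final class DenseVector implements Vector {
  public Vector copy() { return new DenseVector(); }
}

final class Canopy {
  private final int id;
  private final Vector center;
  private final Vector pointTotal;

  // Type-preserving: center and total keep the caller's concrete Vector type.
  Canopy(Vector point, int canopyId) {
    this.id = canopyId;
    this.center = point.copy();
    this.pointTotal = center.copy();
  }

  Vector getCenter() { return center; }
}

public class TypePreservingDemo {
  public static void main(String[] args) {
    Canopy c = new Canopy(new DenseVector(), 0);
    // The center is still a DenseVector, not forced to a sparse type.
    System.out.println(c.getCenter().getClass().getSimpleName()); // prints "DenseVector"
  }
}
```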

> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
> MAHOUT-297.patch, MAHOUT-297.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861171#action_12861171
 ] 

Ted Dunning commented on MAHOUT-305:


{quote}
Ted says he ... doesn't like throwing out the low-count co-occurrences.

I agree, in the sense that low-count doesn't mean unimportant. It's LLR that 
figures out whether it's meaningless or contains a lot of info.
{quote}

Close.  But I would go further and say that, on average, individual high-count 
data records are generally less useful than low-count ones, and they are 
quadratically more expensive to deal with.  That combination of much higher 
expense and considerably lower value makes it seem a good idea to nuke (i.e., 
downsample) those records rather than lose the low-count stuff.

Dropping low-count items in the combiner is even worse, since there might have 
been quite a number scattered around that could have added up to interesting 
levels.
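Ted's suggestion, capping high-count items instead of discarding low-count co-occurrences, can be sketched as per-item reservoir sampling. The cap value, data layout, and method names below are illustrative assumptions, not Mahout code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class DownsampleDemo {

  // Keep at most maxPerItem interactions per item, chosen uniformly at random
  // (classic reservoir sampling), so high-frequency items are capped while
  // low-frequency items survive untouched.
  static Map<Long, List<Long>> downsample(List<long[]> userItemPairs,
                                          int maxPerItem, Random rnd) {
    Map<Long, List<Long>> kept = new HashMap<>();
    Map<Long, Integer> seen = new HashMap<>();
    for (long[] pair : userItemPairs) {
      long user = pair[0];
      long item = pair[1];
      int n = seen.merge(item, 1, Integer::sum); // how many times we've seen this item
      List<Long> users = kept.computeIfAbsent(item, k -> new ArrayList<>());
      if (users.size() < maxPerItem) {
        users.add(user);
      } else if (rnd.nextInt(n) < maxPerItem) {
        users.set(rnd.nextInt(maxPerItem), user); // replace a random kept entry
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<long[]> pairs = new ArrayList<>();
    for (long u = 0; u < 1000; u++) pairs.add(new long[] {u, 42L}); // hot item
    pairs.add(new long[] {7L, 99L});                                // rare item
    Map<Long, List<Long>> capped = downsample(pairs, 100, new Random(0));
    System.out.println(capped.get(42L).size()); // 100: hot item capped
    System.out.println(capped.get(99L).size()); // 1: rare item kept
  }
}
```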



> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.




[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Richard Simon Just (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861159#action_12861159
 ] 

Richard Simon Just commented on MAHOUT-371:
---

Excellent! I haven't downloaded the latest MEAP version of MiA yet, so that 
would be great. Not sure if it has changed much, but I will re-read the version 
I have and start looking at a more detailed design before consulting mahout-dev.

> [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
> ---
>
> Key: MAHOUT-371
> URL: https://issues.apache.org/jira/browse/MAHOUT-371
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Richard Simon Just
>
> Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++ 
> Recommender using Hadoop
> Student Name: Richard Simon Just
> Student E-mail:i...@richardsimonjust.co.uk
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract: 
> During the Netflix Prize Challenge one of the most popular forms of 
> Recommender algorithm was that of Matrix Factorisation, in particular 
> Singular Value Decomposition (SVD). As such this proposal looks to implement 
> a distributed version of one of the most successful SVD-based recommender 
> algorithms from the Netflix competition. Namely, the SVD++ algorithm. 
> The SVD++ algorithm improves upon other basic SVD algorithms by incorporating 
> implicit feedback[1]. That is to say, it is able to take into account not just 
> explicit user preferences, but also feedback such as, in the case of a 
> company like Netflix, whether a movie has been rented. Implicit feedback 
> assumes that the fact of there being some correlation between the user and 
> the item is more important than whether the correlation is positive or 
> negative. Implicit feedback would account for an item having been rated, but 
> not what the rating was.
> The implementation will include testing, in-depth documentation, and a 
> demo/tutorial. If there is time, I will also look at developing the algorithm 
> into the timeSVD++ algorithm[2,3]. The timeSVD++ further improves the results 
> of the SVD algorithm by taking into account temporal dynamics, i.e. the way 
> users' preferences in items, and their behaviour in how they rate items, can 
> change over time. According to [2], the gains in accuracy from implementing 
> timeSVD++ are significantly bigger than the gains going from SVD to SVD++. 
> The overall project will provide three deliverables:
> 1. The basic framework for distributed SVD-based recommender
> 2. A distributed SVD++ implementation and demo
> 3. A distributed timeSVD++ implementation
> Detailed Description:
> The SVD++ algorithm uses the principle of categorising users and items into 
> factors, combined with regularisation and implicit feedback, to predict how 
> well a user is likely to match an item. Factors are abstract categories 
> created by comparing the data presented. Factor values are grades saying how 
> much each user/item is related to that category. For example, with the 
> Netflix data a factor could loosely correspond to a movie genre, a director, 
> or a story line. The more factors used, the more detailed the categories are 
> likely to become, and thus the more accurate the predictions are likely to 
> become. 
> Implicit feedback is based on the theory that a user having any sort of 
> relationship to an item is more important than whether they rated it, or what 
> rating they gave. The assumption is that even if a user does not like an 
> item, or has not rated it, the very fact that they chose to have some 
> interaction with it indicates something about their preferences. In the 
> Netflix case this would be represented by whether a user has rated a movie or 
> not; it could also take into account the user's rental history. 
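For reference, the SVD++ prediction rule the proposal describes (from Koren's KDD 2008 paper) combines global and per-user/item biases with explicit factors and normalized implicit-feedback factors. This is a sketch of the formula only, with illustrative parameter names, not project code:

```java
public class SvdPlusPlusDemo {

  // Sketch of the SVD++ prediction rule (Koren, KDD 2008); names are illustrative:
  //   r(u,i) = mu + b_u + b_i + q_i . (p_u + |N(u)|^(-1/2) * sum_{j in N(u)} y_j)
  // where N(u) is the set of items the user gave implicit feedback on.
  static double predict(double mu, double bu, double bi,
                        double[] qi, double[] pu, double[][] yImplicit) {
    int numFactors = qi.length;
    double norm = yImplicit.length == 0 ? 0.0 : 1.0 / Math.sqrt(yImplicit.length);
    double dot = 0.0;
    for (int k = 0; k < numFactors; k++) {
      double implicitSum = 0.0;
      for (double[] yj : yImplicit) {
        implicitSum += yj[k]; // implicit-feedback factors for items the user touched
      }
      dot += qi[k] * (pu[k] + norm * implicitSum);
    }
    return mu + bu + bi + dot;
  }

  public static void main(String[] args) {
    // Toy numbers: global mean 3.5, small biases, 2 factors, one implicit item.
    double r = predict(3.5, 0.1, -0.2,
        new double[] {1.0, 0.0}, new double[] {0.5, 0.5},
        new double[][] {{1.0, 1.0}});
    System.out.println(r); // 3.5 + 0.1 - 0.2 + 1.0*(0.5 + 1.0) = 4.9 (up to fp rounding)
  }
}
```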
> As well as the actual implementation of the code, the project has two other 
> deliverable focuses: the readability, documentation, and testing of the code; 
> and a full tutorial and demo of the code. It is felt that without these things 
> the implementation is not really complete or fully usable. 
> The recommender consists of 5 main parts. The job class that runs the code, 
> an input conversion section, a training section, a re-trainer section and a 
> prediction section. A brief overview of these sections follows.
> The SVD++  Classes:
> The Recommender Job class:
> This class is the foundation of the recommender and allows it to run on 
> Hadoop by implementing the Tool interface through AbstractJob. This class 
> will parse any user arguments and set up the jobs that will run the algorithm 
> on Map/Reduce, much in the same way as Mahout's other distributed recommenders 
> do, such as RecommenderJob.
> In

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861154#action_12861154
 ] 

Sean Owen commented on MAHOUT-371:
--

Your schedule maps it out well. In the next month, get to the point where you 
can start writing code.

I will send you Chapter 6 of Mahout in Action, which explains pretty well the 
structure of one current distributed recommender implementation. It should walk 
you through most of your setup steps, including setting up Hadoop.

I'd come up with a mental model of how you'll set up the computation on Hadoop. 
This is the tricky part and worth talking through on mahout-dev.

> [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
> ---
>
> Key: MAHOUT-371
> URL: https://issues.apache.org/jira/browse/MAHOUT-371
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Richard Simon Just
>

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Richard Simon Just (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861149#action_12861149
 ] 

Richard Simon Just commented on MAHOUT-371:
---

Awesome! I won't lie, I'm super excited! Thank you!

Oh you're practically down the road. I'd love to meet up at some point after my 
exams.

In the meantime, where do we go from here?
Cheers
RSJ

> [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
> ---
>
> Key: MAHOUT-371
> URL: https://issues.apache.org/jira/browse/MAHOUT-371
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Richard Simon Just
>

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861144#action_12861144
 ] 

Sean Owen commented on MAHOUT-305:
--

Ted says he likes LLR, and doesn't like throwing out the low-count 
co-occurrences.

I agree, in the sense that low-count doesn't mean unimportant. It's LLR that 
figures out whether it's meaningless or contains a lot of info.

I think the sentiment reduces to: this would be a better system if LLRs were 
used instead of simple co-occurrence counts as weights, which is right. It 
would involve the whole step of computing all item-item LLRs, right? Which can 
be done.

My vision is to start with this simple system and work towards generalizing, so 
I can stick in a different means of generating the weights matrix, and 
different strategies for pruning.

So if generalizing to create a second, LLR-based system comes next, does it 
make sense to leave in the dumb co-occurrence-based system as well? Meh, 
probably for now. So what's the appropriately dumb pruning method for 
co-occurrence counts?

Since pruning a co-occurrence means setting its count to 0, it made sense to me 
that the error from pruning is minimized by pruning those with lowest counts 
(already closest to 0).

(By the way I meant 'running well' in the sense of quickly; I haven't run much 
evaluation of the output yet.)
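For concreteness, the item-item LLR being discussed is Dunning's log-likelihood ratio over a 2x2 contingency table of co-occurrence counts. The sketch below mirrors the entropy formulation used in Mahout's math library, but it is a self-contained illustration, not Mahout's actual class:

```java
public class LlrDemo {

  static double xLogX(double x) {
    return x == 0.0 ? 0.0 : x * Math.log(x);
  }

  // Entropy-style term over unnormalized counts: xLogX of the total minus
  // the sum of xLogX of each cell.
  static double entropy(double... counts) {
    double sum = 0.0;
    double sumXLogX = 0.0;
    for (double x : counts) {
      sum += x;
      sumXLogX += xLogX(x);
    }
    return xLogX(sum) - sumXLogX;
  }

  // Dunning's log-likelihood ratio for a 2x2 table:
  // k11 = both items, k12 = only A, k21 = only B, k22 = neither.
  static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Perfectly correlated events score high; independent events score ~0.
    System.out.println(llr(1, 0, 0, 1));     // = 4*ln(2) ≈ 2.77
    System.out.println(llr(10, 10, 10, 10)); // ≈ 0.0
  }
}
```

High scores mark co-occurrences carrying real information regardless of raw count, which is why low-count pairs need not be thrown away wholesale.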

> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>




Re: [jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ted Dunning
On Mon, Apr 26, 2010 at 1:46 PM, Sean Owen (JIRA)  wrote:

> Ted how do you like to pick which items to pay attention to for
> co-occurrence? I'm looking for something simple to start.
>

LLR is my standard answer.


>
> Though it's running pretty well (well a lot better than it was) at the
> moment, with the aggressive combiner chucking out low-frequency
> co-occurrence.
>

That still worries me.  I would expect that you would get better results by
down-sampling high-frequency items.


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861095#action_12861095
 ] 

Sean Owen commented on MAHOUT-305:
--

I'm about to commit another pass at this since it's getting better and better; 
it would formally collapse the two together per this thread. Then some more 
ideas that I didn't already include can be added.

Ted how do you like to pick which items to pay attention to for co-occurrence? 
I'm looking for something simple to start.

Though it's running pretty well (well a lot better than it was) at the moment, 
with the aggressive combiner chucking out low-frequency co-occurrence.

> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>




[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861087#action_12861087
 ] 

Sean Owen commented on MAHOUT-371:
--

Looks like this was accepted to GSoC, nice. Let the warmup period begin.
Incidentally, I am located in the UK too -- London.

> [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop
> ---
>
> Key: MAHOUT-371
> URL: https://issues.apache.org/jira/browse/MAHOUT-371
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: Richard Simon Just
>

Re: [GSOC] Congrats to all students

2010-04-26 Thread Sisir Koppaka
Thanks everyone!

This is a fantastic opportunity, and I'll try to make the best of this for
myself, as well as Mahout. Hopefully, we'll have a great compilation of deep
learning networks within the next few releases.

BTW, congrats to everyone on Mahout becoming a TLP!

On Tue, Apr 27, 2010 at 1:13 AM, Grant Ingersoll wrote:

> Looks like student GSOC announcements are up (
> http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010).
>  Mahout got quite a few projects (5) accepted this year, which is a true
> credit to the ASF, Mahout, the mentors, and most of all the students!  We
> had a good number of very high quality student proposals for Mahout this
> year and it was very difficult to choose.  Of the ones selected, I think
> they all bode well for the future of Mahout and the students.
>
> For those who didn't make the cut, I know it's small consolation, but I
> would encourage you all to stay involved in open source, if not Mahout
> specifically.  We'd certainly love to see you contributing here as many of
> you had very good ideas.
>
> At any rate, for everyone, keep an eye out on the Mahout project, as you
> should be seeing lots of exciting features coming to Mahout soon in the form
> of scalable Neural Networks, Restricted Boltzmann Machines (recommenders),
> SVD-based recommenders, EigenCuts Spectral Clustering and Support Vector
> Machines (SVM)!
>
> Should be an exciting summer!
>
> -Grant




-- 
SK


[GSOC] Congrats to all students

2010-04-26 Thread Grant Ingersoll
Looks like student GSOC announcements are up 
(http://socghop.appspot.com/gsoc/program/list_projects/google/gsoc2010).  
Mahout got quite a few projects (5) accepted this year, which is a true credit 
to the ASF, Mahout, the mentors, and most of all the students!  We had a good 
number of very high quality student proposals for Mahout this year and it was 
very difficult to choose.  Of the ones selected, I think they all bode well for 
the future of Mahout and the students.

For those who didn't make the cut, I know it's small consolation, but I would 
encourage you all to stay involved in open source, if not Mahout specifically.  
We'd certainly love to see you contributing here as many of you had very good 
ideas.

At any rate, for everyone, keep an eye out on the Mahout project, as you should 
be seeing lots of exciting features coming to Mahout soon in the form of 
scalable Neural Networks, Restricted Boltzmann Machines (recommenders), 
SVD-based recommenders, EigenCuts Spectral Clustering and Support Vector 
Machines (SVM)!

Should be an exciting summer!

-Grant

[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-26 Thread Jeff Eastman (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860981#action_12860981
 ] 

Jeff Eastman commented on MAHOUT-236:
-

Ok, the above patch was committed on the 21st and is now in trunk. What remains 
for this issue is to complete the CDbw calculations from the now-computed 
representative points. Robin, do you have any implementation code for this or 
should I start from scratch?

> Cluster Evaluation Tools
> 
>
> Key: MAHOUT-236
> URL: https://issues.apache.org/jira/browse/MAHOUT-236
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Reporter: Grant Ingersoll
> Attachments: MAHOUT-236.patch, MAHOUT-236.patch, MAHOUT-236.patch, 
> MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-385:
-

Attachment: MAHOUT-385.patch

> Unify Vector Writables
> --
>
> Key: MAHOUT-385
> URL: https://issues.apache.org/jira/browse/MAHOUT-385
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.3
>Reporter: Sean Owen
>Priority: Minor
> Fix For: 0.4
>
> Attachments: MAHOUT-385.patch
>
>
> Per the mailing list thread, creating an issue to track patches and 
> discussion of unifying vector writables. The essence of my attempt will be 
> attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)
Unify Vector Writables
--

 Key: MAHOUT-385
 URL: https://issues.apache.org/jira/browse/MAHOUT-385
 Project: Mahout
  Issue Type: Improvement
  Components: Math
Affects Versions: 0.3
Reporter: Sean Owen
Priority: Minor
 Fix For: 0.4
 Attachments: MAHOUT-385.patch

Per the mailing list thread, creating an issue to track patches and discussion 
of unifying vector writables. The essence of my attempt will be attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860939#action_12860939
 ] 

Ankur commented on MAHOUT-305:
--

> But the answer is the partitioner ?
Yes

> Am I right that (item1, item2) ->count is all that's needed ?
Yes

> And why is the priority queue needed ...

You could use both a co-occurrence count cutoff (your favorite) and a maximum 
number of co-occurrent pairs (say 1000). I have chosen a size of 100, so for any 
given item the top-100 co-occurrent items (by count) are output. Even with the 
size limited this way, very long histories can still cause an explosion; recall 
from the Netflix dataset the users who have rated more than 10K movies. One way 
of taking care of them is to apply 'sessionization', i.e. output a co-occurrence 
pair only if the two items are part of the same session or satisfy some other 
constraint. But that is not implemented yet. 

> TupleWritable ...
Not really. I have a specialized implementation for my own purpose using 
GenericWritable that wraps each object of TupleWritable.


> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860914#action_12860914
 ] 

Sean Owen commented on MAHOUT-305:
--

OK, I think I get the (item1,item2) -> (item2,count) part, at least, why it can 
be used in conjunction with a one-pass solution.

I wasn't sure how you guarantee that all (item1,item2) for item1 arrive at the 
same reducer. But the answer is the partitioner?

Then it works; I still think there is a need for lots of pruning and big 
combiner buffers after the map but that's different.

Am I right that (item1,item2) -> count is all that's needed?
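For readers following the thread, the partitioner guarantee being asked about can be sketched in plain Java (a hypothetical Item1Partitioner, not Hadoop's actual Partitioner API or Mahout code): hash only item1 of the (item1, item2) composite key, so every pair sharing item1 reaches the same reducer, while the secondary sort still orders pairs by item2 within that reducer.

```java
public class Item1Partitioner {

    // Only item1 of the (item1, item2) composite key enters the hash, so all
    // pairs for a given item1 land on the same reduce task regardless of item2.
    public static int partition(long item1, int numReduceTasks) {
        return Math.floorMod(Long.hashCode(item1), numReduceTasks);
    }

    public static void main(String[] args) {
        // (42, 7) and (42, 99) must go to the same reducer: item2 is never hashed
        int p = partition(42L, 16);
        assert p >= 0 && p < 16;
        assert p == partition(42L, 16);
    }
}
```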

> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860895#action_12860895
 ] 

Sean Owen commented on MAHOUT-305:
--

Most broadly, the input is item1->item2 pairs and the final output of the 
co-occurrence step is item1->((item2,count2),(item3,count3),...) -- rows of the 
co-occurrence matrix. I'm trying to do it in one pass, so need the reducer 
keyed by item1 instead of a pair. So I generate item1->(item2,count) in the 
mapper, combine, and then reduce to the vector in one go.
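The one-pass shape just described can be sketched in-memory in plain Java (no Hadoop; the class and method names here are illustrative only): "map" each user's item list to (item1, item2) pairs, "combine" by summing counts, and "reduce" each item1 into its co-occurrence row.

```java
import java.util.*;

public class OnePassCooccurrence {

    // Build rows of the co-occurrence matrix: for each user's item list, count
    // every ordered (item1, item2) pair with item1 != item2, keyed by item1.
    public static Map<Long, Map<Long, Integer>> rows(List<List<Long>> userItemLists) {
        Map<Long, Map<Long, Integer>> rows = new HashMap<>();
        for (List<Long> items : userItemLists) {
            for (long item1 : items) {
                for (long item2 : items) {
                    if (item1 == item2) {
                        continue;
                    }
                    rows.computeIfAbsent(item1, k -> new HashMap<>())
                        .merge(item2, 1, Integer::sum);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Integer>> r =
            rows(Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(1L, 2L, 3L)));
        // items 1 and 2 co-occur for two users; items 1 and 3 for one
        assert r.get(1L).get(2L) == 2;
        assert r.get(1L).get(3L) == 1;
    }
}
```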


But OK you're suggesting to unpack that, and try to first output 
(item1,item2)->count, and from there a second stage can generate the rows.

In ItemSimilarityEstimator, the map seems to emit (item1,item2)->(item2,count). 
OK, so I understood the 'redundant' item2 in the key to be the secondary sort 
trick, so you can efficiently sum up counts and emit the final 
(item1,item2)->count. But then why is the value (item2,count) and not just 
count?

And why is a priority queue needed? Is it just used to store and emit only the 
top item-item pairs by count? That makes sense, though you've made the size 20; 
isn't that too small? Or is it that the user must set it to a reasonable size? 
Or you could scrap the priority queue and filter based on a count cutoff. I'm 
fond of culling co-occurrences of 1 and keeping everything else. No queue is 
needed for that, though it doesn't cap the size of the resulting matrix. 




What worries me is the size of the map output. Spilling an (item1,item2) pair 
for every co-occurrence is absolutely massive. With P preferences, U users, and 
I items, you're spilling about U*(P/U)^2 pairs = P^2/U. With a billion 
preferences that's getting easily into quadrillions.
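The back-of-envelope estimate above can be written out (illustrative numbers only, assuming preferences spread uniformly over users; real skew makes it worse):

```java
public class SpillEstimate {

    // U users with about P/U preferences each emit roughly (P/U)^2 pairs
    // apiece, so the map output is on the order of U * (P/U)^2 = P^2 / U pairs.
    public static long mapOutputPairs(long preferences, long users) {
        long perUser = preferences / users;   // P/U preferences per user
        return users * perUser * perUser;     // U * (P/U)^2 = P^2 / U
    }

    public static void main(String[] args) {
        // a billion preferences over a million users: about 10^12 pairs spilled
        assert mapOutputPairs(1_000_000_000L, 1_000_000L) == 1_000_000_000_000L;
    }
}
```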

Now of course the combiner, in theory, prevents almost all of this from 
spilling to disk. It sums up counts, so the number of pairs output is more on 
the order of I^2. In practice, I think the combiner doesn't have enough memory. 
Before it has to spill its queue through the combiner, it rarely tallies up the 
same item-item co-occurrence more than once. On a smallish data set I find the 
'hit rate' is about 10% in this regard, even with io.sort.mb increased from a 
healthy 200MB to 1000MB.

And this is what's killing the job, I think, emitting so many low-count pairs. 
So that's why I was trying to be very aggressive in the combiner in throwing 
out data, and maybe need to do more. And being super-aggressive can mean 
capping the size of that intermediate map quite a bit more. And then that also 
kind of addresses the scalability bottleneck issue, and enables this to happen 
in one go anyway.

Or perhaps I'm missing why emitting pairs with counts is actually going to be 
very scalable. It's very attractive after the map phase, but it's the spill 
after the map that's the problem, I think.


TupleWritable is copied and pasted from Hadoop, right?

> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860882#action_12860882
 ] 

Ankur commented on MAHOUT-305:
--

CooccurrenceCombiner caches items internally and increments counts whenever it 
sees a new value. This might lead to memory issues with some really big 
datasets. Moreover, for every (item-id, count) cached, a new object is created 
just to apply a simple procedure. Looks like overkill to me.

With the secondary sort, (item1, item2) pairs are already sorted so that for 
each key (item1) all the (item1, item2) pairs appear before (item1, item3), 
assuming item2 < item3. With this we simply increment the count each time we 
see item2, and push the (item2, count) entry into a priority queue as soon as 
we see item3 or anything else. The size of the priority queue can be limited to 
N. Check out ItemSimilarityEstimator.java.
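A minimal sketch of this run-length counting plus bounded priority queue (a hypothetical class, not the actual ItemSimilarityEstimator code): for one reducer key (item1), secondary sort delivers the co-occurring item2 ids already grouped, so each run is counted and a min-heap of size N keeps only the highest-count items.

```java
import java.util.*;

public class TopNCooccurrence {

    // sortedItem2s is what secondary sort would hand the reducer for one item1:
    // equal item2 ids adjacent. Count each run, keep the n highest counts.
    public static Map<Long, Integer> topN(List<Long> sortedItem2s, int n) {
        PriorityQueue<long[]> heap =
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[1]));
        int i = 0;
        while (i < sortedItem2s.size()) {
            long item2 = sortedItem2s.get(i);
            int count = 0;
            while (i < sortedItem2s.size()
                    && sortedItem2s.get(i).longValue() == item2) {
                count++;   // run of identical item2 ids: just increment
                i++;
            }
            heap.add(new long[] { item2, count });
            if (heap.size() > n) {
                heap.poll();   // evict the current minimum-count entry
            }
        }
        Map<Long, Integer> top = new HashMap<>();
        for (long[] e : heap) {
            top.put(e[0], (int) e[1]);
        }
        return top;
    }

    public static void main(String[] args) {
        // item 2 co-occurs 3 times, item 5 twice, item 3 once; with n = 2
        // the singleton is evicted
        Map<Long, Integer> top = topN(Arrays.asList(2L, 2L, 2L, 3L, 5L, 5L), 2);
        assert top.size() == 2;
        assert top.get(2L) == 3 && top.get(5L) == 2;
    }
}
```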

Agreed we need better facilities for pruning, something like support-count (any 
other?).

About merging, I feel CooccurrenceCombiner would be better off with the 
secondary sort. Also, it would be good if we can retain TupleWritable for 
future use. Other than that, I have no issues with throwing away the code under 
o.a.m.cf.taste.hadoop.cooccurrence.

> Combine both cooccurrence-based CF M/R jobs
> ---
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Ankur
>Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Fwd: announcing new TLPs [was: ASF Board Meeting Summary - April 21, 2010 - new TLP reporting schedule?]

2010-04-26 Thread Sean Owen
Here's my suggested boilerplate -- see below and please suggest edits
if desired. There's a 150 word limit.

Apache Mahout provides scalable implementations of machine learning
algorithms on top of Apache Hadoop. It offers collaborative filtering,
clustering, classification algorithms and more. Begun as a subproject
of Lucene in 2008, Mahout's team of nearly a dozen contributors is now
actively working towards release 0.4. It became an Apache
Top-Level Project on April 26, 2010.

-- Forwarded message --
From: Sally Khudairi 
Date: Mon, Apr 26, 2010 at 12:54 AM
Subject: announcing new TLPs [was: ASF Board Meeting Summary - April
21, 2010 - new TLP reporting schedule?]
To: "Chris A (388J)Mattmann" ,
sro...@apache.org, a...@apache.org, zw...@apache.org, mas...@apache.org,
st...@apache.org
Cc: Apache Board , ASF Marketing & Publicity




Hello new TLPs!
Welcome to the masterlist on apache.org ;-)
The fact that we've got 6 projects graduating from the Incubator is A
Big Thing in the world of ASF Marketing & Publicity. As such, I'd like
to issue a press release announcing you to the world.
Can you please send me "boilerplate" copy describing your Project? I'm
including an example below that's amalgamated from the recent Apache
Cassandra announcement.
If I can get these from you by Tuesday at 5PM ET, we can issue the
announcement on Wednesday. Otherwise, we'll go live the following
Tuesday, 4 May.
Thanks in advance for this. Feel free to ping us if you need anything!
Warm regards,
Sally
(VP, Marketing & Publicity)

About Apache Cassandra

Apache Cassandra is an advanced, second-generation "NoSQL" distributed
data store that has a shared-nothing architecture. The Cassandra
decentralized model provides massive scalability, and is highly
available with no single point of failure even under the worst
scenarios. Originally developed at Facebook and submitted to the ASF
Incubator in 2009, Cassandra graduated as a Top-Level Apache Project
in February 2010, added more than a half-dozen new committers, and is
deployed by dozens of high-profile users such as Cisco WebEx,
Cloudkick, Digg, Facebook, Rackspace, Reddit, and Twitter, among
others.