I'm working on an example (well, examples) of using Mahout with the ASF Public 
Data Set up on Amazon (http://aws.amazon.com/datasets/7791434387204566) and I 
wanted to show how to use the 3 "C's" (collaborative filtering, clustering, 
classification) with the data set.  Clustering and classification are pretty 
straightforward, but I'm wondering about the setup around collaborative 
filtering.

The motivation for recommendations is pretty straightforward:  recommend emails 
that people might find useful based on what other people have interacted with.  
The tricky part is that I am not totally sure what a valid setup of the problem 
looks like.  My current thinking is to build the rec. matrix based on whether 
someone has interacted with (initiated or replied to) a thread.  Thus, the rows 
are the users and the columns are the thread ids.  Each cell contains the number 
of times user X has interacted with thread Y.  This feels to me like a stand-in 
for that user's preference: if they are replying multiple times, they have an 
interest in that topic.  I have no idea whether this will be effective, but it 
seems like it could be interesting.  Does it sound reasonable?  I worry that 
even in a data set as large as the one above, the matrix will simply be too 
sparse.
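
To make the proposed setup concrete, here is a minimal sketch (in Python, not 
Mahout) of building the user x thread count matrix and measuring how sparse it 
is.  The names build_preference_counts and density, and the toy user and thread 
ids, are hypothetical and not taken from the actual data set; it assumes each 
message has already been reduced to a (user, thread id) pair.

```python
from collections import Counter


def build_preference_counts(interactions):
    """Count how many times each user posted to each thread.

    `interactions` is an iterable of (user, thread_id) pairs, one pair per
    message the user sent (either starting the thread or replying to it).
    Returns a Counter keyed by (user, thread_id); absent keys are the
    empty cells of the user x thread matrix.
    """
    return Counter(interactions)


def density(counts):
    """Fraction of user x thread cells that are non-zero."""
    users = {user for user, _ in counts}
    threads = {thread for _, thread in counts}
    if not users or not threads:
        return 0.0
    return len(counts) / (len(users) * len(threads))


# Hypothetical toy data: one pair per message sent.
interactions = [
    ("alice", "t1"), ("alice", "t1"), ("alice", "t2"),
    ("bob", "t1"),
    ("carol", "t3"), ("carol", "t3"), ("carol", "t3"),
]

counts = build_preference_counts(interactions)

# Emit user,thread,count triples, one non-zero cell per line.
for (user, thread), n in sorted(counts.items()):
    print(f"{user},{thread},{n}")

print(f"density: {density(counts):.3f}")
```

Storing only the non-zero cells keeps the representation small however sparse 
the matrix turns out to be, and the density figure gives a quick read on the 
sparsity concern before any recommender is trained.  String ids like these 
would probably need to be mapped to numeric ids before being fed to a 
recommender library.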

Is there a better way to think about this from a strict collaborative filtering 
context?  In other words, I know I could do content-based recommendations but 
that is not what I am after here.

-Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
