I think this is reasonable. Some suggestions:

1. Instead of using the total number of interactions as the cell value, map the count to a 1-5 score based on the histogram of counts.
2. Use an item-item algorithm, which is supposed to work well on sparse data.
3. I think the best algorithm for handling sparse data is SVD.
4. Research shows that accuracy is not the only way to evaluate a CF use case, so you might also want to show explanations (i.e., why each recommendation was made), etc.
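To make suggestion 1 a bit more concrete, here is a rough Python sketch of one way to do the mapping. It is only an illustration: the sample counts and the names `make_scorer` and `interactions` are made up, and using the empirical distribution of the counts is just one possible reading of "based on the histogram".

```python
from bisect import bisect_left

def make_scorer(observed_counts, levels=5):
    """Build a function mapping a raw count to a 1..levels score.

    The score depends on the share of observed counts that are strictly
    smaller than the given count (an empirical-distribution binning).
    """
    ordered = sorted(observed_counts)
    n = len(ordered)

    def score(count):
        below = bisect_left(ordered, count)  # observed counts strictly smaller
        return 1 + min(levels - 1, (levels * below) // n)

    return score

# Hypothetical (user, thread) -> interaction count cells; real data would
# come from the mail archive.
interactions = {
    ("u1", "t1"): 1, ("u1", "t2"): 1, ("u2", "t1"): 1, ("u2", "t3"): 1,
    ("u3", "t2"): 1, ("u3", "t4"): 1, ("u4", "t1"): 2, ("u4", "t4"): 2,
    ("u5", "t3"): 3, ("u5", "t5"): 7,
}

score = make_scorer(interactions.values())
ratings = {cell: score(count) for cell, count in interactions.items()}
```

Because most cells hold a count of 1, some score levels may never be used; that is a property of the data rather than of the mapping.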
Just my 2 cents.

-daniel

--
Daniel Xiaodan Zhou
PhD student
School of Information
University of Michigan
http://michiza.com

On Aug 22, 2011, at 10:48 AM, Grant Ingersoll wrote:

> I'm working on an example (well, examples) of using Mahout with the ASF
> Public Data Set up on Amazon
> (http://aws.amazon.com/datasets/7791434387204566) and I wanted to show how to
> use the 3 "C's" (collab filtering, clustering, classification) with the data
> set. Clustering and classification are pretty straight forward, but I'm
> wondering about the setup around collaborative filtering.
>
> The motivation for recommendations is pretty straightforward: provide people
> recs on emails that they might find useful based on what other people have
> interacted with. The tricky part is I am not totally sure on a valid setup
> of the problem. My current thinking is that I build up the rec. matrix based
> on whether someone has interacted with (initiated/replied) a thread or not.
> Thus, the columns are the thread ids and the rows are the users. Each cell
> contains the count of the number of times user X has interacted with thread
> Y. This feels to me like it is a stand in for that user's preference in that
> if they are replying multiple times, they have an interest in that topic. I
> have no idea if this will be effective or not, but it seems like it could be
> interesting. Does it sound reasonable? I worry that even in a really large
> data set as above it will simply be too sparse.
>
> Is there a better way to think about this from a strict collaborative
> filtering context? In other words, I know I could do content-based
> recommendations but that is not what I am after here.
>
> -Grant
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
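For reference, the setup described in the quoted message (rows are users, columns are thread ids, each cell is an interaction count) could be sketched in Python roughly as follows. The event list and all names here are invented for illustration; the density figure is just a quick way to check the sparsity concern on real data.

```python
from collections import Counter

# Hypothetical events: one (user, thread_id) pair per message a user sent
# to a thread, whether it started the thread or replied to it.
events = [
    ("alice", "t1"), ("alice", "t1"), ("bob", "t1"),
    ("bob", "t2"), ("carol", "t3"), ("alice", "t3"),
]

# Sparse user x thread matrix: only non-zero cells are stored.
counts = Counter(events)

users = sorted({user for user, _ in events})
threads = sorted({thread for _, thread in events})

# Fraction of cells that are non-zero; a very small value signals sparsity.
density = len(counts) / (len(users) * len(threads))
```

On a real archive the density would be expected to be far lower than in this toy example, which is the sparsity worry raised in the message.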
