I ran the factorizer on GroupLens's MovieLens 1M ratings dataset, for 5 iterations with the number of features set to 10. I then constructed an SVDRecommender with the resulting factorization and generated preference estimates for every user/movie pair.
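For reference, the setup looks roughly like this (a simplified sketch: the factorizer class, the lambda value, and the file name are stand-ins for what I'm actually running, and I converted ratings.dat to comma-separated form first):

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class FactorizationRun {
      public static void main(String[] args) throws Exception {
        // MovieLens 1M ratings, converted to "userID,movieID,rating" lines
        DataModel model = new FileDataModel(new File("ml-1m-ratings.csv"));
        // 10 features, 5 iterations; the regularization value is illustrative
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.065, 5);
        SVDRecommender recommender = new SVDRecommender(model, factorizer);
        // spot-check a single user/movie estimate
        System.out.println(recommender.estimatePreference(1L, 1193L));
      }
    }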
For some reason, a good number of the users end up with predictions of "0.0" for every movie; it seems to happen for every user ID above roughly 2700. Could this be a problem with the factorization? I will see if I can reproduce the output, since this looks like a bug rather than expected behavior. (I have also tried to write down the two objectives from Sebastian's reply in a P.S. at the bottom of this mail, to check my understanding.)

On a related note, is there a way to compute the full factorization, save the output, and later retrieve some rank-K approximation? The factorizer takes hours to run, so being able to save factorizations for reuse would be helpful (see the P.P.S. at the bottom for what I have in mind).

----- Original Message -----
From: "Sebastian Schelter" <[email protected]>
To: [email protected]
Sent: Sunday, April 29, 2012 11:31:34 PM
Subject: Re: How does SVDRecommender work in mahout?

Daniel,

You have to distinguish between explicit data (ratings from a predefined scale) and implicit data (counting how often you observed some behavior).

For explicit data, you can't interpret missing values as zeros, because you simply don't know what rating the user would give. In order to still use matrix factorization techniques, the decomposition has to be computed differently than with standard SVD approaches. The error function stays the same as with SVD (minimize the squared error between the known entries and the corresponding entries of the product of the factor matrices), but the computation uses only the known entries. That's nothing Mahout-specific; Mahout has implementations of the approaches described in http://sifter.org/~simon/journal/20061211.html and in http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.173.2797&rep=rep1&type=pdf

For implicit data, the situation is different: if you haven't observed a user conducting some behavior with an item, then your matrix should indeed have a 0 in that cell. The problem here is that the user might simply not have had the opportunity to interact with a lot of the items, which means that you can't really 'trust' the zero entries as much as the other entries. There is a great paper that introduces a 'confidence' value for implicit data to solve this problem: www2.research.att.com/~yifanhu/PUB/cf.pdf

Generally speaking, with this technique the factorization uses the whole matrix, but 'favors' the non-zero entries.

--sebastian

2012/4/29 Sean Owen <[email protected]>:
> They're implicitly zero as far as the math goes IIRC
>
> On Sun, Apr 29, 2012 at 10:45 PM, Daniel Quach <[email protected]> wrote:
>> ah sorry, I meant in the context of the SVDRecommender.
>>
>> Your earlier email mentioned that the DataModel does NOT do any subtraction,
>> nor add back in the end, ensuring the matrix remains sparse. Does that mean
>> it inserts zero values?
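P.S. To check my understanding of the two objectives Sebastian describes above (my own notation, not anything from the Mahout docs): for explicit data, the factorization minimizes the squared error over the known ratings only, roughly

    min over P,Q:  sum over known (u,i) of  (r_ui - p_u . q_i)^2  +  lambda * (||p_u||^2 + ||q_i||^2)

while for implicit data (the Hu/Koren/Volinsky paper linked above), every cell of the matrix is used, but weighted by a confidence value:

    min over X,Y:  sum over all (u,i) of  c_ui * (p_ui - x_u . y_i)^2  +  lambda * (sum_u ||x_u||^2 + sum_i ||y_i||^2)

with p_ui = 1 if the behavior was observed and 0 otherwise, and confidence c_ui = 1 + alpha * r_ui, so the zero cells do contribute, just with lower weight. Please correct me if I've mangled that.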

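P.P.S. On the save/reuse question: from my reading of the current code, SVDRecommender can take a PersistenceStrategy, so the factorization is loaded from disk when a saved copy exists instead of being recomputed. The class names below come from that reading, so treat this as an unverified sketch rather than a confirmed API:

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.FilePersistenceStrategy;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class PersistedFactorizationRun {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ml-1m-ratings.csv"));
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.065, 5);
        // loads the factorization from factorization.bin if it exists,
        // otherwise runs the factorizer once and persists the result there
        SVDRecommender recommender = new SVDRecommender(model, factorizer,
            new FilePersistenceStrategy(new File("factorization.bin")));
        System.out.println(recommender.estimatePreference(1L, 1193L));
      }
    }

One caveat on the rank-K part of my question: unlike a true SVD, the ALS features carry no importance ordering, so truncating the saved feature matrices to K columns would not give a principled rank-K approximation; it looks like I would have to re-run the factorizer with numFeatures = K instead.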