Hi all,

This question stems from my use of the alternating least squares (ALS) method in Mahout, but it leans toward the theoretical side. If this is the wrong place for such a question, I apologize up front and will gladly take it to a more appropriate forum if you can suggest one.
I have been thinking about how the distribution of rating data can influence a model built using ALS, or any matrix factorization method for that matter. If I split my data into train and test sets, I can show good performance of the model on the train set. But what might I expect given an uneven distribution of ratings? Imagine a situation where 50% of the ratings are 1s and the rest are spread over 2-5. Will the model be biased towards rating items a 1? Do people pre-process their data to avoid skewed rating distributions?

What about the rating scale itself? For example, compare a [1:3] range with a [1:10] range: with the former you have, say, a 1/3 chance of predicting the correct rating, while with the latter it is 1/10. And when is sparse too sparse? Or can these questions not even be answered in general because they are too system- or context-specific?

Ultimately, I'm trying to figure out under what conditions one would look at a model and say "that is crap", pardon my language. Do any more experienced users have advice on when a factor model would break down, or on any of my points above?

Thanks in advance,
-Chloe
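P.S. To make the chance-level and skew questions concrete, here is a minimal sketch of the trivial baselines I have in mind. It is plain Python, not Mahout code; the function name `baseline_report` and the sample data (50% ones, the rest spread evenly over 2-5) are made up purely for illustration. The idea is only that a model which cannot beat "always predict the most common rating" or "always predict the global mean" is probably not learning much.

```python
import math
from collections import Counter

def baseline_report(ratings, scale):
    """Scores of trivial predictors on a list of ratings (hypothetical helper)."""
    n = len(ratings)
    counts = Counter(ratings)
    mean = sum(ratings) / n
    majority_value, majority_count = counts.most_common(1)[0]
    return {
        # Chance of an exact hit when guessing uniformly over the scale.
        "uniform_guess_accuracy": 1.0 / len(scale),
        # Chance of an exact hit when always predicting the most common rating.
        "majority_value": majority_value,
        "majority_guess_accuracy": majority_count / n,
        # RMSE when always predicting the global mean rating.
        "global_mean": mean,
        "global_mean_rmse": math.sqrt(sum((r - mean) ** 2 for r in ratings) / n),
    }

# Made-up data: 50% ones, the remaining half spread evenly over 2..5.
ratings = [1] * 400 + [2] * 100 + [3] * 100 + [4] * 100 + [5] * 100
report = baseline_report(ratings, scale=range(1, 6))

for name, value in report.items():
    print(f"{name}: {value}")
```

With this made-up distribution, a uniform guess is right 20% of the time, but always predicting 1 is right 50% of the time, so the skew matters more than the width of the scale when judging exact-match accuracy.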
