Hi all,

This question stems from my use of the alternating least squares (ALS) method in
Mahout, but leans toward the theoretical side. If this is the wrong place for
such a question, I apologize up front and would gladly direct my question
to a more appropriate forum, as per your suggestions.

I have been thinking about how the distribution of rating data can
influence a model built using ALS or any matrix factorization method for
that matter.

If I split my data into train and test sets, I can show good performance of
the model on the train set. What might I expect given an uneven
distribution of ratings? Imagine a situation where 50% of the ratings are
1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do
people pre-process their data to avoid skewed rating distributions? How
about the rating scale itself? For example, given [1:3] vs. [1:10] ranges,
with the former you've got a 1/3 chance of guessing the correct rating,
say, while with the latter it is 1/10. Or, when is sparse too sparse? Or
can these questions not even be answered in general because they are too
system/context specific?
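
To make the baselines I have in mind concrete, here is a minimal Python
sketch. It is my own illustration and has nothing to do with Mahout's API;
the function name baseline_stats and the exact split of the 2-5 ratings
(12.5% each) are assumptions made up for the example. It compares a uniformly
random guess, an "always predict the most common rating" rule, and the RMSE
of always predicting the global mean.

```python
import math

def baseline_stats(dist):
    """dist maps rating value -> probability; probabilities must sum to 1.

    Hypothetical helper for illustration only.
    """
    if abs(sum(dist.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Accuracy of guessing uniformly at random over the k rating values.
    uniform_guess_acc = 1.0 / len(dist)
    # Accuracy of always predicting the single most common rating.
    majority_acc = max(dist.values())
    # RMSE of always predicting the global mean rating
    # (this equals the standard deviation of the distribution).
    mean = sum(r * p for r, p in dist.items())
    rmse_mean = math.sqrt(sum(p * (r - mean) ** 2 for r, p in dist.items()))
    return uniform_guess_acc, majority_acc, mean, rmse_mean

# Assumed example: 50% of ratings are 1s, the rest spread evenly over 2-5.
skewed = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125}
uniform_acc, majority_acc, mean, rmse_mean = baseline_stats(skewed)
print(uniform_acc, majority_acc, mean, rmse_mean)
```

Under these assumed numbers, a rule that always says "1" is right half the
time, far better than the 1/5 of a uniform guess, so I would presumably want
a factor model to beat these trivial baselines on the test set before
calling it good.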

Ultimately, I'm trying to figure out under what conditions one would look
at a model and say "that is crap", pardon my language. Do any more
experienced users have any advice to offer on when a factor model would
break down, or on any of my points above?

Thanks in advance,
-Chloe
