On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo <chloe.gu...@gmail.com> wrote:
> If I split my data into train and test sets, I can show good performance of

Good performance according to what metric? It makes a lot of
difference whether you are talking about precision/recall or RMSE.
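
For example, precision@k only asks whether held-out items show up near
the top of the ranked list, while RMSE compares predicted rating values
against held-out actual values. A rough Python sketch (the function
names and the test-set inputs here are just illustrative, not from any
particular library):

  import math

  # RMSE: how far predicted rating values are, on average,
  # from the actual held-out values.
  def rmse(predicted, actual):
      return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                       / float(len(actual)))

  # Precision@k: of the top-k items the recommender ranked first,
  # what fraction were actually relevant to the user?
  def precision_at_k(recommended, relevant, k):
      hits = sum(1 for item in recommended[:k] if item in relevant)
      return hits / float(k)

The two can disagree: a model with mediocre RMSE can still rank the
right items at the top, and vice versa.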

> the model on the train set. What might I expect given an uneven
> distribution of ratings? Imagine a situation where 50% of the ratings are
> 1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do

In the general case, recommenders don't rate items at all; they rank
items. So this may not be a question that matters.
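
That is, what matters is the ordering of the predicted scores, not
whether the scores themselves land on the original rating scale. A
tiny sketch (all names and numbers made up):

  # Hypothetical predicted scores for items this user hasn't rated.
  scores = {'itemA': 1.2, 'itemB': 4.7, 'itemC': 3.1, 'itemD': 0.4}

  # The recommendation is just the top-N by score; the absolute
  # values never have to be interpreted as ratings at all.
  top_n = sorted(scores, key=scores.get, reverse=True)[:3]
  print(top_n)  # ['itemB', 'itemC', 'itemA']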

> about the rating scale itself. For example, given [1:3] vs [1:10] ranges,
> with the former, you've got a 1/3 chance of predicting the correct
> rating, say, while in the latter case it is a 1/10.  Or, when is sparse too

Why do you say that? The recommender is not choosing ratings randomly.


> Ultimately, I'm trying to figure out under what conditions one would look
> at a model and say "that is crap", pardon my language. Do any more

You use evaluation metrics, which are imperfect, but about the best
you can do in the lab. If you're already doing that, you're doing
fine. This is true no matter what your input is like.

If your input is something like click counts, then the values will
certainly be mostly 1 and follow a power-law distribution. This is no
problem, but you want to follow the 'implicit feedback' version of
ALS, where you are not trying to reconstruct the input but instead
use it as weights.
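
Concretely, in that formulation (the Hu/Koren/Volinsky-style implicit
ALS; the snippet below is my own sketch, and alpha is just a typical
confidence parameter, not something prescribed by your data), a click
count is never the target you try to reproduce; it only controls how
heavily the 0/1 preference is weighted:

  # r = raw click count for one (user, item) pair
  def preference(r):
      return 1.0 if r > 0 else 0.0   # the value ALS reconstructs: just 0/1

  def confidence(r, alpha=40.0):
      return 1.0 + alpha * r         # weight on that 0/1 term

  # The loss is roughly:
  #   sum over (u, i) of confidence(r_ui) * (preference(r_ui) - x_u . y_i)^2
  # so a count of 1 and a count of 500 both target "1", just with very
  # different weights, instead of being targets themselves.

That is why the heavy skew toward small counts isn't a problem in
this setup.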
