That's right. In the formulation you are referring to, you are not
predicting the original input values, so you can't compare the output
against them with RMSE or a similar error metric.

To test precision / recall you hold out some of the top-rated items
(these are the "relevant results") and see how many come back in the
recommendations. F1 is derived from precision and recall. (For boolean
data there are no ratings to tell you which items are "top", so you
hold out randomly chosen items instead, and the test is somewhat flawed
by nature.)
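
As a rough illustration of the hold-out test, here is a minimal Python
sketch (not Mahout code). The function names, the extra item i4, and the
choice of which items were held out are all made up for the example; the
scores for i1..i3 are the ones from your message. Note that only the
order of the scores matters, not their scale.

```python
def precision_recall_at_k(scores, held_out, k):
    """Precision and recall of the top-k items (by score) against held-out items."""
    # Rank items by the model's connection strength, strongest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = ranked[:k]
    relevant = set(held_out)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scores for u2; i4 is a hypothetical extra item added for illustration.
scores = {"i1": 28.94, "i2": 17.9, "i3": 4.5, "i4": 1.2}
# Assumption: i1 and i3 were the top-rated items hidden from the model.
held_out = {"i1", "i3"}

precision, recall = precision_recall_at_k(scores, held_out, k=2)
print(precision, recall, f1_score(precision, recall))
```

In practice you would average these numbers over many users rather than
look at a single one.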

nDCG captures more, since it rewards placing relevant results nearer
the top of the list. It's a somewhat better metric.
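
A minimal sketch of nDCG in Python, to show how position enters the
score. This assumes the hidden 1-5 ratings are used directly as gains
(some definitions use 2^rating - 1 instead) and a log2 position
discount; the ratings below are invented for the example.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg_at_k(ranked_items, relevance, k):
    """DCG of the top-k ranked items, normalized by the best achievable DCG."""
    gains = [relevance.get(item, 0.0) for item in ranked_items[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical hidden ratings for u2 on a 1-5 scale.
relevance = {"i1": 5, "i2": 3, "i3": 1}

perfect = ndcg_at_k(["i1", "i2", "i3"], relevance, k=3)
reversed_order = ndcg_at_k(["i3", "i2", "i1"], relevance, k=3)
print(perfect, reversed_order)
```

The same three items are returned in both cases, so precision and recall
would be identical; only nDCG distinguishes the good ordering from the
bad one.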

The same goes for ROC: it should be fairly direct to apply once you
decide what your positive and negative classes are.
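
For example, if you treat held-out relevant items as the positive class
and everything else as negative, the area under the ROC curve is just
the fraction of (positive, negative) pairs that the model orders
correctly. A minimal Python sketch under that assumption; the function
name and the choice of positives are made up for illustration.

```python
def roc_auc(scores, positives):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for item, s in scores.items() if item in positives]
    neg = [s for item, s in scores.items() if item not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Connection strengths for u2 from the question.
scores = {"i1": 28.94, "i2": 17.9, "i3": 4.5}

# Assumption: i1 and i2 are the relevant (positive) items, i3 is not.
auc_good = roc_auc(scores, {"i1", "i2"})
# Assumption: i1 and i3 are relevant, so i2 is wrongly ranked above i3.
auc_mixed = roc_auc(scores, {"i1", "i3"})
print(auc_good, auc_mixed)
```

Again, only the ordering of the scores is used, so the fact that your
strengths are on a different scale from the ratings doesn't matter.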

Mahout has some code for computing this sort of thing which you can
directly apply or lift and adapt.

On Fri, Jul 6, 2012 at 11:39 PM, Razon, Oren <[email protected]> wrote:
> Thanks Sean
> I've accidentally continued this thread under the thread you opened, so I'm 
> moving back to my thread :)
>
> I will rephrase the question I've asked there.
> Let's say that, as part of my held-out test, my model finds that user u2's 
> connection to i1 has a strength of 28.94, to i2 17.9, and to i3 4.5.
> The ranking itself which I have (hidden) is on scale of 1-5 (or even binary 
> 0\1 for an example).
>
> Now how could I estimate the ranking I gave for u2 if I only predicted the 
> connection strength he has with each item in order to rank the items while my 
> data is on different scale?
> In other words, the problem definition here is not prediction but ranking, 
> therefore I guess it should use different measures than prediction measures...
>
> Am I missing something?
>
> I'm familiar with precision \ recall \ ROC \ Lift and so on, but I'm not sure I 
> understand how I should use them here.
>
