Depends a bit on what you mean in the example here -- are the 0 values
observed values, or "null", a lack of an observed value?

If they are really 0, then the implementation will calculate the
values you listed. But I think you really mean the input is...

v1=[1.0, 1.0, 1.0,    ]
v2=[   , 1.0, 1.0,    ]
v3=[1.0,    ,    ,    ]
v4=[1.0, 1.0, 1.0, 1.0]
v5=[1.0, 1.0, 1.0, 1.0]

You can't assume the missing values are 0 in general. That may make
sense in some cases, but if, for example, your values are ratings on a
scale of 1 to 5, it amounts to assuming that every unrated item is
completely hated. The results will be nonsense.

(Really, this isn't quite the right example to illustrate that. Try a
dummy data set, pretending these are 1- to 5-star movie ratings, and I
think you'll see that the similarities that result from assuming
missing values are 0 don't make intuitive sense.)
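
Here's a quick way to sanity-check both conventions outside Mahout (a
standalone sketch; the class and method names here are mine, not
Mahout's):

import java.util.HashMap;
import java.util.Map;

public class CosineCheck {

  // Cosine over only the dimensions present in both vectors, which is
  // the convention UncenteredCosineSimilarity uses.
  static double overlapCosine(Map<Integer, Double> x, Map<Integer, Double> y) {
    double dot = 0.0, xx = 0.0, yy = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {
      Double yv = y.get(e.getKey());
      if (yv != null) {
        double xv = e.getValue();
        dot += xv * yv;
        xx += xv * xv;
        yy += yv * yv;
      }
    }
    return dot / (Math.sqrt(xx) * Math.sqrt(yy));
  }

  // "Traditional" cosine over all dimensions, pretending missing
  // values are 0.0.
  static double zeroFilledCosine(Map<Integer, Double> x, Map<Integer, Double> y, int dims) {
    double dot = 0.0, xx = 0.0, yy = 0.0;
    for (int d = 0; d < dims; d++) {
      double xv = x.containsKey(d) ? x.get(d) : 0.0;
      double yv = y.containsKey(d) ? y.get(d) : 0.0;
      dot += xv * yv;
      xx += xv * xv;
      yy += yv * yv;
    }
    return dot / (Math.sqrt(xx) * Math.sqrt(yy));
  }

  public static void main(String[] args) {
    Map<Integer, Double> v1 = new HashMap<Integer, Double>();
    v1.put(0, 1.0); v1.put(1, 1.0); v1.put(2, 1.0);     // v1 above
    Map<Integer, Double> v2 = new HashMap<Integer, Double>();
    v2.put(1, 1.0); v2.put(2, 1.0);                     // v2 above
    System.out.println(overlapCosine(v1, v2));          // 1.0
    System.out.println(zeroFilledCosine(v1, v2, 4));    // ~0.816
  }
}

Swap in 1-to-5-star ratings and you'll see the zero-filled numbers
stop making intuitive sense.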

If you want this behavior, assuming null == 0, that's what
PreferenceInferrer is for. You can inject any default you want,
whichever one makes sense for the data set.
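
For example, something like this (a minimal sketch; FixedValueInferrer
is a name I'm making up, not a Mahout class):

import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;

// Infers the same fixed value for every missing preference.
public class FixedValueInferrer implements PreferenceInferrer {

  private final float defaultValue;

  public FixedValueInferrer(float defaultValue) {
    this.defaultValue = defaultValue;
  }

  @Override
  public float inferPreference(long userID, long itemID) {
    return defaultValue; // e.g. 0.0f to treat null as 0
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // stateless; nothing to refresh
  }
}

Then inject it into a UserSimilarity with something like
similarity.setPreferenceInferrer(new FixedValueInferrer(0.0f));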

On Wed, Aug 22, 2012 at 5:31 PM, Francis Kelly <[email protected]> wrote:
> Thanks very much for your quick reply - I really appreciate it!
>
> My follow-up question is an attempt to better understand this choice
> of implementation.
>
> To take a concrete example, let's suppose that we have a system with 4
> users, so item vectors are 4-dimensional, and we have the following 5
> vectors (I realize this is a completely pathological example, but bear
> with me). We have:
>
> v1=[1.0, 1.0, 1.0, 0.0]
> v2=[0.0, 1.0, 1.0, 0.0]
> v3=[1.0, 0.0, 0.0, 0.0]
> v4=[1.0, 1.0, 1.0, 1.0]
> v5=[1.0, 1.0, 1.0, 1.0]
>
> As I understand it, then, in the definition of
> UncenteredCosineSimilarity, the Cosine Similarity between all of the
> above would be 1.0.
>
> Whereas in the "traditional" definition of Cosine Similarity, we'd
> have the following correlation values:
> cs(v1,v2)=0.816
> cs(v1,v3)=0.577
> cs(v1,v4)=0.866
> cs(v4,v5)=1.0
>
> Assuming I'm correct to this point, could you elaborate a little bit
> on the rationale behind this choice? It would seem to me that, for
> example, v1 and v2 are "more similar" (with 2 ratings in common) than
> v1 and v3 (with just 1 rating in common). But obviously you've
> thought of this already, so I'm curious to understand what I'm missing
> here. I'm guessing it has something to do with your comment that the
> calculation "is only going to make sense if the data indeed has a mean
> of zero by nature."
>
> Thanks for your time on this question and all of your efforts on
> Mahout -- it's a great project.
>
> best,
> Francis
>
> On Wed, Aug 22, 2012 at 5:11 PM, Sean Owen <[email protected]> wrote:
>> The similarity is only defined over the dimensions where both series
>> have a value, yes. So the denominator and numerator are equal in this
>> case, giving a cosine of 1, which is right in the sense that in 1D
>> space the cosine must be 1 or -1 (for scalars, x*y / (|x|*|y|) =
>> sign(x*y)): two vectors can only point in exactly the same or exactly
>> opposite directions. The calculation you're trying is equivalent to
>> pretending that the dimensions with no value have value 0.0. That is
>> only going to make sense if the data indeed has a mean of zero by
>> nature.
>>
>> On Wed, Aug 22, 2012 at 12:27 PM, Francis Kelly <[email protected]> 
>> wrote:
>>> I'm writing with a question about the UncenteredCosineSimilarity
>>> metric in Mahout 0.7 (in the context of a
>>> GenericItemBasedRecommender).
>>>
>>> I'm getting a correlation value that I don't understand and I'm hoping
>>> that someone can explain it to me.
>>>
>>> When I step through the code with a debugger, I find that when I'm
>>> comparing two items in AbstractItemSimilarity.java at lines 265-266, we
>>> have:
>>>
>>> PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
>>> PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);
>>>
>>> Upon inspection, we see the following vectors:
>>>
>>> xPrefs=GenericItemPreferenceArray[itemID:6,{1=0.31,3=0.49,4=0.62}]
>>> yPrefs=GenericItemPreferenceArray[itemID:7,{2=0.43,4=0.21,5=0.52}].
>>>
>>> My understanding of the Cosine Similarity metric is that we take the
>>> dot product of the vectors and divide it by the product of the
>>> vectors' lengths. Assuming that's the case, we should have a
>>> numerator of 0.62 * 0.21 = 0.13, because the above vectors only
>>> overlap for userid=4. For the denominator -- and this is where the
>>> code is confusing me -- I would assume that we would have the product
>>> of the first vector's length (sqrt(0.31^2 + 0.49^2 + 0.62^2) = 0.85)
>>> and the second's (sqrt(0.43^2 + 0.21^2 + 0.52^2) = 0.71).
>>>
>>> The code, however, appears to consider only the places where the
>>> vectors overlap (in other words, userid=4) when computing the lengths.
>>> Thus, when I find myself at line 332:
>>>
>>> result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
>>>
>>> I find that sumX2 = 0.38 = 0.62^2 and sumY2 = 0.044 = 0.21^2. In other
>>> words, sumX2 and sumY2 each consider only the value for userid=4, not
>>> all of the values in each vector.
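>>>
>>> In pseudocode, what I see the loop doing is roughly this (my own
>>> simplification from the debugger, not Mahout's exact code):
>>>
>>> double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
>>> int xIndex = 0, yIndex = 0;
>>> while (xIndex < xPrefs.length() && yIndex < yPrefs.length()) {
>>>   long xUser = xPrefs.getUserID(xIndex);
>>>   long yUser = yPrefs.getUserID(yIndex);
>>>   if (xUser < yUser) {
>>>     xIndex++;                // only item 1 rated: skipped entirely
>>>   } else if (xUser > yUser) {
>>>     yIndex++;                // only item 2 rated: skipped entirely
>>>   } else {                   // both rated by this user (userid=4)
>>>     double x = xPrefs.getValue(xIndex);
>>>     double y = yPrefs.getValue(yIndex);
>>>     sumXY += x * y;
>>>     sumX2 += x * x;
>>>     sumY2 += y * y;
>>>     xIndex++;
>>>     yIndex++;
>>>   }
>>> }
>>> // result = sumXY / (sqrt(sumX2) * sqrt(sumY2))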
>>>
>>> And, indeed, following the code through, the ultimate result it
>>> produces is a correlation value of 1.0 for these vectors: 0.62*0.21 /
>>> (sqrt(0.62^2)*sqrt(0.21^2)). I would have computed a correlation value
>>> of 0.13/(0.85 * 0.71) = 0.22. If someone could explain the discrepancy
>>> to me I'd be extremely grateful.
>>>
>>>
>>> Thanks in advance,
>>> Francis
