Thanks very much for your quick reply - I really appreciate it!

My follow-up question is an attempt to better understand this choice
of implementation.

To take a concrete example, let's suppose we have a system with 4
users, so item vectors are 4-dimensional, and that we have the
following 5 vectors (I realize this is a completely pathological
example, but bear with me):

v1=[1.0, 1.0, 1.0, 0.0]
v2=[0.0, 1.0, 1.0, 0.0]
v3=[1.0, 0.0, 0.0, 0.0]
v4=[1.0, 1.0, 1.0, 1.0]
v5=[1.0, 1.0, 1.0, 1.0]

As I understand it, then, under the definition used by
UncenteredCosineSimilarity (treating the zeros as missing ratings),
the cosine similarity between any pair of the above vectors would be
1.0.

Whereas under the "traditional" definition of cosine similarity, we'd
have the following similarity values:
cs(v1,v2)=0.816
cs(v1,v3)=0.577
cs(v1,v4)=0.866
cs(v4,v5)=1.0
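
To make the comparison concrete, here's a small sketch of both
calculations (my own illustration, not Mahout's actual code; I'm
treating every 0.0 entry as "no rating"):

public class CosineDemo {

  // "Traditional" cosine: use every dimension, zeros and all.
  static double fullCosine(double[] x, double[] y) {
    double dot = 0.0, x2 = 0.0, y2 = 0.0;
    for (int i = 0; i < x.length; i++) {
      dot += x[i] * y[i];
      x2 += x[i] * x[i];
      y2 += y[i] * y[i];
    }
    return dot / (Math.sqrt(x2) * Math.sqrt(y2));
  }

  // Overlap-only cosine: restrict all three sums to the dimensions
  // where both vectors have a value (here, where both are non-zero).
  static double overlapCosine(double[] x, double[] y) {
    double dot = 0.0, x2 = 0.0, y2 = 0.0;
    for (int i = 0; i < x.length; i++) {
      if (x[i] != 0.0 && y[i] != 0.0) {
        dot += x[i] * y[i];
        x2 += x[i] * x[i];
        y2 += y[i] * y[i];
      }
    }
    return dot / (Math.sqrt(x2) * Math.sqrt(y2));
  }

  public static void main(String[] args) {
    double[] v1 = {1.0, 1.0, 1.0, 0.0};
    double[] v2 = {0.0, 1.0, 1.0, 0.0};
    double[] v3 = {1.0, 0.0, 0.0, 0.0};
    double[] v4 = {1.0, 1.0, 1.0, 1.0};
    // prints 0.816 0.577 0.866 -- the "traditional" values above
    System.out.printf("%.3f %.3f %.3f%n",
        fullCosine(v1, v2), fullCosine(v1, v3), fullCosine(v1, v4));
    // prints 1.000 1.000 1.000 -- every pair looks identical
    System.out.printf("%.3f %.3f %.3f%n",
        overlapCosine(v1, v2), overlapCosine(v1, v3), overlapCosine(v1, v4));
  }
}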

Assuming I'm correct up to this point, could you elaborate a little on
the rationale behind this choice? It would seem to me that, for
example, v1 and v2 are "more similar" (with 2 ratings in common) than
v1 and v3 (with just 1 rating in common). But obviously you've thought
of this already, so I'm curious to understand what I'm missing here.
I'm guessing it has something to do with your comment that the
calculation "is only going to make sense if the data indeed has a mean
of zero by nature."
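
In case it's useful, here's a minimal self-contained sketch that
should reproduce the 1.0 result with the data from my first mail below
(assuming I've got the 0.7 Taste API right):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.PreferenceArray;

public class UncenteredCosineRepro {

  // One user's ratings as parallel arrays of item IDs and values.
  private static PreferenceArray prefs(long userID, long[] items,
      float[] values) {
    PreferenceArray p = new GenericUserPreferenceArray(items.length);
    for (int i = 0; i < items.length; i++) {
      p.setUserID(i, userID);
      p.setItemID(i, items[i]);
      p.setValue(i, values[i]);
    }
    return p;
  }

  public static void main(String[] args) throws TasteException {
    // Same data as below: item 6 is rated by users 1, 3 and 4;
    // item 7 by users 2, 4 and 5; only user 4 has rated both.
    FastByIDMap<PreferenceArray> data = new FastByIDMap<PreferenceArray>();
    data.put(1L, prefs(1L, new long[] {6L}, new float[] {0.31f}));
    data.put(2L, prefs(2L, new long[] {7L}, new float[] {0.43f}));
    data.put(3L, prefs(3L, new long[] {6L}, new float[] {0.49f}));
    data.put(4L, prefs(4L, new long[] {6L, 7L}, new float[] {0.62f, 0.21f}));
    data.put(5L, prefs(5L, new long[] {7L}, new float[] {0.52f}));

    UncenteredCosineSimilarity similarity =
        new UncenteredCosineSimilarity(new GenericDataModel(data));

    // Prints 1.0: only the single overlapping dimension (user 4) counts.
    System.out.println(similarity.itemSimilarity(6L, 7L));
  }
}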

Thanks for your time on this question and all of your efforts on
Mahout -- it's a great project.

best,
Francis

On Wed, Aug 22, 2012 at 5:11 PM, Sean Owen <[email protected]> wrote:
> The similarity is only defined over the dimensions where both series
> have a value, yes. So the denominator and numerator are equal in this
> case, giving a cosine of 1, which is right in the sense that in 1D
> space the cosine must be 1 or -1; two vectors can only point in
> exactly the same or exactly opposite directions. The calculation
> you're trying is equivalent to pretending that the dimensions with no
> value have value 0.0. That is only going to make sense if the data
> indeed has a mean of zero by nature.
>
> On Wed, Aug 22, 2012 at 12:27 PM, Francis Kelly <[email protected]> 
> wrote:
>> I'm writing with a question about the UncenteredCosineSimilarity
>> metric in Mahout 0.7 (in the context of a
>> GenericItemBasedRecommender).
>>
>> I'm getting a correlation value that I don't understand and I'm hoping
>> that someone can explain it to me.
>>
>> When I step through the code with a debugger, I find that when I'm
>> comparing two items in AbstractItemSimilarity.java at lines 265-266,
>> we have:
>>
>> PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
>> PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);
>>
>> Upon inspection, we see the following vectors:
>>
>> xPrefs=GenericItemPreferenceArray[itemID:6,{1=0.31,3=0.49,4=0.62}]
>> yPrefs=GenericItemPreferenceArray[itemID:7,{2=0.43,4=0.21,5=0.52}].
>>
>> My understanding of the cosine similarity metric is that we take the
>> dot product of the vectors and divide it by the product of the
>> vectors' lengths. Assuming that's the case, we should have a
>> numerator of 0.62 * 0.21 = 0.13, because the above vectors only
>> overlap for userid=4. For the denominator -- and this is where the
>> code is confusing me -- I would assume that we would have the product
>> of the first vector's length (sqrt(0.31^2 + 0.49^2 + 0.62^2) = 0.84)
>> and the second's (sqrt(0.43^2 + 0.21^2 + 0.52^2) = 0.70).
>>
>> The code, however, appears only to consider the places the vectors
>> overlap (in other words, userid=4) to compute the lengths. Thus, when
>> I find myself at line 332:
>>
>> result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
>>
>> I find that sumX2 = 0.38 = 0.62^2 and sumY2 = 0.044 = 0.21^2. In
>> other words, sumX2 and sumY2 each consider only the value for
>> userid=4, not all of the values in each vector.
>>
>> And indeed, following the code through, the ultimate result it
>> produces is a similarity value of 1.0 for these vectors: 0.62*0.21 /
>> (sqrt(0.62^2)*sqrt(0.21^2)). I would have computed a value of
>> 0.13/(0.84 * 0.70) = 0.22. If someone could explain the discrepancy
>> to me, I'd be extremely grateful.
>>
>>
>> Thanks in advance,
>> Francis
