I'm writing with a question about the UncenteredCosineSimilarity
metric in Mahout 0.7 (in the context of a
GenericItemBasedRecommender).

I'm getting a correlation value that I don't understand and I'm hoping
that someone can explain it to me.

When I step through the code with a debugger while comparing two
items, I reach lines 265-266 of AbstractSimilarity.java, where we
have:

PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);

Upon inspection, we see the following vectors:

xPrefs=GenericItemPreferenceArray[itemID:6,{1=0.31,3=0.49,4=0.62}]
yPrefs=GenericItemPreferenceArray[itemID:7,{2=0.43,4=0.21,5=0.52}].

My understanding of the Cosine Similarity metric is that we take the
dot product of the vectors and divide it by the product of the
vectors' lengths. Assuming that's the case, we should have a
numerator of 0.62 * 0.21 = 0.13, because the above vectors only
overlap for userid=4. For the denominator -- and this is where the
code is confusing me -- I would assume that we would have the product
of the first vector's length (sqrt(0.31^2 + 0.49^2 + 0.62^2) = 0.84)
and the second's (sqrt(0.43^2 + 0.21^2 + 0.52^2) = 0.70).
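
To make that expectation concrete, here is a minimal standalone sketch
of the calculation I have in mind. It is not Mahout code: the class
and method names (FullVectorCosine, cosine) are made up for
illustration, preferences are held in plain maps keyed by user ID, and
a user missing from one item is assumed to contribute 0. It uses
Map.of, so it assumes Java 9 or later.

```java
import java.util.Map;

// Hypothetical illustration, not part of Mahout.
class FullVectorCosine {

    // Cosine over the full vectors: the dot product only picks up users
    // present in both maps, but each length uses every value in its own map.
    static double cosine(Map<Long, Double> x, Map<Long, Double> y) {
        double dot = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (Map.Entry<Long, Double> e : x.entrySet()) {
            double xv = e.getValue();
            sumX2 += xv * xv;
            Double yv = y.get(e.getKey());
            if (yv != null) {
                dot += xv * yv;
            }
        }
        for (double yv : y.values()) {
            sumY2 += yv * yv;
        }
        return dot / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
    }

    public static void main(String[] args) {
        // The two preference arrays shown above, keyed by user ID.
        Map<Long, Double> item6 = Map.of(1L, 0.31, 3L, 0.49, 4L, 0.62);
        Map<Long, Double> item7 = Map.of(2L, 0.43, 4L, 0.21, 5L, 0.52);
        System.out.println(cosine(item6, item7));
    }
}
```

Carried to more digits than my hand calculation, the two lengths are
nearer 0.849 and 0.707, and the result is about 0.217.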

The code, however, appears only to consider the places the vectors
overlap (in other words, userid=4) to compute the lengths. Thus, when
I find myself at line 332:

result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);

I find that sumX2 = 0.38 = 0.62^2 and sumY2 = 0.044 = 0.21^2. In other
words, sumX2 and sumY2 each consider only the value for userid=4
rather than all of the values in their respective vectors.

And, indeed, when I follow the code through, the ultimate result it
produces is a correlation value of 1.0 for these vectors: 0.62*0.21 /
(sqrt(0.62^2)*sqrt(0.21^2)). I would have computed a correlation value
of 0.13/(0.84 * 0.70) = 0.21. If someone could explain the discrepancy
to me I'd be extremely grateful.
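
In case it helps, here is a standalone sketch of the overlap-only
calculation as I understand it from the debugger. It is a simplified
reconstruction, not the Mahout source: the class and method names
(OverlapOnlyCosine, cosine) are hypothetical, and only the variable
names sumXY, sumX2 and sumY2 are taken from the call shown above. It
uses Map.of, so it assumes Java 9 or later.

```java
import java.util.Map;

// Hypothetical illustration, not part of Mahout.
class OverlapOnlyCosine {

    // Accumulates all three sums only over users present in BOTH maps.
    // Returns NaN when the items share no users (0 divided by 0).
    static double cosine(Map<Long, Double> x, Map<Long, Double> y) {
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (Map.Entry<Long, Double> e : x.entrySet()) {
            Double yv = y.get(e.getKey());
            if (yv == null) {
                continue; // user rated only one of the items: skipped entirely
            }
            double xv = e.getValue();
            sumXY += xv * yv;
            sumX2 += xv * xv;
            sumY2 += yv * yv;
        }
        return sumXY / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
    }

    public static void main(String[] args) {
        // The two preference arrays shown above, keyed by user ID.
        Map<Long, Double> item6 = Map.of(1L, 0.31, 3L, 0.49, 4L, 0.62);
        Map<Long, Double> item7 = Map.of(2L, 0.43, 4L, 0.21, 5L, 0.52);
        System.out.println(cosine(item6, item7));
    }
}
```

With a single shared user and positive preference values, the
numerator and the denominator in this sketch are the same product, so
the result is 1.0 (up to floating-point rounding) whatever the two
values are.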


Thanks in advance,
Francis
