Yes, the similarity is only defined over the dimensions where both
series have a value. Here that is a single dimension (userid=4), so
the numerator and denominator are equal and the cosine is 1. That is
right in the sense that in 1D space the cosine must be 1 or -1; two
vectors can only point in exactly the same or exactly opposite
directions. The calculation you're trying is equivalent to pretending
that the dimensions with no value have the value 0.0. That only makes
sense if the data naturally has a mean of zero.

On Wed, Aug 22, 2012 at 12:27 PM, Francis Kelly <[email protected]> wrote:
> I'm writing with a question about the UncenteredCosineSimilarity
> metric in Mahout 0.7 (in the context of a
> GenericItemBasedRecommender).
>
> I'm getting a correlation value that I don't understand and I'm hoping
> that someone can explain it to me.
>
> When I step through the code with a debugger, I find that when I'm
> comparing two items in AbstractItemSimilarity.java at lines 265-266, we
> have:
>
> PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
> PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);
>
> Upon inspection, we see the following vectors:
>
> xPrefs=GenericItemPreferenceArray[itemID:6,{1=0.31,3=0.49,4=0.62}]
> yPrefs=GenericItemPreferenceArray[itemID:7,{2=0.43,4=0.21,5=0.52}].
>
> My understanding of the Cosine Similarity metric is that we take the
> dot product of the vectors and divide it by the product of the
> vectors' lengths. Assuming that's the case, we should have a
> numerator of 0.62 * 0.21 = 0.13 because the above vectors only
> overlap for userid=4. For the denominator -- and this is where the
> code is confusing me -- I would assume that we would have the product
> of the first vector length (sqrt(0.31^2 + 0.49^2 + 0.62^2) = 0.84) and
> the second (sqrt(0.43^2 + 0.21^2 + 0.52^2)= 0.70).
>
> The code, however, appears only to consider the places the vectors
> overlap (in other words, userid=4) to compute the lengths. Thus, when
> I find myself at line 332:
>
> result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
>
> I find that sumX2 = 0.38 = 0.62^2 and sumY2 = 0.044 = 0.21^2. In other
> words, sumX2 and sumY2 each consider only the value for userid=4,
> not all values in each vector.
>
> And, indeed, following the code through, the ultimate result it
> produces is a correlation value of 1.0 for these vectors: 0.62*0.21 /
> (sqrt(0.62^2)*sqrt(0.21^2)). I would have computed a correlation value
> of 0.13/(0.84 * 0.70) = 0.21. If someone could explain the discrepancy
> to me I'd be extremely grateful.
>
>
> Thanks in advance,
> Francis
