Thanks, John and Sean, that clarifies things quite a bit.
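To make sure I've really got it, I wrote the tiny standalone sketch below. This is
my own code, not Mahout's actual implementation -- the class and method names are
just mine. It computes the cosine over only the user IDs present in both preference
maps, i.e. the overlap-only behavior described in this thread, and it reproduces
the 1.0 I saw in the debugger for my two item vectors:

import java.util.HashMap;
import java.util.Map;

public class OverlapCosineSketch {

  // Cosine over only the dimensions (user IDs) that appear in BOTH maps.
  // Dimensions missing from either map are skipped, not treated as 0.0.
  static double overlapCosine(Map<Long, Double> x, Map<Long, Double> y) {
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (Map.Entry<Long, Double> e : x.entrySet()) {
      Double yVal = y.get(e.getKey());
      if (yVal == null) {
        continue; // no value for this user ID in the other vector: skip it
      }
      double xVal = e.getValue();
      sumXY += xVal * yVal;
      sumX2 += xVal * xVal;
      sumY2 += yVal * yVal;
    }
    if (sumX2 == 0.0 || sumY2 == 0.0) {
      return Double.NaN; // no overlapping dimensions at all
    }
    return sumXY / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
  }

  public static void main(String[] args) {
    // The two item vectors from my original mail, keyed by user ID.
    Map<Long, Double> x = new HashMap<Long, Double>();
    x.put(1L, 0.31); x.put(3L, 0.49); x.put(4L, 0.62);
    Map<Long, Double> y = new HashMap<Long, Double>();
    y.put(2L, 0.43); y.put(4L, 0.21); y.put(5L, 0.52);

    // Only user 4 overlaps, so this is effectively a 1-D cosine: prints 1.0
    System.out.println(overlapCosine(x, y));
  }
}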
On Wed, Aug 22, 2012 at 5:48 PM, John Conwell <[email protected]> wrote:
> I think the key point here is that these vectors should logically be
> thought of as sparse vectors (not sure how they are represented in Mahout).
> If the value in a vector at some position i is empty, it is essentially
> not part of the calculation. And only positions i that have a value in
> both vectors can be used as part of the calculation for the denominator.
>
> On Wed, Aug 22, 2012 at 2:41 PM, Sean Owen <[email protected]> wrote:
>
>> Depends a bit on what you mean in the example here -- are the 0 values
>> observed values, or "null", a lack of an observed value?
>>
>> If they are really 0, then the implementation will calculate the
>> values you listed. But I think you really mean the input is...
>>
>> v1=[1.0, 1.0, 1.0,    ]
>> v2=[   , 1.0, 1.0,    ]
>> v3=[1.0,    ,    ,    ]
>> v4=[1.0, 1.0, 1.0, 1.0]
>> v5=[1.0, 1.0, 1.0, 1.0]
>>
>> You can't assume the missing values are 0 in general. That may make
>> sense in some cases, but, for example, if your values are ratings on a
>> scale of 1 to 5, this amounts to assuming that all unrated items are
>> completely hated. The results will be nonsense.
>>
>> (Really, this isn't the right example to truly illustrate that; try a
>> dummy data set pretending that these are 1- to 5-star movie ratings,
>> and I think you'll see that the similarities that result from assuming
>> they're 0 don't make intuitive sense.)
>>
>> If you want this behavior, to assume null == 0, that's what the
>> PreferenceInferrer is for. You can inject any default you want, the
>> one that makes sense for the data set.
>>
>> On Wed, Aug 22, 2012 at 5:31 PM, Francis Kelly <[email protected]> wrote:
>> > Thanks very much for your quick reply - I really appreciate it!
>> >
>> > My follow-up question is an attempt to better understand this choice
>> > of implementation.
>> >
>> > To take a concrete example, let's suppose that we have a system with 4
>> > users, so item vectors are 4-dimensional, and we have the following 5
>> > vectors (I realize this is a completely pathological example, but bear
>> > with me). We have:
>> >
>> > v1=[1.0, 1.0, 1.0, 0]
>> > v2=[0.0, 1.0, 1.0, 0]
>> > v3=[1.0, 0, 0, 0]
>> > v4=[1.0, 1.0, 1.0, 1.0]
>> > v5=[1.0, 1.0, 1.0, 1.0]
>> >
>> > As I understand it, then, under the definition used by
>> > UncenteredCosineSimilarity, the cosine similarity between any pair of
>> > the above vectors would be 1.0.
>> >
>> > Whereas under the "traditional" definition of cosine similarity, we'd
>> > have the following similarity values:
>> > cs(v1,v2)=0.816
>> > cs(v1,v3)=0.577
>> > cs(v1,v4)=0.866
>> > cs(v4,v5)=1.0
>> >
>> > Assuming I'm correct to this point, could you elaborate a little bit
>> > on the rationale behind this choice? It would seem to me that, for
>> > example, v1 and v2 are "more similar" (with 2 ratings in common) than
>> > v1 and v3 (with just 1 rating in common). But obviously you've
>> > thought of this already, so I'm curious to understand what I'm missing
>> > here. I'm guessing it has something to do with your comment that the
>> > calculation "is only going to make sense if the data indeed has a mean
>> > of zero by nature."
>> >
>> > Thanks for your time on this question and all of your efforts on
>> > Mahout -- it's a great project.
>> >
>> > best,
>> > Francis
>> >
>> > On Wed, Aug 22, 2012 at 5:11 PM, Sean Owen <[email protected]> wrote:
>> >> The similarity is only defined over the dimensions where both series
>> >> have a value, yes.
>> >> So the denominator and numerator are equal in this
>> >> case, giving a cosine of 1, which is right in the sense that in 1-D
>> >> space the cosine must be 1 or -1; two vectors can only point in
>> >> exactly the same or exactly opposite directions. The calculation
>> >> you're trying is equivalent to pretending that the dimensions with no
>> >> value have value 0.0. That is only going to make sense if the data
>> >> indeed has a mean of zero by nature.
>> >>
>> >> On Wed, Aug 22, 2012 at 12:27 PM, Francis Kelly <[email protected]> wrote:
>> >>> I'm writing with a question about the UncenteredCosineSimilarity
>> >>> metric in Mahout 0.7 (in the context of a
>> >>> GenericItemBasedRecommender).
>> >>>
>> >>> I'm getting a similarity value that I don't understand, and I'm
>> >>> hoping that someone can explain it to me.
>> >>>
>> >>> When I step through the code with a debugger, I find that when I'm
>> >>> comparing two items in AbstractItemSimilarity.java at lines 265-266,
>> >>> we have:
>> >>>
>> >>> PreferenceArray xPrefs = dataModel.getPreferencesForItem(itemID1);
>> >>> PreferenceArray yPrefs = dataModel.getPreferencesForItem(itemID2);
>> >>>
>> >>> Upon inspection, we see the following vectors:
>> >>>
>> >>> xPrefs=GenericItemPreferenceArray[itemID:6,{1=0.31,3=0.49,4=0.62}]
>> >>> yPrefs=GenericItemPreferenceArray[itemID:7,{2=0.43,4=0.21,5=0.52}]
>> >>>
>> >>> My understanding of the cosine similarity metric is that we take the
>> >>> dot product of the vectors and divide it by the product of the
>> >>> vectors' lengths. Assuming that's the case, we should have a
>> >>> numerator of 0.62 * 0.21 = 0.13, because the above vectors only
>> >>> overlap for userid=4. For the denominator -- and this is where the
>> >>> code is confusing me -- I would assume that we would have the product
>> >>> of the first vector's length (sqrt(0.31^2 + 0.49^2 + 0.62^2) = 0.85)
>> >>> and the second's (sqrt(0.43^2 + 0.21^2 + 0.52^2) = 0.71).
>> >>>
>> >>> The code, however, appears to consider only the places where the
>> >>> vectors overlap (in other words, userid=4) to compute the lengths.
>> >>> Thus, when I find myself at line 332:
>> >>>
>> >>> result = computeResult(count, sumXY, sumX2, sumY2, sumXYdiff2);
>> >>>
>> >>> I find that sumX2 = 0.38 = 0.62^2 and sumY2 = 0.044 = 0.21^2. In
>> >>> other words, sumX2 and sumY2 each consider only the value for
>> >>> userid=4, not all values in each vector.
>> >>>
>> >>> And, indeed, following the code through, the ultimate result it
>> >>> produces is a similarity value of 1.0 for these vectors: 0.62*0.21 /
>> >>> (sqrt(0.62^2)*sqrt(0.21^2)). I would have computed a similarity
>> >>> value of 0.13/(0.85 * 0.71) = 0.22. If someone could explain the
>> >>> discrepancy to me, I'd be extremely grateful.
>> >>>
>> >>> Thanks in advance,
>> >>> Francis
>
> --
> Thanks,
> John C
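
For anyone who finds this thread later: the 0.816 / 0.577 / 0.866 numbers in my
follow-up mail come from the "traditional" dense cosine, which treats every
missing rating as a literal 0.0 -- exactly the assumption Sean cautions against.
Here is a minimal sketch of that calculation for the v1..v5 example (again my
own standalone code, not anything in Mahout):

public class DenseCosineSketch {

  // Dense cosine: every dimension participates, with missing ratings
  // already filled in as 0.0.
  static double denseCosine(double[] x, double[] y) {
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (int i = 0; i < x.length; i++) {
      sumXY += x[i] * y[i];
      sumX2 += x[i] * x[i];
      sumY2 += y[i] * y[i];
    }
    return sumXY / (Math.sqrt(sumX2) * Math.sqrt(sumY2));
  }

  public static void main(String[] args) {
    double[] v1 = {1.0, 1.0, 1.0, 0.0};
    double[] v2 = {0.0, 1.0, 1.0, 0.0};
    double[] v3 = {1.0, 0.0, 0.0, 0.0};
    double[] v4 = {1.0, 1.0, 1.0, 1.0};
    double[] v5 = {1.0, 1.0, 1.0, 1.0};

    System.out.println(denseCosine(v1, v2)); // 0.816...
    System.out.println(denseCosine(v1, v3)); // 0.577...
    System.out.println(denseCosine(v1, v4)); // 0.866...
    System.out.println(denseCosine(v4, v5)); // 1.0
  }
}

Whether that zero default is sane depends entirely on the data; as Sean says,
a PreferenceInferrer is the place to inject a default deliberately rather than
assuming it silently.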
