Hello everyone,

I'm having problems using the rowsimilarity CLI. While the job runs
successfully, the results are quite different from what I expected.

I have about 10,000 vectors with 100 values each (they are the output of an
LDA run that produced 100 topics). The values of each vector sum to 1. I ran
rowsimilarity using cosine similarity:

sudo mahout rowsimilarity -i docTopicOutputPath -o cosineRowSimilarity -r 100 -s SIMILARITY_COSINE -m 10 -ow


But the results were quite surprising (=weird). Let me give you an
example. Here are the 10 vectors most similar to vector 0:

{0:1.0,5096:0.9999999900023594,2084:0.9999999900006721,9508:0.9999999913198951,5418:0.9999999900023594,5973:0.9999999900023594,850:0.9999999900023594,3520:0.9999999900023594,3810:0.9999999900023594,2332:0.9999999900023594}

So let's compare vector 0 and vector 5096. It is odd enough that the
similarities are all so high, and it makes even less sense given that the
two vectors are completely different. Below are the 10 largest values of
each vector, sorted by size; the remaining values get closer and closer to
zero.


Vector 0:
{77:0.38869500338296026,2:0.19193420441506734,97:0.15439188702913675,51:0.14734148341655237,93:0.02296211673521741,52:0.02231413626205526,54:0.015647218118985836,1:0.014325646147841266,69:0.011511397206979742,72:0.00598588285753453}
Vector 5096:
{29:0.3561851608208644,23:0.2140023076327667,50:0.10323038286664168,46:0.05394590979222972,84:0.03169656621316953,94:0.030441225154472437,85:0.02555904724179884,64:0.024968087669536388,13:0.02295201078820391,80:0.022068065896437266}


As you can see, the highest values are all on completely different
dimensions. Am I misunderstanding how rowsimilarity works, or is my error
not obvious from what I have shown here, so that I should check somewhere
else, such as my input data?
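
For reference, the mismatch can be checked by hand. The following is only a
minimal Python sketch, not Mahout code: it computes plain cosine similarity
over the top-10 entries of vector 0 and vector 5096 quoted above, treating
every dimension not listed as zero (an assumption, since the full 100-value
vectors are not shown). The names cosine, vector_0 and vector_5096 are made
up for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {dimension: value}."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * b[dim] for dim, value in a.items() if dim in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


# Top-10 entries copied from the message; all other dimensions are
# assumed to be zero for the purpose of this check.
vector_0 = {
    77: 0.38869500338296026, 2: 0.19193420441506734,
    97: 0.15439188702913675, 51: 0.14734148341655237,
    93: 0.02296211673521741, 52: 0.02231413626205526,
    54: 0.015647218118985836, 1: 0.014325646147841266,
    69: 0.011511397206979742, 72: 0.00598588285753453,
}
vector_5096 = {
    29: 0.3561851608208644, 23: 0.2140023076327667,
    50: 0.10323038286664168, 46: 0.05394590979222972,
    84: 0.03169656621316953, 94: 0.030441225154472437,
    85: 0.02555904724179884, 64: 0.024968087669536388,
    13: 0.02295201078820391, 80: 0.022068065896437266,
}

shared = set(vector_0) & set(vector_5096)
similarity = cosine(vector_0, vector_5096)
print("shared dimensions:", sorted(shared))
print("cosine over the listed entries:", similarity)
```

The two top-10 lists share no dimension, so the dot product over the listed
entries is zero. The small values that are not listed would change the exact
figure for the full vectors, but this is the kind of result I expected to
see, rather than 0.99999999.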

Thanks in advance for any help!

--David Starina
