Hello everyone, I'm having problems using the rowsimilarity CLI. The job runs successfully, but the results are quite different from what I expected.
I have about 10,000 vectors with 100 values each (the output of an LDA run that created 100 topics). The values of each vector sum to 1. I ran rowsimilarity using cosine similarity:

sudo mahout rowsimilarity -i docTopicOutputPath -o cosineRowSimilarity -r 100 -s SIMILARITY_COSINE -m 10 -ow

The results were quite surprising (= weird). Let me give you an example. Here are the 10 most similar vectors to vector 0:

{0:1.0,5096:0.9999999900023594,2084:0.9999999900006721,9508:0.9999999913198951,5418:0.9999999900023594,5973:0.9999999900023594,850:0.9999999900023594,3520:0.9999999900023594,3810:0.9999999900023594,2332:0.9999999900023594}

So let's compare vector 0 and vector 5096. Similarities this high are odd on their own, and they make even less sense because the two vectors are completely different. Here are the 10 largest values of each vector, sorted by size; the remaining values get progressively closer to zero.

Vector 0:
{77:0.38869500338296026,2:0.19193420441506734,97:0.15439188702913675,51:0.14734148341655237,93:0.02296211673521741,52:0.02231413626205526,54:0.015647218118985836,1:0.014325646147841266,69:0.011511397206979742,72:0.00598588285753453}

Vector 5096:
{29:0.3561851608208644,23:0.2140023076327667,50:0.10323038286664168,46:0.05394590979222972,84:0.03169656621316953,94:0.030441225154472437,85:0.02555904724179884,64:0.024968087669536388,13:0.02295201078820391,80:0.022068065896437266}

As you can see, the highest values are all on completely different dimensions.

Am I misunderstanding how rowsimilarity works, or is the error not visible in what I've shown here, so that I should check somewhere else, such as my input data?

Thanks in advance for any help!

--David Starina
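
For reference, the comparison above can be checked by hand with a minimal Python sketch. It computes the cosine similarity from only the ten listed entries of each vector and treats every other dimension as zero. That is an assumption: the real vectors have 100 values, and the small tail values are not shown here. The helper name cosine is made up for this illustration and is not part of Mahout.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {index: value} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(value * b[index] for index, value in a.items() if index in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)

# Top-10 entries as listed in the post; all other dimensions are ASSUMED to be 0.
vector_0 = {
    77: 0.38869500338296026, 2: 0.19193420441506734, 97: 0.15439188702913675,
    51: 0.14734148341655237, 93: 0.02296211673521741, 52: 0.02231413626205526,
    54: 0.015647218118985836, 1: 0.014325646147841266, 69: 0.011511397206979742,
    72: 0.00598588285753453,
}
vector_5096 = {
    29: 0.3561851608208644, 23: 0.2140023076327667, 50: 0.10323038286664168,
    46: 0.05394590979222972, 84: 0.03169656621316953, 94: 0.030441225154472437,
    85: 0.02555904724179884, 64: 0.024968087669536388, 13: 0.02295201078820391,
    80: 0.022068065896437266,
}

# The two index sets do not overlap, so the dot product is zero.
print(cosine(vector_0, vector_5096))  # prints 0.0
```

With the omitted tail values included, the true cosine would probably be slightly above zero rather than exactly zero, but these listed entries give no reason to expect anything close to 0.99999999.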