This is not a bug; the similarity measure cuts off the results that are returned.
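To illustrate how a similarity cut-off can yield fewer rows than requested, here is a minimal Python sketch. It is not Mahout's code. It assumes that pairs of rows with no terms in common get a Tanimoto coefficient of zero and are dropped before the top-N selection; the function names and the toy documents are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    return dot / denom if denom else 0.0


def top_similar(rows, row_id, max_per_row):
    """Return up to max_per_row (other_id, similarity) pairs, most similar first.

    Assumption for this sketch: pairs with zero similarity are never emitted,
    so the result can be shorter than max_per_row.
    """
    scored = []
    for other_id, vec in rows.items():
        if other_id == row_id:
            continue
        s = tanimoto(rows[row_id], vec)
        if s > 0.0:  # the cut-off: rows that share no terms are dropped
            scored.append((other_id, s))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_per_row]


# Hypothetical toy data: d3 shares no terms with d1.
docs = {
    "d1": {"hadoop": 1.0, "mahout": 2.0},
    "d2": {"mahout": 1.0, "kmeans": 1.0},
    "d3": {"cooking": 3.0},
    "d4": {"hadoop": 2.0},
}

result = top_similar(docs, "d1", 20)
print(result)
```

Here 20 similar rows are requested for d1, but only d2 and d4 come back, because d3 has nothing in common with d1 and is cut off.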
________________________________
From: Pat Ferrel <[email protected]>
To: [email protected]
Sent: Tuesday, May 8, 2012 1:06 PM
Subject: rowsimilarity not creating requested number of similar docs

Using the below data set I ran rowsimilarity asking for 20 similar docs but got anywhere from 1 to 20. Is this the expected behavior? It would be nice to get all 20 so I can see where the similarity starts to drop off.

mahout rowid -i b2/bixo-vectors/tfidf-vectors/part-r-00000 -o b2/bixo-matrix

mahout rowsimilarity \
    -i b2/bixo-matrix/matrix \
    -o b2/bixo-similarity \
    -r 5250 \
    --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
    -m 20 \
    -ess true

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740

Using the same config as below kmeans example. I could file bugs but I'm not sure if this is a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid using
> clustering. Shouldn't I expect small numbers for the closest docs? I have
> assumed the real distance is 1-reported distance but the distances reported
> by rowsimilarity are very small as I'd expect. I was using tanimoto in both
> cases as the distance measure but also tried cosine with similar results.
>
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but in
>> other cases the discrepancy has been greater like ask for 200 and get 38 but
>> that was for a much larger data set.
>>
>> Running on my mac laptop in a single node pseudo cluster hadoop 0.20.205,
>> mahout 0.6
>>
>> command line:
>>
>> mahout kmeans \
>>     -i b2/bixo-vectors/tfidf-vectors/ \
>>     -c b2/bixo-kmeans-centroids \
>>     -cl \
>>     -o b2/bixo-kmeans-clusters \
>>     -k 30 \
>>     -ow \
>>     -cd 0.01 \
>>     -x 20 \
>>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>>
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>>
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of 20
>> but sometimes many less. Shouldn't this always return the requested number?
>> I'll post this question again to the attention of the right person.
>>
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along with
>>> the execution parameters?
>>>
>>> You can also open a Jira for this and provide the info there.
>>>
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>>
>>>> I take it this is not expected behavior? I could be doing something
>>>> stupid. I only look in the "final" directory. Looking in the others with
>>>> clusterdump shows the same number of clusters and I assumed they were
>>>> iterations.
>>>>
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>>>
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak parameters
>>>>>> I get different numbers of clusters but it's usually less than the k I
>>>>>> pass in. Since I am not using canopies at present I would expect k to
>>>>>> always be honored but the quality of the clusters would depend on the
>>>>>> convergence amount and number of iterations allowed. No?
