rowsimilarity not creating requested number of similar docs

Pat Ferrel Tue, 08 May 2012 10:07:30 -0700

Using the data set below I ran rowsimilarity asking for 20 similar docs but got anywhere from 1 to 20. Is this the expected behavior? It would be nice to get all 20 so I can see where the similarity starts to drop off.

mahout rowid -i b2/bixo-vectors/tfidf-vectors/part-r-00000 -o b2/bixo-matrix


  mahout rowsimilarity \
      -i b2/bixo-matrix/matrix \
      -o b2/bixo-similarity \
      -r 5250 \
      --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
      -m 20 \
      -ess true
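
One generic way a "top 20" list can come back short is sketched below. It assumes that pairs with zero similarity (rows sharing no terms) are never emitted, so a row can only list as many neighbors as it has rows with overlapping terms. Whether rowsimilarity behaves this way is an assumption, not something confirmed in this thread. The sketch is plain Python rather than Mahout code, and `tanimoto` and `top_similar` are hypothetical names.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two sparse vectors stored as {index: weight}."""
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    return dot / denom if denom else 0.0


def top_similar(rows, row_id, max_per_row):
    """Return up to max_per_row (similarity, other_id) pairs, best first.

    Assumption: pairs with zero similarity are dropped before the top-N cut.
    """
    target = rows[row_id]
    scored = []
    for other_id, other in rows.items():
        if other_id == row_id:
            continue  # skip self-similarity
        score = tanimoto(target, other)
        if score > 0:
            scored.append((score, other_id))
    return heapq.nlargest(max_per_row, scored)


rows = {
    0: {0: 1.0, 1: 1.0},
    1: {1: 1.0, 2: 1.0},
    2: {0: 1.0},
    3: {7: 1.0},  # shares no terms with any other row
}
neighbors_of_0 = top_similar(rows, 0, 20)
neighbors_of_3 = top_similar(rows, 3, 20)
```

Here row 0 asks for 20 neighbors but only two rows overlap with it, and row 3 gets none at all.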

Find the data here:

http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740


Using the same config as the kmeans example below.

I could file bugs but I'm not sure if this is a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:

BTW it seems odd that I get large numbers for distance from centroid using clustering. Shouldn't I expect small numbers for the closest docs? I have assumed the real distance is 1 - reported distance, but the distances reported by rowsimilarity are very small as I'd expect. I was using tanimoto in both cases as the distance measure but also tried cosine with similar results.
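
A minimal sketch of how the two numbers relate, assuming the usual definition distance = 1 - similarity for the Tanimoto coefficient. Under that assumption a small similarity corresponds to a large distance, not to a small one. This is plain Python, not Mahout code; the function names and sample vectors are hypothetical.

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient of two sparse vectors stored as {index: weight}."""
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0


def tanimoto_distance(a, b):
    """Distance under the assumed convention: 1 - similarity."""
    return 1.0 - tanimoto_similarity(a, b)


doc = {0: 1.0, 3: 2.0}
identical = {0: 1.0, 3: 2.0}
partial = {0: 1.0}       # shares one term with doc
unrelated = {5: 4.0}     # shares no terms with doc

sim_partial = tanimoto_similarity(doc, partial)
dist_partial = tanimoto_distance(doc, partial)
dist_identical = tanimoto_distance(doc, identical)
dist_unrelated = tanimoto_distance(doc, unrelated)
```

With these vectors the partially overlapping pair has a low similarity and a correspondingly high distance, while identical vectors sit at distance 0 and unrelated ones at distance 1.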
On 5/8/12 9:12 AM, Pat Ferrel wrote:
Here is a sample data set. In this case I asked for 30 and got 28, but in other cases the discrepancy has been greater, like ask for 200 and get 38, but that was for a much larger data set.
Running on my Mac laptop in a single-node pseudo-cluster: Hadoop 0.20.205, Mahout 0.6.
command line:

mahout kmeans \
    -i b2/bixo-vectors/tfidf-vectors/ \
    -c b2/bixo-kmeans-centroids \
    -cl \
    -o b2/bixo-kmeans-clusters \
    -k 30 \
    -ow \
    -cd 0.01 \
    -x 20 \
    -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
BTW when I run rowsimilarity asking for 20 similar docs I get a max of 20 but sometimes many fewer. Shouldn't this always return the requested number? I'll post this question again to the attention of the right person.
On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
I looked at the 0.6 version's code but was not able to find any reason.
If possible, can you share the data you are trying to cluster along with the execution parameters?
You can also open a Jira for this and provide the info there.

On 07-05-2012 19:45, Pat Ferrel wrote:
0.6
I take it this is not expected behavior? I could be doing something stupid. I only look in the "final" directory. Looking in the others with clusterdump shows the same number of clusters and I assumed they were iterations.
On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
Which version are you using? 0.6 or the current 0.7-snapshot?

On 07-05-2012 02:19, Pat Ferrel wrote:
What would cause kmeans to not return k clusters? As I tweak parameters I get different numbers of clusters but it's usually less than the k I pass in. Since I am not using canopies at present I would expect k to always be honored but the quality of the clusters would depend on the convergence amount and number of iterations allowed. No?
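
One generic way a k-means run can end with fewer than k clusters is sketched below: a centroid that is never the nearest one to any point keeps an empty cluster, and empty clusters may be left out of the output. Whether Mahout 0.6 does this is an assumption; the thread does not confirm a cause. The sketch is a plain-Python, one-dimensional Lloyd iteration, and `kmeans_1d` is a hypothetical name.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Run a simple 1-D k-means and return only the non-empty clusters."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point goes to its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # Assumption: empty clusters are dropped from the final result.
    return [c for c in clusters if c]


points = [0.9, 1.0, 1.1, 5.0, 5.1]
initial_centroids = [1.0, 5.0, 100.0]  # k = 3, third seed is far from all data
result = kmeans_1d(points, initial_centroids)
```

Here k is 3 but only two clusters ever receive points, so the result holds two clusters regardless of the iteration count or convergence threshold.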
