Re: rowsimilarity not creating requested number of similar docs

Suneel Marthi Sat, 12 May 2012 19:10:41 -0700

This is not a bug. The similarity measure cuts off the results that are 
returned, so a row can end up with fewer than the requested number of 
similar docs.
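
As a rough illustration of why a cut-off yields fewer than the requested 
number, here is a small Python sketch. It is not Mahout's code: the function 
names, the `threshold` parameter, and the toy rows are all made up for the 
example, and it assumes that pairs with a similarity at or below the 
threshold (for instance rows sharing no terms) are simply never emitted.

```python
from heapq import nlargest


def tanimoto(a, b):
    """Tanimoto coefficient of two sparse vectors stored as {term: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    return dot / denom if denom else 0.0


def top_similar(rows, row_id, max_per_row, threshold=0.0):
    """Return at most max_per_row (similarity, other_id) pairs for row_id.

    Pairs whose similarity is not above `threshold` are dropped before the
    top-N selection, so the result can be shorter than max_per_row.
    """
    target = rows[row_id]
    kept = []
    for other_id, other in rows.items():
        if other_id == row_id:
            continue
        score = tanimoto(target, other)
        if score > threshold:  # the cut-off (an assumption of this sketch)
            kept.append((score, other_id))
    return nlargest(max_per_row, kept)


# Hypothetical toy rows: d4 shares no terms with any other row.
rows = {
    "d1": {"hadoop": 1.0, "mahout": 2.0},
    "d2": {"mahout": 2.0, "kmeans": 1.0},
    "d3": {"hadoop": 1.0},
    "d4": {"cooking": 3.0},
}

for row_id in rows:
    print(row_id, top_similar(rows, row_id, max_per_row=20))
```

Asking for 20 here returns two neighbours for d1 and none for d4, because 
the maximum is only an upper bound on what survives the cut-off.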




________________________________
 From: Pat Ferrel <[email protected]>
To: [email protected] 
Sent: Tuesday, May 8, 2012 1:06 PM
Subject: rowsimilarity not creating requested number of similar docs
 
Using the below data set I ran rowsimilarity asking for 20 similar docs but got 
anywhere from 1 to 20. Is this the expected behavior? It would be nice to get 
all 20 so I can see where the similarity starts to drop off.

  mahout rowid \
      -i b2/bixo-vectors/tfidf-vectors/part-r-00000 \
      -o b2/bixo-matrix

  mahout rowsimilarity \
      -i b2/bixo-matrix/matrix \
      -o b2/bixo-similarity \
      -r 5250 \
      --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT \
      -m 20 \
      -ess true

Find the data here:
http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
 

I'm using the same config as in the kmeans example below.

I could file bugs but I'm not sure if this is a bug or not.

On 5/8/12 9:19 AM, Pat Ferrel wrote:
> BTW it seems odd that I get large numbers for distance from centroid using 
> clustering. Shouldn't I expect small numbers for the closest docs? I have 
> assumed the real distance is 1-reported distance but the distances reported 
> by rowsimilarity are very small as I'd expect. I was using tanimoto in both 
> cases as the distance measure but also tried cosine with similar results.
> 
> On 5/8/12 9:12 AM, Pat Ferrel wrote:
>> Here is a sample data set. In this case I asked for 30 and got 28 but in 
>> other cases the discrepancy has been greater like ask for 200 and get 38 but 
>> that was for a much larger data set.
>> 
>> Running on my mac laptop in a single node pseudo cluster hadoop 0.20.205, 
>> mahout 0.6
>> 
>> command line:
>> 
>> mahout kmeans \
>>     -i b2/bixo-vectors/tfidf-vectors/ \
>>     -c b2/bixo-kmeans-centroids \
>>     -cl \
>>     -o b2/bixo-kmeans-clusters \
>>     -k 30 \
>>     -ow \
>>     -cd 0.01 \
>>     -x 20 \
>>     -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
>> 
>> Find the data here:
>> http://cloud.occamsmachete.com/apps/files_sharing/get.php?token=0b2dacddca05c0ee48cbebd05048434425b86740
>>  
>> 
>> BTW when I run rowsimilarity asking for 20 similar docs I get a max of 20 
>> but sometimes far fewer. Shouldn't this always return the requested number? 
>> I'll post this question again to get the attention of the right person.
>> 
>> On 5/8/12 6:15 AM, Paritosh Ranjan wrote:
>>> I looked at the 0.6 version's code but was not able to find any reason.
>>> If possible, can you share the data you are trying to cluster along with 
>>> the execution parameters?
>>> 
>>> You can also open a Jira for this and provide the info there.
>>> 
>>> On 07-05-2012 19:45, Pat Ferrel wrote:
>>>> 0.6
>>>> 
>>>> I take it this is not expected behavior? I could be doing something 
>>>> stupid. I only look in the "final" directory. Looking in the others with 
>>>> clusterdump shows the same number of clusters and I assumed they were 
>>>> iterations.
>>>> 
>>>> On 5/7/12 1:21 AM, Paritosh Ranjan wrote:
>>>>> Which version are you using? 0.6 or the current 0.7-snapshot?
>>>>> 
>>>>> On 07-05-2012 02:19, Pat Ferrel wrote:
>>>>>> What would cause kmeans to not return k clusters? As I tweak parameters 
>>>>>> I get different numbers of clusters but it's usually less than the k I 
>>>>>> pass in. Since I am not using canopies at present I would expect k to 
>>>>>> always be honored but the quality of the clusters would depend on the 
>>>>>> convergence amount and number of iterations allowed. No?
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>>
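
On the distance question quoted above (large distances from the centroid 
versus small similarities from rowsimilarity), the two numbers are 
complements of each other if the Tanimoto distance is taken as 1 minus the 
Tanimoto coefficient. The Python sketch below works through that 
relationship on made-up vectors; the function names, the document, and the 
centroid are hypothetical, and it is only meant to show the arithmetic, not 
how Mahout computes it internally.

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot) for sparse dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dot
    return dot / denom if denom else 0.0


def tanimoto_distance(a, b):
    """Distance taken as the complement of the similarity (sketch assumption)."""
    return 1.0 - tanimoto_similarity(a, b)


# Hypothetical document and a centroid that averages many unrelated docs,
# so its weight is spread thinly across terms the document does not contain.
doc = {"mahout": 2.0, "hadoop": 1.0}
centroid = {"mahout": 0.5, "hadoop": 0.25, "kmeans": 0.5, "cooking": 0.75}

similarity = tanimoto_similarity(doc, centroid)
distance = tanimoto_distance(doc, centroid)
print("similarity:", round(similarity, 4))
print("distance:  ", round(distance, 4))
print("self-distance:", tanimoto_distance(doc, doc))
```

Under this reading, a small similarity and a large distance describe the 
same pair: a sparse document rarely overlaps much with a centroid, so even 
the closest documents can sit well away from a distance of 0.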
