Re: MinHash Clustering in Mahout

Grant Ingersoll Wed, 23 Nov 2011 06:30:34 -0800

I modified VectorDump (on trunk) to take a --filter option so that one can 
print out specific named vectors.  Here are the vectors for the two items you 
mention, after running them through seq2sparse as configured in 
cluster-reuters.sh:


/reut2-015.sgm-653.txt:{34519:5.532691955566406,3950:4.322702884674072,33805:5.982217311859131,39687:2.6266393661499023,1982:2.714293956756592,7424:4.58083438873291,8110:9.033519744873047,19509:8.782204627990723,14143:2.336308717727661,24982:4.279929161071777,12254:3.6424925327301025,16280:4.5305399894714355}
/reut2-021.sgm-7.txt:{3730:3.051023483276367,7391:7.888387203216553,36570:9.880817413330078,30839:3.5672693252563477,27570:4.24602746963501,20512:3.8622241020202637,20510:6.3944621086120605,5636:9.593134880065918,28018:6.200305938720703,32492:4.762823581695557,5703:3.9078562259674072,2962:2.807265043258667,41625:2.2218031883239746}
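
A quick way to compare the two vectors is to look at the overlap of their term indices. The sketch below is only an illustration: treating each document as the set of its term indices (ignoring the tf-idf weights) is an assumption about how similarity should be measured, and the names doc_a, doc_b and jaccard are made up for this example. It does not reproduce what MinHashDriver does internally.

```python
# Term indices copied from the two vectors above; the tf-idf weights are dropped.
# doc_a = /reut2-015.sgm-653.txt, doc_b = /reut2-021.sgm-7.txt
doc_a = {34519, 3950, 33805, 39687, 1982, 7424, 8110, 19509,
         14143, 24982, 12254, 16280}
doc_b = {3730, 7391, 36570, 30839, 27570, 20512, 20510, 5636,
         28018, 32492, 5703, 2962, 41625}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


shared = doc_a & doc_b
print("shared term indices:", sorted(shared))
print("jaccard:", jaccard(doc_a, doc_b))
```

The two index sets have no element in common, so under this set-of-indices view the Jaccard similarity of the two documents is 0.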


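For the question quoted below about how MinHash relates to document similarity, here is a minimal, generic sketch of the technique: each of several hash functions keeps the minimum hash over a document's items, and the fraction of positions where two signatures agree estimates the Jaccard similarity. Everything here (the function names, the linear hash family, num_hashes, seed, the toy sets) is an assumption made for illustration; it is not Mahout's MinHashDriver code and says nothing about which vector fields Mahout hashes.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; items are assumed to be ints in [0, PRIME)


def minhash_signature(items, num_hashes=64, seed=42):
    """Return one minimum per hash function h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)  # same seed => same hash functions for every document
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Toy documents, each represented as a set of integer item ids.
doc_x = {1, 2, 3, 4}
doc_y = {3, 4, 5, 6}   # shares 2 of 6 distinct items with doc_x
doc_z = {7, 8, 9}      # shares nothing with doc_x

sig_x = minhash_signature(doc_x)
sig_y = minhash_signature(doc_y)
sig_z = minhash_signature(doc_z)

print("x vs x:", estimate_jaccard(sig_x, sig_x))
print("x vs y:", estimate_jaccard(sig_x, sig_y))
print("x vs z:", estimate_jaccard(sig_x, sig_z))
```

Because the estimate depends on which items are fed to the hash functions, two documents only look similar here if the hashed item sets overlap.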

On Oct 26, 2011, at 10:51 AM, Suneel Marthi wrote:

> I am still trying to fully understand the MinHash algorithm, and I got the 
> same results as below when running the MinHashDriver.
> 
> I have a use case wherein I need to determine the content similarity of 2 
> documents like what's been described in Andrei Broder's paper - 'Identifying 
> and Filtering Near-Duplicate Documents' 
> (http://dl.acm.org/citation.cfm?id=736184).  
> 
> I started dissecting the clusters generated by Mahout's MinHashDriver to 
> compare document content equality and to determine how accurate the 
> clustering was.
> I do see that the first 2 files from the output below were put in the same 
> cluster 106460162-207863047, though the actual text content in both 
> files is different.  How?
> 
> I am assuming that the NGram attribute was set to the default value of 1 when 
> creating the tf-idf vectors from sequence files.  
> 
> Suneel
> 
> 
> 
> ________________________________
> From: Grant Ingersoll <[email protected]>
> To: [email protected]
> Sent: Tuesday, October 25, 2011 5:55 AM
> Subject: Re: MinHash Clustering in Mahout
> 
> 
> On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:
> 
>> I was trying to run the MinHash algorithm on the Reuters data set, so I did
>> the following before running MinHashDriver
>> 
>>    - Get the Reuters dataset
>>    - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>>    reuters-out from reuters-sgm (the downloaded archive)
>>    - Run seqdirectory to convert reuters-out to SequenceFile format
>>    - Run seq2sparse to convert SequenceFiles to sparse vector format
>> 
>> I used these instructions from the K-means clustering wiki page.
>> 
>> This is the command I used to run MinHashDriver
>> 
>> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
>> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
>> 
>> The output file looks something like this:
>> 
>> 106460162-207863047 /reut2-015.sgm-653.txt
>> 106460162-207863047 /reut2-021.sgm-7.txt
>> 106460162-207863047 /reut2-013.sgm-307.txt
>> 106460162-207863047 /reut2-013.sgm-306.txt
>> 106460162-207863047 /reut2-014.sgm-786.txt
>> 106460162-207863047 /reut2-013.sgm-304.txt
>> 106460162-207863047 /reut2-013.sgm-303.txt
>> 106460162-207863047 /reut2-021.sgm-230.txt
>> 106460162-207863047 /reut2-012.sgm-548.txt
>> 106460162-207863047 /reut2-020.sgm-161.txt
>> 106460162-207863047 /reut2-021.sgm-553.txt
>> 106460162-207863047 /reut2-013.sgm-299.txt
>> 106460162-207863047 /reut2-015.sgm-284.txt
>> 106460162-207863047 /reut2-013.sgm-996.txt
>> 106460162-207863047 /reut2-021.sgm-441.txt
>> 106460162-207863047 /reut2-013.sgm-298.txt
>> 106460162-207863047 /reut2-013.sgm-995.txt
>> 106460162-207863047 /reut2-015.sgm-521.txt
>> 106460162-207863047 /reut2-020.sgm-162.txt
>> 106460162-207863047 /reut2-020.sgm-163.txt
>> 106460162-207863047 /reut2-013.sgm-296.txt
>> ...
>> ...
>> 
>> 
>> Is this the correct way of running MinHash?
>> 
>> If yes then I would update the wiki page
>> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
>> the instructions.
>> 
>> Otherwise, could someone tell me what I am doing wrong?
> 
> I haven't looked into the code, but I get similar outputs, so I assume it is 
> working.  Might be good to incorporate this into the build-reuters.sh as well 
> as try it on some other input.
> 
> -Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
