I was trying to run the MinHash algorithm on the Reuters data set, so I did
the following before running MinHashDriver:

   - Get the Reuters dataset
   - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
   reuters-out from reuters-sgm (the downloaded archive)
   - Run seqdirectory to convert reuters-out to SequenceFile format
   - Run seq2sparse to convert SequenceFiles to sparse vector format

I used these instructions from the K-means clustering wiki page.

This is the command I used to run MinHashDriver:

./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
/home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash

The output file looks something like this:

106460162-207863047 /reut2-015.sgm-653.txt
106460162-207863047 /reut2-021.sgm-7.txt
106460162-207863047 /reut2-013.sgm-307.txt
106460162-207863047 /reut2-013.sgm-306.txt
106460162-207863047 /reut2-014.sgm-786.txt
106460162-207863047 /reut2-013.sgm-304.txt
106460162-207863047 /reut2-013.sgm-303.txt
106460162-207863047 /reut2-021.sgm-230.txt
106460162-207863047 /reut2-012.sgm-548.txt
106460162-207863047 /reut2-020.sgm-161.txt
106460162-207863047 /reut2-021.sgm-553.txt
106460162-207863047 /reut2-013.sgm-299.txt
106460162-207863047 /reut2-015.sgm-284.txt
106460162-207863047 /reut2-013.sgm-996.txt
106460162-207863047 /reut2-021.sgm-441.txt
106460162-207863047 /reut2-013.sgm-298.txt
106460162-207863047 /reut2-013.sgm-995.txt
106460162-207863047 /reut2-015.sgm-521.txt
106460162-207863047 /reut2-020.sgm-162.txt
106460162-207863047 /reut2-020.sgm-163.txt
106460162-207863047 /reut2-013.sgm-296.txt
...
...
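
To sanity-check output like the listing above, a minimal Python sketch (not part of Mahout) can group the lines by their first column and count the documents per group. It assumes the output has been dumped to plain text with one whitespace-separated pair per line, and that the first column is the cluster ID and the second the document name. The function name `group_by_cluster` and the three sample lines (copied from the listing) are only for illustration.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each first-column ID to the list of document names listed under it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        # Skip blank lines and anything that is not an "<id> <document>" pair.
        if len(parts) != 2:
            continue
        cluster_id, doc = parts
        clusters[cluster_id].append(doc)
    return dict(clusters)


# Three lines taken from the listing; a real run would read the dumped file.
sample = [
    "106460162-207863047 /reut2-015.sgm-653.txt",
    "106460162-207863047 /reut2-021.sgm-7.txt",
    "106460162-207863047 /reut2-013.sgm-307.txt",
]

clusters = group_by_cluster(sample)
for cluster_id, docs in clusters.items():
    print(cluster_id, len(docs))
```

If nearly every document ends up under a single ID, that may be a hint that the clustering parameters need tuning rather than that the run itself failed.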


Is this the correct way of running MinHash?

If yes, then I will update the wiki page
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
the instructions.

Otherwise, could someone tell me what I am doing wrong?

-- 
Regards,
Varun Thacker
http://varunthacker.wordpress.com
