I was trying to run the MinHash algorithm on the Reuters data set, so I did the following before running MinHashDriver:

- Get the Reuters dataset
- Run org.apache.lucene.benchmark.utils.ExtractReuters to generate reuters-out from reuters-sgm (the downloaded archive)
- Run seqdirectory to convert reuters-out to SequenceFile format
- Run seq2sparse to convert the SequenceFiles to sparse vector format

I took these instructions from the K-means clustering wiki page.

This is the command I used to run MinHashDriver:

./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash

The output file looks something like this:

106460162-207863047 /reut2-015.sgm-653.txt
106460162-207863047 /reut2-021.sgm-7.txt
106460162-207863047 /reut2-013.sgm-307.txt
106460162-207863047 /reut2-013.sgm-306.txt
106460162-207863047 /reut2-014.sgm-786.txt
106460162-207863047 /reut2-013.sgm-304.txt
106460162-207863047 /reut2-013.sgm-303.txt
106460162-207863047 /reut2-021.sgm-230.txt
106460162-207863047 /reut2-012.sgm-548.txt
106460162-207863047 /reut2-020.sgm-161.txt
106460162-207863047 /reut2-021.sgm-553.txt
106460162-207863047 /reut2-013.sgm-299.txt
106460162-207863047 /reut2-015.sgm-284.txt
106460162-207863047 /reut2-013.sgm-996.txt
106460162-207863047 /reut2-021.sgm-441.txt
106460162-207863047 /reut2-013.sgm-298.txt
106460162-207863047 /reut2-013.sgm-995.txt
106460162-207863047 /reut2-015.sgm-521.txt
106460162-207863047 /reut2-020.sgm-162.txt
106460162-207863047 /reut2-020.sgm-163.txt
106460162-207863047 /reut2-013.sgm-296.txt
...
...

Is this the correct way of running MinHash? If yes, I will update the wiki page https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with these instructions. Otherwise, could someone tell me what I am doing wrong?

--
Regards,
Varun Thacker
http://varunthacker.wordpress.com
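
P.S. For anyone who wants to retrace the steps, here is a rough sketch of the whole sequence as a script. Only the MinHashDriver line is the command quoted above. Everything else is an assumption for illustration: the reuters-sgm, reuters-out and seqdir locations under WORK_DIR, the bare -i/-o options passed to seqdirectory and seq2sparse, and the run/pipeline helper names. It prints the commands by default and only executes them when DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the Reuters preparation steps followed by the MinHash run.
# Directory names under WORK_DIR (except sparse/ and minhash/) are assumed.
MAHOUT=${MAHOUT:-./mahout}
WORK_DIR=${WORK_DIR:-/home/varun/mahout}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands only, 0 = also execute them

# Print a command, and execute it unless this is a dry run.
run() {
  echo "$@"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

pipeline() {
  # 1. Extract the downloaded .sgm files into one text file per article.
  run "$MAHOUT" org.apache.lucene.benchmark.utils.ExtractReuters \
      "$WORK_DIR/reuters-sgm" "$WORK_DIR/reuters-out" &&
  # 2. Convert the text files to SequenceFile format.
  run "$MAHOUT" seqdirectory -i "$WORK_DIR/reuters-out" -o "$WORK_DIR/seqdir" &&
  # 3. Convert the SequenceFiles to sparse vectors (writes tfidf-vectors/).
  run "$MAHOUT" seq2sparse -i "$WORK_DIR/seqdir" -o "$WORK_DIR/sparse" &&
  # 4. Run MinHash on the TF-IDF vectors (the command from this mail).
  run "$MAHOUT" org.apache.mahout.clustering.minhash.MinHashDriver \
      --input "$WORK_DIR/sparse/tfidf-vectors/" -o "$WORK_DIR/minhash"
}

pipeline
```

Running it as is only lists the four commands, which makes it easy to compare against what the wiki page should say before anything is executed.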
