On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:

> I was trying to run the MinHash algorithm on the Reuters data set, so I did
> the following before running MinHashDriver:
>
> - Get the Reuters dataset
> - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>   reuters-out from reuters-sgm (the downloaded archive)
> - Run seqdirectory to convert reuters-out to SequenceFile format
> - Run seq2sparse to convert SequenceFiles to sparse vector format
>
> I used these instructions from the K-means clustering wiki page.
>
> This is the command I used to run MinHashDriver:
>
> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
>
> The output file looks something like this:
>
> 106460162-207863047 /reut2-015.sgm-653.txt
> 106460162-207863047 /reut2-021.sgm-7.txt
> 106460162-207863047 /reut2-013.sgm-307.txt
> 106460162-207863047 /reut2-013.sgm-306.txt
> 106460162-207863047 /reut2-014.sgm-786.txt
> 106460162-207863047 /reut2-013.sgm-304.txt
> 106460162-207863047 /reut2-013.sgm-303.txt
> 106460162-207863047 /reut2-021.sgm-230.txt
> 106460162-207863047 /reut2-012.sgm-548.txt
> 106460162-207863047 /reut2-020.sgm-161.txt
> 106460162-207863047 /reut2-021.sgm-553.txt
> 106460162-207863047 /reut2-013.sgm-299.txt
> 106460162-207863047 /reut2-015.sgm-284.txt
> 106460162-207863047 /reut2-013.sgm-996.txt
> 106460162-207863047 /reut2-021.sgm-441.txt
> 106460162-207863047 /reut2-013.sgm-298.txt
> 106460162-207863047 /reut2-013.sgm-995.txt
> 106460162-207863047 /reut2-015.sgm-521.txt
> 106460162-207863047 /reut2-020.sgm-162.txt
> 106460162-207863047 /reut2-020.sgm-163.txt
> 106460162-207863047 /reut2-013.sgm-296.txt
> ...
> ...
>
> Is this the correct way of running MinHash?
>
> If yes, then I would update the wiki page
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
> the instructions.
>
> Otherwise, could someone tell me what I am doing wrong?

I haven't looked into the code, but I get similar outputs, so I assume it is
working. It might be good to incorporate this into build-reuters.sh, and also
to try it on some other input.

-Grant
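
For anyone who wants a quick sanity check on output like the listing quoted above, a small script along these lines might help. It is only a sketch, not part of Mahout: it assumes the MinHash output has already been dumped to plain text with one "cluster-id document-name" pair per line, as in the quoted sample. The function names group_by_cluster and summarize are made up for illustration.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each cluster ID to the list of document names assigned to it.

    Assumes each line looks like "<cluster-id> <document-name>", separated
    by whitespace, as in the sample output quoted in this thread.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or truncated lines such as "..."
        cluster_id, doc = parts
        clusters[cluster_id].append(doc.strip())
    return dict(clusters)


def summarize(clusters):
    """Return (number of clusters, size of the largest cluster)."""
    sizes = [len(docs) for docs in clusters.values()]
    return len(sizes), max(sizes, default=0)


if __name__ == "__main__":
    # Three lines copied from the quoted output.
    sample = [
        "106460162-207863047 /reut2-015.sgm-653.txt",
        "106460162-207863047 /reut2-021.sgm-7.txt",
        "106460162-207863047 /reut2-013.sgm-307.txt",
    ]
    print(summarize(group_by_cluster(sample)))
```

If nearly every document lands under a single cluster ID, or every document gets its own ID, that would be a hint to look at the driver's parameters or the input vectors before documenting the steps on the wiki.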
