Decreasing the block size might help. Set the block size to be much smaller than the size of the input file to KMeans.

________________________________________
From: DAN HELM [[email protected]]
Sent: Thursday, June 21, 2012 12:28 AM
To: [email protected]
Cc: [email protected]
Subject: Re: seq2parse works multicore , kmeans not
Jassin,

Out of curiosity, how many “part” files (vector files) were generated by the seq2sparse step for input to k-means?

I have been experimenting with the CVB clustering algorithm and also had issues where only one mapper was running. In my case the problem was the Mahout rowid command, which was needed to convert the output of seq2sparse to the form that CVB requires (i.e., keys had to be integers). It generates only a single output file, which resulted in only one mapper running for CVB. I modified the Mahout rowid software to generate “n” output files, controlled by a new parameter, so now I can have many mappers running at once to speed up the processing.

Maybe you are having a similar issue (i.e., only one input file being processed by k-means)?

Dan

________________________________
From: Jassin Meknassi <[email protected]>
To: [email protected]
Sent: Wednesday, June 20, 2012 5:59 PM
Subject: seq2parse works multicore , kmeans not

Hi,

I am running k-means clustering on a local Hadoop node with 16 cores (mapred-site.xml: https://gist.github.com/2962458).

Running seq2sparse on the input sequence files (originally 64k text documents with approx. 100 words each) uses all 16 cores when running over Hadoop/HDFS and takes about 20 min. Canopy is quick and gets me about 120 clusters.

Running kmeans takes ages, as only one map task is launched (https://gist.github.com/2962436). I am wondering what I might be doing wrong, since all cores are used in seq2sparse and not in kmeans.

I tried these settings in the bin/mahout script:

MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=16"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=16"

but that did not help. Not using Hadoop, by setting MAHOUT_LOCAL, gives the same result.

Thanks for helping
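
Both suggestions in this thread (a smaller block or split size, or more input part files) come down to how Hadoop turns input files into map tasks. The sketch below is a rough estimate only. It assumes the k-means job reads its vectors through Hadoop's new-API FileInputFormat, whose split size is max(minSize, min(maxSize, blockSize)), and that the sequence files are splittable. The function name estimate_splits and the 40 MB file size are hypothetical, chosen only for illustration; the thread does not state the actual size of the vector file.

```python
SPLIT_SLOP = 1.1  # FileInputFormat lets the final split run up to 10% over split_size
MB = 1024 * 1024


def estimate_splits(file_sizes, block_size, min_split=1, max_split=None):
    """Estimate the number of input splits (and so map tasks) for splittable files.

    Mirrors the rule split_size = max(min_split, min(max_split, block_size)).
    All sizes are in bytes. This is an illustration, not Hadoop's own code.
    """
    if max_split is None:
        max_split = block_size  # no cap configured: the block size decides
    split_size = max(min_split, min(max_split, block_size))

    total = 0
    for size in file_sizes:
        remaining = size
        # Cut full-size splits while clearly more than one split's worth remains.
        while remaining / split_size > SPLIT_SLOP:
            total += 1
            remaining -= split_size
        # Whatever is left (if anything) becomes the final split of this file.
        if remaining > 0:
            total += 1
    return total


# Hypothetical 40 MB vector file on a 64 MB block: a single split, a single mapper.
print(estimate_splits([40 * MB], block_size=64 * MB))

# Same file with the maximum split size capped at 2.5 MB.
print(estimate_splits([40 * MB], block_size=64 * MB, max_split=40 * MB // 16))

# Same data written as 16 part files of 2.5 MB each (the "n output files" approach).
print(estimate_splits([40 * MB // 16] * 16, block_size=64 * MB))
```

Under these assumptions, a file smaller than one block yields one map task regardless of mapred.map.tasks, which the new-API input formats do not consult. Capping the split size (mapred.max.split.size in Hadoop 1.x, mapreduce.input.fileinputformat.split.maxsize in later versions) or writing the input as several part files both raise the estimate.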
