Jassin,
 
Out of curiosity, how many “part” files (vector files) did the seq2sparse
step generate as input to k-means?

I have been experimenting with the CVB clustering algorithm and also had a
problem where only one mapper was running. In my case the cause was the
Mahout rowid command, which is needed to convert the output of seq2sparse
into the form that CVB requires (i.e., the keys have to be integers). It
generates only a single output file, so only one mapper runs for CVB. I
modified the Mahout rowid code to generate “n” output files, controlled by
a new parameter, so now I can have many mappers running at once to speed up
the processing. Maybe you are having a similar issue (i.e., only one input
file being processed by k-means)?
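
To make the idea concrete, here is a rough sketch in plain Python. It is only a model of the approach, not the actual change to the Mahout rowid code: the function name rowid_partition and the num_parts parameter are made up for this illustration, and real input would be Hadoop sequence files rather than Python lists. It replaces each original key with a consecutive integer id and spreads the rows round-robin over several output parts, so that a later job has more than one input file to hand out to mappers.

```python
def rowid_partition(records, num_parts):
    """Model of a rowid-style conversion with several output parts.

    records   -- iterable of (original_key, vector) pairs
    num_parts -- hypothetical parameter: how many output parts to produce

    Returns (parts, index): parts is a list of num_parts lists holding
    (int_id, vector) pairs, and index maps each int_id back to its
    original key.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")

    parts = [[] for _ in range(num_parts)]
    index = []
    for row_id, (key, vector) in enumerate(records):
        # Round-robin assignment keeps the parts roughly equal in size.
        parts[row_id % num_parts].append((row_id, vector))
        index.append((row_id, key))
    return parts, index


if __name__ == "__main__":
    # Ten toy documents, each with a one-element vector.
    docs = [("/doc%d" % i, [float(i)]) for i in range(10)]
    parts, index = rowid_partition(docs, 4)
    for part_number, rows in enumerate(parts):
        print("part", part_number, "holds ids", [row_id for row_id, _ in rows])
```

With num_parts set to 1 this collapses to the single-output case, which is the situation that left only one mapper with work to do.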
 
Dan

________________________________
 From: Jassin Meknassi <[email protected]>
To: [email protected] 
Sent: Wednesday, June 20, 2012 5:59 PM
Subject: seq2parse works multicore , kmeans not
  
Hi,

I am running k-means clustering on a local Hadoop node with 16 cores
(mapred-site.xml: https://gist.github.com/2962458).

Running seq2sparse on the input sequence files (originally 64k text
documents with approx. 100 words each) uses all 16 cores when running
over Hadoop/HDFS and takes about 20 min.

Canopy is quick and gets me about 120 clusters.

Running k-means takes ages, as only one map task is launched
(https://gist.github.com/2962436).

I am wondering what I might be doing wrong, since all cores are used in
seq2sparse but not in k-means.

I tried these settings in the bin/mahout script:
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.map.tasks=16"
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.reduce.tasks=16"

but that did not help.

Running without Hadoop by setting MAHOUT_LOCAL gives the same result.

Thanks for helping
