You can also try finding initial clusters first with canopy clustering; it's a fast, single-pass clustering algorithm.
https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering

Canopy clustering can give you better initial clusters, which you can feed into kmeans for faster convergence. In my experience, canopy clustering on its own gave pretty good results even in a single pass (though that varies from case to case). Try to find good t1 and t2 values for Canopy with a few initial experiments (they depend on the distance measure you are using); you might then not need kmeans at all (again, depending on your needs and your data).
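
To make the roles of t1 and t2 concrete, here is a minimal Python sketch of the canopy idea. It is not Mahout's implementation: it assumes t1 > t2, Euclidean distance, and a handful of toy 2-D points, and the names `canopy_clusters` and `euclidean` are made up for illustration.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean):
    """Single pass over the data: t1 is the loose threshold, t2 the tight one."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unclaimed point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Loosely close: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Not tightly close: the point may still seed or join other canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Toy data (assumed for illustration only).
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 10.0), (10.5, 10.0)]
canopies = canopy_clusters(points, t1=3.0, t2=1.5)

# The mean of each canopy can serve as an initial centroid for kmeans.
initial_centroids = [
    tuple(sum(coords) / len(members) for coords in zip(*members))
    for _, members in canopies
]
```

A larger t2 removes more points per canopy and yields fewer canopies, so the number of initial clusters falls out of the thresholds rather than being fixed up front.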

Good luck.

On 13-09-2012 11:07, Elaine Gan wrote:
My -cd was quite loose, set it at 0.1

Hmm.. maybe the data is too small, causing the low performance..?


200 iterations?

What is your convergence delta? If it is too small for your distance measure 
you will perform all 200 iterations, every time you cluster.

   --convergenceDelta (-cd) convergenceDelta
           The convergence delta value.
            Default is 0.5

I would set the convergence delta looser and see whether 100 or even 20 iterations 
produce good results. You can always tune your other parameters first and then 
tighten the convergence delta if needed. Also remember that a good convergence 
delta depends on your distance measure, so you need to think about which distance 
measure works for your data.
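
As a rough illustration of how the convergence delta interacts with the iteration cap, here is a toy k-means loop in Python. It is a sketch, not Mahout's code: it assumes the run stops once no centroid moves by more than `convergence_delta` (measured with the chosen distance) or once `max_iter` is reached. The function names and the data are made up.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, convergence_delta, max_iter, distance=euclidean):
    """Return (final centroids, number of iterations actually run)."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        new_centroids = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
        # Largest centroid movement in this iteration.
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= convergence_delta:
            return centroids, iteration  # converged before max_iter
    return centroids, max_iter  # cap reached without converging


# Toy data (assumed for illustration only).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
start = [(0.0, 0.0), (2.0, 0.0)]

loose_centroids, loose_iters = kmeans(points, start, convergence_delta=3.0, max_iter=200)
tight_centroids, tight_iters = kmeans(points, start, convergence_delta=0.001, max_iter=200)
```

On this toy data the looser delta stops one iteration earlier than the tight one. On real data the gap can be much larger, and a delta that is too small for the distance measure means the loop only ends at `max_iter`.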

I generally only take 10-20 iterations using cosine distance and 0.001 as the 
convergence delta, which would be 20-40 minutes for you.
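
The runtime figures in this thread follow from simple multiplication. The short Python sketch below spells that out, assuming every iteration costs about the same; `estimate_minutes` is a made-up helper, and the per-iteration times are the 1.5 to 2 minutes reported in the original question.

```python
def estimate_minutes(iterations, minutes_per_iteration):
    """Total runtime if every iteration takes roughly the same time."""
    return iterations * minutes_per_iteration


# Worst case: all 200 iterations run at about 2 minutes each.
worst_case_minutes = estimate_minutes(200, 2.0)
worst_case_hours = worst_case_minutes / 60

# If the job converges after 10 to 20 iterations instead.
converged_low = estimate_minutes(10, 2.0)
converged_high = estimate_minutes(20, 2.0)

print(f"worst case: {worst_case_minutes:.0f} min ({worst_case_hours:.1f} h)")
print(f"converged:  {converged_low:.0f} to {converged_high:.0f} min")
```

So maxIter is only an upper bound on the runtime; how long the job really takes depends on how soon it converges.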

On Sep 12, 2012, at 7:26 PM, Elaine Gan wrote:

Hi,

I'm trying to do some text analysis using mahout kmeans (clustering),
processing the data on hadoop.
--numClusters = 160
--maxIter (-x) maxIter = 200

My data is small, around 500 MB.
I have 4 servers, each with 4 CPUs, and TaskTrackers are set to a maximum
of 4.
When I run the mahout task, I can see that there are at most 3 map tasks,
so I guess I do not need to do any tuning on this at the moment.

One iteration took around 1.5 to 2 minutes to finish.
I am not sure whether this is normal or considered slow. Can anyone
give me some advice on this?

And with x = 200, it took me around 200 x 2 mins = 400 mins (more than 6 hours)
to finish the whole analysis.
Is this unavoidable?
Does a bigger "x" always mean the kmeans job takes longer to finish?

Is there any way to speed up mahout kmeans?

Thank you.


