My -cd was already quite loose; I had set it to 0.1.
Hmm, maybe the data set is too small, and that is what causes the low performance?
> 200 iterations?
>
> What is your convergence delta? If it is too small for your distance measure
> you will perform all 200 iterations, every time you cluster.
>
> --convergenceDelta (-cd) convergenceDelta
> The convergence delta value.
> Default is 0.5
>
> I would set the convergence delta looser and see if 100 or even 20 iterations
> produces good results. You can always tweak your other parameters to get them
> tuned and up your convergence if needed. Also remember that a good
> convergence is related to your distance measure so you need to think about
> which distance measure works for your data.
>
> I generally only take 10-20 iterations using cosine distance and 0.001 as the
> convergence delta, which would be 20-40 minutes for you.
>
> On Sep 12, 2012, at 7:26 PM, Elaine Gan <[email protected]> wrote:
>
> Hi,
>
> I'm trying to do some text analysis using mahout kmeans (clustering),
> processing the data on hadoop.
> --numClusters = 160
> --maxIter (-x) maxIter = 200
>
> Well my data is small, around 500MB .
> I have 4 servers, each with 4CPU and TaskTrackers are set to 4 as
> maximum.
> When i run the mahout task, i can see that the number of map tasks are
> the most 3, so i guess i do not need to do any tuning on this at this
> moment.
>
> One iteration took around 1.5mins ~ 2mins to finish.
> I am not sure whether this is normal or is it consider slow, can anyone
> gives me an advice on this?
>
> And with x = 200, it tooks me around 200x2mins = 6 hours
> to finish the whole analysis..
> Is it something which is unavoided?
> The bigger the "x" is, the longer time it takes to finish the kmeans job?
>
> Any ways to improve on the mahout kmeans to speed it up?
>
> Thank you.
>
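
To make the interaction between -cd and -x concrete, here is a minimal Python sketch of a k-means loop that stops either when no center moves more than the convergence delta or when the iteration cap is reached. This is not Mahout's code: the function names, the toy data, and the exact stopping rule (largest center shift <= delta, measured with the same distance function) are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centers, convergence_delta, max_iter, distance=euclidean):
    """Toy k-means. Returns (final_centers, iterations_run).

    Assumed stopping rule (illustrative, not taken from Mahout's source):
    stop as soon as the largest center movement in one iteration is
    <= convergence_delta, or after max_iter iterations, whichever is first.
    """
    centers = [list(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)

        # Recompute each center as the mean of its members;
        # an empty cluster keeps its old center.
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)

        # Largest movement of any center during this iteration.
        shift = max(distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            return centers, iteration
    return centers, max_iter


# Hypothetical toy data: two obvious clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
start = [(0.0, 0.0), (10.0, 10.0)]

_, loose_iters = kmeans(points, start, convergence_delta=0.5, max_iter=200)
_, tight_iters = kmeans(points, start, convergence_delta=0.1, max_iter=200)
print(loose_iters, tight_iters)
```

In this sketch the delta is compared against a distance, so a sensible value depends on the distance measure in use. If a job really does run every one of its 200 iterations, the delta is never being met and the wall-clock time is roughly the iteration count times the per-iteration time.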
