Also, with 500MB of data, this is likely to take only a few minutes on a single machine with the new clustering stuff. It is hard to estimate precisely, however, because dense and sparse data behave differently.
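
To make Pat's point in the quoted message below about the convergence delta concrete, here is a rough sketch in plain Python. It is not Mahout's code; the function names, the toy data and the exact stopping rule (stop once no center has moved by more than the delta, measured with the chosen distance) are my assumptions for illustration only. It shows why the same delta means something different for each distance measure, and why a delta that can never be met makes the job run all maxIter iterations.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 0.0

def kmeans(points, centers, distance, convergence_delta, max_iter):
    """Toy k-means; returns (final centers, number of iterations actually run)."""
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its points (keep it if empty).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        # Assumed stopping rule: largest movement of any center this iteration.
        shift = max(distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            return centers, iteration
    return centers, max_iter

# Hypothetical toy data: two obvious groups of 2-d vectors.
points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
start = [[1.0, 0.0], [0.0, 1.0]]

for delta in (0.5, 0.001, -1.0):  # -1.0 can never be satisfied
    _, iterations = kmeans(points, start, cosine_distance, delta, max_iter=200)
    print(f"delta={delta}: stopped after {iterations} iteration(s)")
```

Cosine distance only ranges from 0 to 2 (0 to 1 for non-negative vectors), so a delta of 0.5 is very loose for it, while the same 0.5 could be tight for a measure with a larger scale. That is why the delta and the distance measure have to be chosen together.
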
On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel <[email protected]> wrote:
> 200 iterations?
>
> What is your convergence delta? If it is too small for your distance
> measure you will perform all 200 iterations, every time you cluster.
>
> --convergenceDelta (-cd) convergenceDelta
> The convergence delta value.
> Default is 0.5
>
> I would set the convergence delta looser and see if 100 or even 20
> iterations produces good results. You can always tweak your other
> parameters to get them tuned and up your convergence if needed. Also
> remember that a good convergence is related to your distance measure so you
> need to think about which distance measure works for your data.
>
> I generally only take 10-20 iterations using cosine distance and 0.001 as
> the convergence delta, which would be 20-40 minutes for you.
>
> On Sep 12, 2012, at 7:26 PM, Elaine Gan <[email protected]> wrote:
>
> Hi,
>
> I'm trying to do some text analysis using mahout kmeans (clustering),
> processing the data on hadoop.
> --numClusters = 160
> --maxIter (-x) maxIter = 200
>
> Well my data is small, around 500MB .
> I have 4 servers, each with 4CPU and TaskTrackers are set to 4 as
> maximum.
> When i run the mahout task, i can see that the number of map tasks are
> the most 3, so i guess i do not need to do any tuning on this at this
> moment.
>
> One iteration took around 1.5mins ~ 2mins to finish.
> I am not sure whether this is normal or is it consider slow, can anyone
> gives me an advice on this?
>
> And with x = 200, it tooks me around 200x2mins = 6 hours
> to finish the whole analysis..
> Is it something which is unavoided?
> The bigger the "x" is, the longer time it takes to finish the kmeans job?
>
> Any ways to improve on the mahout kmeans to speed it up?
>
> Thank you.
