What distance measure?

On Sep 12, 2012, at 10:37 PM, Elaine Gan <[email protected]> wrote:

My -cd was quite loose; I set it to 0.1.

Hmm, maybe the data is too small, and that is causing the low performance?
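
Whether a -cd of 0.1 counts as loose depends on the distance measure, since the delta is compared against a distance between the old and new cluster centers. The sketch below is only an illustration in plain Python, not Mahout code: the two centroid vectors and the helper function names are made up for the example. It shows the same centroid movement passing a 0.1 threshold under cosine distance while failing it under Euclidean distance.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 to 2, ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical centroid before and after one iteration.
old_centroid = [3.0, 4.0, 0.0]
new_centroid = [3.0, 4.0, 1.0]

CONVERGENCE_DELTA = 0.1  # the -cd value mentioned in the thread

euclid_shift = euclidean_distance(old_centroid, new_centroid)
cosine_shift = cosine_distance(old_centroid, new_centroid)

# The same movement is "large" for one measure and "small" for the other.
print("euclidean shift:", euclid_shift, "converged:", euclid_shift <= CONVERGENCE_DELTA)
print("cosine shift:   ", cosine_shift, "converged:", cosine_shift <= CONVERGENCE_DELTA)
```

So the answer to "is 0.1 loose?" cannot be given without knowing which measure the job was run with.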


> 200 iterations?
> 
> What is your convergence delta? If it is too small for your distance measure 
> you will perform all 200 iterations, every time you cluster. 
> 
>  --convergenceDelta (-cd) convergenceDelta
>          The convergence delta value.
>          Default is 0.5
> 
> I would set the convergence delta looser and see if 100 or even 20 iterations 
> produce good results. You can always tweak your other parameters to get them 
> tuned and then tighten your convergence delta if needed. Also remember that a 
> good convergence delta depends on your distance measure, so you need to think 
> about which distance measure works for your data.
> 
> I generally only take 10-20 iterations using cosine distance and 0.001 as the 
> convergence delta, which would be 20-40 minutes for you.
> 
> On Sep 12, 2012, at 7:26 PM, Elaine Gan <[email protected]> wrote:
> 
> Hi,
> 
> I'm trying to do some text analysis using Mahout k-means (clustering),
> processing the data on Hadoop, with these settings:
> --numClusters = 160
> --maxIter (-x) maxIter = 200
> 
> My data is small, around 500 MB.
> I have 4 servers, each with 4 CPUs, and the TaskTrackers are set to a
> maximum of 4.
> When I run the Mahout job, I can see that there are at most 3 map
> tasks, so I guess I do not need to do any tuning on this at the
> moment.
> 
> One iteration took around 1.5 to 2 minutes to finish.
> I am not sure whether this is normal or considered slow. Can anyone
> give me some advice on this?
> 
> And with x = 200, it took me around 200 x 2 mins = 400 mins (over 6 hours)
> to finish the whole analysis.
> Is this unavoidable?
> Does a bigger "x" mean the k-means job takes longer to finish?
> 
> Are there any ways to speed up Mahout k-means?
> 
> Thank you.
> 
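
The relationship between maxIter and the convergence delta discussed in this thread can be sketched as a small k-means loop. This is a minimal illustration in plain Python, not Mahout's implementation: the function names, the toy one-dimensional data, and the rule "stop when the largest centroid shift is at or below the delta" are assumptions made for the example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def kmeans(points, centroids, max_iter, convergence_delta, distance=euclidean):
    """Run k-means; return (centroids, number of iterations actually run)."""
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
        # Convergence check: how far did the centroids move this round?
        largest_shift = max(distance(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if largest_shift <= convergence_delta:
            break  # converged early; remaining iterations are skipped
    return centroids, iterations

# Toy data: two obvious groups, with deliberately poor starting centroids.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
start = [[0.0], [1.0]]

tight_centroids, tight_iters = kmeans(points, start, max_iter=200, convergence_delta=0.001)
loose_centroids, loose_iters = kmeans(points, start, max_iter=200, convergence_delta=10.0)

print("tight delta:", tight_iters, "iterations, centroids", tight_centroids)
print("loose delta:", loose_iters, "iterations, centroids", loose_centroids)
```

In this sketch maxIter is only an upper bound: the loop exits as soon as the centroids stop moving by more than the delta, so a larger "x" costs more time only when that point is never reached. A delta that is too loose stops early but can leave poor centroids, which is the tradeoff described above. At the reported 2 minutes per iteration, 20 iterations would be about 40 minutes, against 400 minutes for all 200.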

