Mahout K-means has different behavior based on the number of mapping tasks

nikos Wed, 26 Sep 2012 10:10:39 -0700

I am seeing a strange situation when running Mahout K-means: using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10 MB, roughly 10,000 vectors).

When K-means is executed with a single mapper (the default, considering the Hadoop split size, which in my cluster is 128 MB), it reaches a given clustering result in 2 iterations (Case A). However, I wanted to test whether there would be any improvement or deterioration in the algorithm's execution speed when firing more mapping tasks (the Hadoop cluster has 6 nodes in total). I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make Mahout fire 2 mapping tasks (Case B). I did succeed in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-mapper execution. What I mean is that after close inspection of the clusterDump output for the first iteration in both cases, I found that in Case B some points were not assigned to their closest cluster.
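
To make the check described above concrete, here is a minimal sketch of how one could verify whether each point was assigned to its nearest initial centroid. It is not Mahout code: the helper names (nearest_centroid, misassigned), the toy data, and the choice of Euclidean distance are all assumptions for illustration. The distance function would need to match whatever distance measure the K-means job was actually run with, and parsing the clusterDump output into points and assignments is left out.

```python
import math

def euclidean(a, b):
    # Assumed distance measure; swap in the one the K-means job actually used.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids, distance=euclidean):
    """Return the index of the centroid closest to the point (first one wins on ties)."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def misassigned(points, assignments, centroids, distance=euclidean):
    """Return the indices of points whose recorded cluster is not the nearest centroid."""
    return [
        i
        for i, (point, cluster) in enumerate(zip(points, assignments))
        if nearest_centroid(point, centroids, distance) != cluster
    ]

# Hypothetical toy data standing in for the initial centroids and the
# first-iteration assignments read from a cluster dump.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.0), (2.0, 0.0)]
assignments = [0, 1, 1]  # the third point is deliberately recorded in the wrong cluster

print(misassigned(points, assignments, centroids))  # prints [2]
```

Running the same check against the first-iteration output of Case A and Case B, with identical initial centroids, would show exactly which points differ between the two runs.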

Could this behavior be explained by the existing Mahout K-means implementation?


Thanks in advance.
