Re: Mahout K-means has different behavior based on the number of mapping tasks

nikos Wed, 26 Sep 2012 10:53:21 -0700

The centroids were selected in a previous execution of Mahout K-means via the randomSeed generator.


On 09/26/2012 08:43 PM, paritosh ranjan wrote:

By saying "Using a pre-selected set of initial centroids", do you mean
that the initial centroids were the same in both executions?
In other words, how are you choosing your initial centroids?


On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:

I am seeing strange behavior when running Mahout K-means. Using a
pre-selected set of initial centroids, I run K-means on a SequenceFile
generated by lucene.vector. The run is for testing purposes, so the file is
small (around 10 MB, roughly 10,000 vectors).

When K-means is executed with a single mapper (the default, given that the
Hadoop split size in my cluster is 128 MB), it reaches a given clustering
result in 2 iterations (Case A). However, I wanted to test whether firing
more mapping tasks would improve or degrade the algorithm's execution speed
(the Hadoop cluster has 6 nodes in total). I therefore set the
-Dmapred.max.split.size parameter to 5242880 bytes (5 MB), in order to make
Mahout fire 2 mapping tasks (Case B). I did succeed in starting two
mappers, but the strange thing was that the job finished after 5 iterations
instead of 2, and that even at the first assignment of points to clusters,
the mappers made different choices compared to the single-mapper execution.
What I mean is that, after close inspection of the clusterDump output of
the first iteration for both cases, I found that in Case B some points were
not assigned to their closest cluster.
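
As a rough illustration of why the 5242880-byte setting yields two mappers,
the following Python sketch mimics the split-count arithmetic that Hadoop's
FileInputFormat is generally understood to use. The function names, the
file size of exactly 10 MiB, and the minimum split size of 1 byte are
assumptions made for this example and are not taken from the thread; the
real SequenceFile size may differ slightly.

```python
# Sketch of Hadoop-style input split counting (assumed logic, not Mahout code).

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the split size


def compute_split_size(block_size, min_size, max_size):
    """Effective split size: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Number of input splits (and hence map tasks) for one file."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while the remainder is clearly larger
    # than one split.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


MIB = 1024 * 1024
file_size = 10 * MIB      # assumption: the "around 10 MB" file is exactly 10 MiB
block_size = 128 * MIB    # split size reported for the cluster

# Case A: no maximum set, so the 128 MB block size dominates.
case_a = count_splits(file_size, compute_split_size(block_size, 1, 2**63 - 1))

# Case B: maximum split size capped at 5242880 bytes.
case_b = count_splits(file_size, compute_split_size(block_size, 1, 5242880))

print(case_a, case_b)
```

Under these assumptions the file fits in a single split in Case A and is
cut into two splits in Case B, which matches the number of mappers
described above.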

Could this behavior be explained by the existing Mahout K-means
implementation?
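
For reference, here is a minimal Python sketch of a single K-means
iteration written in a map/reduce style. It is not Mahout's implementation;
the helper names and the use of squared Euclidean distance are assumptions
for illustration (the thread does not say which distance measure was
configured). It shows why, in principle, the first assignment step should
not depend on how the input is partitioned: each point is compared against
the same initial centroids regardless of which mapper reads it.

```python
# Sketch of one map/reduce-style K-means iteration (illustrative only).


def nearest(point, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partial(points, centroids):
    """One 'mapper': per-cluster (count, component sums) for its chunk."""
    partial = {}
    for pt in points:
        k = nearest(pt, centroids)
        count, sums = partial.get(k, (0, [0.0] * len(pt)))
        partial[k] = (count + 1, [s + x for s, x in zip(sums, pt)])
    return partial


def reduce_partials(partials, centroids):
    """The 'reducer': merge partial sums and recompute each centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        relevant = [p[k] for p in partials if k in p]
        count = sum(c for c, _ in relevant)
        if count == 0:
            # Empty cluster: keep the previous centroid.
            new_centroids.append(list(old))
            continue
        sums = [sum(s[d] for _, s in relevant) for d in range(len(old))]
        new_centroids.append([s / count for s in sums])
    return new_centroids


def kmeans_iteration(points, centroids, num_splits):
    """Run one iteration with the points divided into num_splits chunks."""
    size = -(-len(points) // num_splits)  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    partials = [map_partial(chunk, centroids) for chunk in chunks]
    return reduce_partials(partials, centroids)


points = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
initial = [[0, 0], [10, 10]]

one_mapper = kmeans_iteration(points, initial, 1)
two_mappers = kmeans_iteration(points, initial, 2)
print(one_mapper == two_mappers)
```

With this toy data both runs produce the same centroids. On real data,
floating-point summation order could introduce tiny differences between
runs, but it would not be expected to assign a point to a cluster that is
clearly not its closest one.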

Thanks in advance.
