Re: Mahout K-means has different behavior based on the number of mapping tasks

paritosh ranjan Wed, 26 Sep 2012 10:44:01 -0700

By saying "Using a pre-selected set of initial centroids", do you mean
that the initial centroids were the same in both executions?
In other words, how are you choosing your initial centroids?
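
One way to confirm both points (same starting centroids, and whether any
first-iteration assignment is really wrong) is to recompute the nearest
centroid for each vector outside Mahout and compare it with the cluster dump
of each run. The sketch below is plain Python, not Mahout code. It assumes
Euclidean distance (the distance measure is not stated in the report) and
that the vectors, the initial centroids, and the first-iteration assignments
have already been exported into dictionaries. All names and the toy data are
hypothetical.

```python
import math


def euclidean(a, b):
    # Assumption: the job uses Euclidean distance; swap this function
    # if a different distance measure was configured.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, centroids):
    """Return the id of the centroid closest to the given point."""
    return min(centroids, key=lambda cid: euclidean(point, centroids[cid]))


def misassigned(points, centroids, assignments):
    """Return ids of points whose recorded cluster is not the nearest centroid."""
    return [
        pid
        for pid, vec in points.items()
        if assignments[pid] != nearest_centroid(vec, centroids)
    ]


# Hypothetical toy data standing in for the exported vectors and dumps.
centroids = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
points = {"p0": [1.0, 1.0], "p1": [9.0, 8.0], "p2": [2.0, 0.0]}
run_a = {"p0": "c0", "p1": "c1", "p2": "c0"}  # single-mapper run
run_b = {"p0": "c0", "p1": "c1", "p2": "c1"}  # two-mapper run

print(misassigned(points, centroids, run_a))  # []
print(misassigned(points, centroids, run_b))  # ['p2']
```

If the list is empty for one run and not for the other while the centroids
passed in are identical, the difference lies in the assignment step rather
than in the starting centroids.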


On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:

> I am seeing a strange situation when running Mahout K-means: Using a
> pre-selected set of initial centroids, I run K-means on a SequenceFile
> generated by lucene.vector. The run is for testing purposes, so the file is
> small (around 10 MB, roughly 10,000 vectors).
>
> When K-means is executed with a single mapper (the default considering the
> Hadoop split size which in my cluster is 128MB), it reaches a given
> clustering result in 2 iterations (Case A). However, I wanted to test if
> there would be any improvement/deterioration in the algorithm's execution
> speed by firing more mapping tasks (the Hadoop cluster has in total 6
> nodes). I therefore set the -Dmapred.max.split.size parameter to 5242880
> bytes, in order to make Mahout fire 2 mapping tasks (Case B). I indeed
> succeeded in starting two mappers, but the strange thing was that the job
> finished after 5 iterations instead of 2, and that even at the first
> assignment of points to clusters, the mappers made different choices
> compared to the single-map execution. What I mean is that after close
> inspection of the clusterDump for the first iteration in both cases, I
> found that in Case B some points were not assigned to their closest cluster.
>
> Could this behavior be justified by the existing K-means Mahout
> implementation?
>
> Thanks in advance.
>
>
>
