Re: Mahout K-means has different behavior based on the number of mapping tasks

paritosh ranjan Wed, 26 Sep 2012 11:34:20 -0700

And was the same set of centroids used for both executions?

On Wed, Sep 26, 2012 at 11:22 PM, nikos <[email protected]> wrote:


> The centroids were selected in a previous execution of Mahout K-means
> via the randomSeed generator.
>
>
> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>
>> By saying "Using a pre-selected set of initial centroids" do you mean
>> that the initial centroids were the same in both executions?
>> In other words, how are you choosing your initial centroids?
>>
>> On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:
>>
>>> I am seeing a strange situation when running Mahout K-means: using a
>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>> generated by lucene.vector. The run is for testing purposes, so the file
>>> is small (around 10 MB, roughly 10,000 vectors).
>>>
>>> When K-means is executed with a single mapper (the default, given the
>>> Hadoop split size, which in my cluster is 128 MB), it reaches a given
>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>> whether firing more mapping tasks would improve or degrade the
>>> algorithm's execution speed (the Hadoop cluster has 6 nodes in total).
>>> I therefore set the -Dmapred.max.split.size parameter to 5242880
>>> bytes, in order to make Mahout fire 2 mapping tasks (Case B). I did
>>> succeed in starting two mappers, but the strange thing was that the job
>>> finished after 5 iterations instead of 2, and that even at the first
>>> assignment of points to clusters, the mappers made different choices
>>> compared to the single-mapper execution. What I mean is that after close
>>> inspection of the clusterDump output for the first iteration in both
>>> cases, I found that in Case B some points were not assigned to their
>>> closest cluster.
>>>
>>> Could this behavior be explained by the existing Mahout K-means
>>> implementation?
>>>
>>> Thanks in advance.
>>>
>>>
>>>
>>>
>
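
For reference, here is a minimal sketch of why the first assignment would
not be expected to depend on the number of mappers. It is plain Python, not
Mahout code: the function names (nearest, map_partition, one_iteration) and
the random test data are made up for illustration. Each "mapper" assigns its
points to the nearest of the shared initial centroids and emits per-cluster
partial sums; a "reducer" merges them. Assuming every mapper sees the same
centroids and uses the same distance measure, one partition and two
partitions should give identical assignments and the same updated centroids
up to floating-point rounding.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((p - x) ** 2 for p, x in zip(point, c))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def map_partition(points, centroids):
    """Mapper-like step: assign each point, accumulate per-cluster sums."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    assignments = []
    for p in points:
        k = nearest(p, centroids)
        assignments.append(k)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return assignments, sums, counts


def one_iteration(partitions, centroids):
    """Run the mapper step per partition, then merge like a reducer."""
    partials = [map_partition(part, centroids) for part in partitions]
    assignments = [a for part_assign, _, _ in partials for a in part_assign]
    new_centroids = []
    for k, old in enumerate(centroids):
        n = sum(counts[k] for _, _, counts in partials)
        if n == 0:
            # Empty cluster: keep the old centroid (a choice made for this sketch).
            new_centroids.append(list(old))
            continue
        total = [sum(sums[k][d] for _, sums, _ in partials)
                 for d in range(len(old))]
        new_centroids.append([t / n for t in total])
    return assignments, new_centroids


# Hypothetical data: 1000 random 3-d points, first 4 points as initial centroids.
rng = random.Random(0)
points = [[rng.random() for _ in range(3)] for _ in range(1000)]
centroids = [list(p) for p in points[:4]]

one_mapper = one_iteration([points], centroids)
two_mappers = one_iteration([points[:500], points[500:]], centroids)

same_assignments = one_mapper[0] == two_mappers[0]
same_centroids = all(
    math.isclose(a, b, abs_tol=1e-9)
    for ca, cb in zip(one_mapper[1], two_mappers[1])
    for a, b in zip(ca, cb)
)
```

If a real run behaves differently from this sketch at the very first
assignment, the difference probably lies in the inputs each mapper receives
(for example the centroid set or the distance measure) rather than in the
split itself, which is why the question about the initial centroids matters.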
