Of course, you should check out Mahout, or at least its documentation, even if you really want to implement it yourself: https://cwiki.apache.org/MAHOUT/k-means-clustering.html
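
To make the loop discussed below concrete, here is a rough, untested driver
sketch (old Hadoop 1.x mapreduce API; KMeansDriver, PointMapper,
CenterReducer and the HDFS paths are only placeholder names, not Mahout's
actual classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Path points = new Path("/kmeans/points");      // the real input, never changes
    Path centers = new Path("/kmeans/centers-0");  // seeded by the client (step 1)
    int maxIterations = 20;  // a real driver would also compare old and new
                             // centers and stop early (step 4)

    for (int i = 1; i <= maxIterations; i++) {
      Job job = new Job(new Configuration(), "k-means iteration " + i);
      job.setJarByClass(KMeansDriver.class);
      // Ship the current centers file to every mapper, read-only.
      DistributedCache.addCacheFile(centers.toUri(), job.getConfiguration());
      job.setMapperClass(PointMapper.class);
      job.setReducerClass(CenterReducer.class);
      job.setNumReduceTasks(1);  // one reducer -> one new centers file
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, points);
      Path outDir = new Path("/kmeans/iter-" + i);
      FileOutputFormat.setOutputPath(job, outDir);
      if (!job.waitForCompletion(true)) System.exit(1);
      // The single reducer's output becomes the cached centers file of the
      // next iteration.
      centers = new Path(outDir, "part-r-00000");
    }
  }
}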
Regards

Bertrand

On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <[email protected]> wrote:

> Actually, for the first step, the client could create a file with the
> centers, put it on HDFS, and use it with the distributed cache.
> A single reducer might be enough; in that case, its only responsibility is
> to create the file with the updated centers.
> You can then use this new file in the distributed cache instead of the
> first one.
>
> Your real input will always be your set of points.
>
> Regards
>
> Bertrand
>
> PS: One reducer should be enough because it only needs to aggregate the
> partial updates of each mapper. The volume of data sent to the reducer will
> change according to the number of centers but not the number of points.
>
>
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[email protected]> wrote:
>
>> Hi,
>> I'd like to implement k-means by myself, in the following naive way.
>> Given a large set of vectors:
>>
>> 1. Generate k random centers from the set.
>> 2. Each mapper reads all the centers and a split of the vector set, and
>>    emits the closest center as the key for each vector.
>> 3. The reducer calculates the new center and writes it.
>> 4. Go to step 2 until there is no change in the centers.
>>
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache since it's read-only. I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is how to distribute data produced by the reducers to
>> all the mappers.
>>
>> Thanks.
>
--
Bertrand Dechoux
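
For steps 2 and 3 above, a matching (equally untested) mapper and reducer
sketch, assuming both the points file and the centers file hold one vector
per line as comma-separated doubles:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class PointMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // The current centers arrive through the distributed cache. Read-only is
    // fine here: mappers only read them, they never update them.
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader r = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = r.readLine()) != null) centers.add(parse(line));
    } finally {
      r.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context ctx)
      throws IOException, InterruptedException {
    double[] p = parse(value.toString());
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centers.size(); c++) {
      double d = squaredDistance(p, centers.get(c));
      if (d < bestDist) { bestDist = d; best = c; }
    }
    // Key = index of the closest center, value = the point itself.
    ctx.write(new IntWritable(best), value);
  }

  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}

class CenterReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  protected void reduce(IntWritable centerId, Iterable<Text> points, Context ctx)
      throws IOException, InterruptedException {
    double[] sum = null;
    long count = 0;
    for (Text t : points) {
      double[] p = PointMapper.parse(t.toString());
      if (sum == null) sum = new double[p.length];
      for (int i = 0; i < p.length; i++) sum[i] += p[i];
      count++;
    }
    // The new center is the mean of the assigned points. Writing with a
    // NullWritable key keeps the output file in the same comma-separated
    // format as the input centers file.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(sum[i] / count);
    }
    ctx.write(NullWritable.get(), new Text(sb.toString()));
  }
}

Note that this sketch ships the raw points to the single reducer for
simplicity. The PS above suggests the better option: a combiner that emits
one partial (sum, count) pair per center from each map task, so the shuffle
carries roughly k records per mapper regardless of the number of points.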
