Of course, you should check out Mahout, or at least its documentation, even if you really want to implement it yourself: https://cwiki.apache.org/MAHOUT/k-means-clustering.html
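
To make the loop discussed below concrete, here is a rough, untested driver
sketch (old Hadoop 1.x mapreduce API; KMeansDriver, PointMapper,
CenterReducer and the HDFS paths are only placeholder names, not Mahout's
actual classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Path points = new Path("/kmeans/points");      // the real input, never changes
    Path centers = new Path("/kmeans/centers-0");  // seeded by the client (step 1)
    int maxIterations = 20;  // a real driver would also compare old and new
                             // centers and stop early (step 4)

    for (int i = 1; i <= maxIterations; i++) {
      Job job = new Job(new Configuration(), "k-means iteration " + i);
      job.setJarByClass(KMeansDriver.class);
      // Ship the current centers file to every mapper, read-only.
      DistributedCache.addCacheFile(centers.toUri(), job.getConfiguration());
      job.setMapperClass(PointMapper.class);
      job.setReducerClass(CenterReducer.class);
      job.setNumReduceTasks(1);  // one reducer -> one new centers file
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, points);
      Path outDir = new Path("/kmeans/iter-" + i);
      FileOutputFormat.setOutputPath(job, outDir);
      if (!job.waitForCompletion(true)) System.exit(1);
      // The single reducer's output becomes the cached centers file of the
      // next iteration.
      centers = new Path(outDir, "part-r-00000");
    }
  }
}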
Regards

Bertrand

On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <[email protected]> wrote:

> Actually, for the first step, the client could create a file with the
> centers, put it on HDFS, and use it with the distributed cache.
> A single reducer might be enough; in that case, its only responsibility is
> to create the file with the updated centers.
> You can then use this new file in the distributed cache instead of the
> first one.
>
> Your real input will always be your set of points.
>
> Regards
>
> Bertrand
>
> PS: One reducer should be enough because it only needs to aggregate the
> partial updates of each mapper. The volume of data sent to the reducer will
> change according to the number of centers but not the number of points.
>
>
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <[email protected]> wrote:
>
>> Hi,
>> I'd like to implement k-means by myself, in the following naive way.
>> Given a large set of vectors:
>>
>> 1. Generate k random centers from the set.
>> 2. Each mapper reads all the centers and a split of the vector set, and
>>    emits the closest center as the key for each vector.
>> 3. The reducer calculates the new center and writes it.
>> 4. Go to step 2 until there is no change in the centers.
>>
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache since it's read-only. I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is how to distribute data produced by the reducers to
>> all the mappers.
>>
>> Thanks.
>
--
Bertrand Dechoux
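
For steps 2 and 3 above, a matching (equally untested) mapper and reducer
sketch, assuming both the points file and the centers file hold one vector
per line as comma-separated doubles:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class PointMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // The current centers arrive through the distributed cache. Read-only is
    // fine here: mappers only read them, they never update them.
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader r = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = r.readLine()) != null) centers.add(parse(line));
    } finally {
      r.close();
    }
  }

  @Override
  protected void map(LongWritable offset, Text value, Context ctx)
      throws IOException, InterruptedException {
    double[] p = parse(value.toString());
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centers.size(); c++) {
      double d = squaredDistance(p, centers.get(c));
      if (d < bestDist) { bestDist = d; best = c; }
    }
    // Key = index of the closest center, value = the point itself.
    ctx.write(new IntWritable(best), value);
  }

  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}

class CenterReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  protected void reduce(IntWritable centerId, Iterable<Text> points, Context ctx)
      throws IOException, InterruptedException {
    double[] sum = null;
    long count = 0;
    for (Text t : points) {
      double[] p = PointMapper.parse(t.toString());
      if (sum == null) sum = new double[p.length];
      for (int i = 0; i < p.length; i++) sum[i] += p[i];
      count++;
    }
    // The new center is the mean of the assigned points. Writing with a
    // NullWritable key keeps the output file in the same comma-separated
    // format as the input centers file.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(sum[i] / count);
    }
    ctx.write(NullWritable.get(), new Text(sb.toString()));
  }
}

Note that this sketch ships the raw points to the single reducer for
simplicity. The PS above suggests the better option: a combiner that emits
one partial (sum, count) pair per center from each map task, so the shuffle
carries roughly k records per mapper regardless of the number of points.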
