Thanks! *Bertrand*: I don't like the idea of using a single reducer. A better approach, to my mind, is to have all the reducers write their output to the same directory and then distribute all the files (a rough sketch of what I mean is below). I know about Mahout, of course, but I want to implement it myself; I will look at the documentation, though. *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll read the material you linked.
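For what it's worth, here is a minimal sketch of the mapper side of that idea, assuming one comma-separated vector per text line and a job property "kmeans.centers.dir" pointing at the directory the previous iteration's reducers wrote to. Both the property name and the record format are my own placeholders, not anything standard:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load every center file the previous iteration's reducers wrote into
    // the shared directory. "kmeans.centers.dir" is an assumed job property.
    Path dir = new Path(context.getConfiguration().get("kmeans.centers.dir"));
    FileSystem fs = dir.getFileSystem(context.getConfiguration());
    for (FileStatus status : fs.listStatus(dir)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS and other non-data files
      }
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          centers.add(parse(line));
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the index of the nearest center as the key, the vector as value.
    double[] vector = parse(value.toString());
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = 0;
      for (int j = 0; j < vector.length; j++) {
        double diff = vector[j] - centers.get(i)[j];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
        nearest = i;
      }
    }
    context.write(new IntWritable(nearest), value);
  }

  // Assumed record format: one comma-separated vector per line.
  private static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }
}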
On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[email protected]> wrote:
> If you're also a fan of doing things the better way, you can also
> check out some Apache Crunch (http://crunch.apache.org) ways of doing
> this via https://github.com/cloudera/ml (blog post:
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>
> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[email protected]> wrote:
> > Hi,
> > I'd like to implement k-means myself, in the following naive way.
> > Given a large set of vectors:
> >
> > 1. Generate k random centers from the set.
> > 2. Mapper reads all the centers and a split of the vector set, and
> >    emits the closest center as a key for each vector.
> > 3. Reducer calculates a new center and writes it.
> > 4. Go to step 2 until there is no change in the centers.
> >
> > My question is very basic: how do I distribute all the new centers
> > (produced by the reducers) to all the mappers? I can't use the
> > distributed cache since it's read-only. I can't use context.write
> > since it will create a file for each reduce task, and I need a single
> > file. The more general issue here is: how do I distribute data
> > produced by a reducer to all the mappers?
> >
> > Thanks.
>
> --
> Harsh J
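P.S. And the matching reducer sketch for step 3 of the quoted algorithm, under the same assumed text format: each reduce task averages the vectors assigned to one center and writes the result into the shared output directory, which the driver then passes to the next iteration's mappers as "kmeans.centers.dir".

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
    extends Reducer<IntWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(IntWritable centerIndex, Iterable<Text> vectors,
      Context context) throws IOException, InterruptedException {
    double[] sum = null;
    long count = 0;
    for (Text value : vectors) {
      String[] parts = value.toString().split(",");
      if (sum == null) {
        sum = new double[parts.length];
      }
      for (int i = 0; i < parts.length; i++) {
        sum[i] += Double.parseDouble(parts[i]);
      }
      count++;
    }
    // The new center is the mean of the vectors assigned to this key; each
    // reduce task writes its share of centers into the job output directory,
    // which becomes the next iteration's "kmeans.centers.dir".
    StringBuilder center = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) {
        center.append(',');
      }
      center.append(sum[i] / count);
    }
    context.write(NullWritable.get(), new Text(center.toString()));
  }
}

The driver loop would then compare the old and new center files after each job and stop once nothing has changed.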
