Spark would be an excellent choice for an iterative algorithm like k-means. It could be good for sketch-based algorithms as well, but the difference would be much less pronounced.
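To make that concrete, below is a minimal, hypothetical sketch of the iterative loop over a cached data set, written against a newer Spark Java API than the 0.7-era one current when this thread was written. The class name, the space-separated input format (one vector per line, all of the same dimension), the takeSample seeding, and the fixed iteration cap are illustrative assumptions, not code from the thread; only the small list of centers is shipped to the executors on each pass, while the parsed vectors stay cached in memory.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class KMeansOnSpark {

  public static void main(String[] args) {
    int k = Integer.parseInt(args[1]);
    // Master URL is supplied by spark-submit.
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("kmeans-sketch"));

    // Parse the vectors once and keep them in memory; every iteration below
    // reuses this cached RDD instead of re-reading the input from disk.
    JavaRDD<double[]> points = sc.textFile(args[0])
        .map(line -> Arrays.stream(line.trim().split("\\s+"))
                           .mapToDouble(Double::parseDouble).toArray())
        .cache();

    // Seed with k random points from the data set.
    List<double[]> centers = new ArrayList<>(points.takeSample(false, k));

    for (int iter = 0; iter < 20; iter++) {  // fixed cap; a real loop tests convergence
      List<double[]> current = centers;
      // closest-center id -> (sum of assigned vectors, count), then average.
      Map<Integer, Tuple2<double[], Long>> stats = points
          .mapToPair(p -> new Tuple2<Integer, Tuple2<double[], Long>>(
              closest(current, p), new Tuple2<>(p, 1L)))
          .reduceByKey((a, b) -> new Tuple2<>(add(a._1(), b._1()), a._2() + b._2()))
          .collectAsMap();
      stats.forEach((id, s) -> centers.set(id, scale(s._1(), 1.0 / s._2())));
    }

    centers.forEach(c -> System.out.println(Arrays.toString(c)));
    sc.stop();
  }

  // Index of the nearest center by squared Euclidean distance.
  private static int closest(List<double[]> centers, double[] p) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = 0;
      for (int j = 0; j < p.length; j++) {
        double diff = p[j] - centers.get(i)[j];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }

  private static double[] add(double[] a, double[] b) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
    return out;
  }

  private static double[] scale(double[] a, double f) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) out[i] = a[i] * f;
    return out;
  }
}

That reuse of the cached RDD is exactly what a chain of MapReduce jobs gives up by re-reading and re-parsing the input on every iteration.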
On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl <[email protected]> wrote:
> I would think also that starting with the centers in some in-memory Hadoop
> platform like Spark would also be a valid approach.
> I think the Spark demo assumes that the data set is cached, not just the centers.
> C
>
> On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux <[email protected]> wrote:
>
> And there is also Cascading ;) : http://www.cascading.org/
> But like Crunch, this is Hadoop. Both are 'only' higher APIs for MapReduce.
>
> As for the number of reducers, you will have to do the math yourself, but
> I highly doubt that more than one reducer is needed (imho). But you can
> indeed distribute the work by the center identifier.
>
> Bertrand
>
>
> On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <[email protected]> wrote:
>
>> Thanks!
>> *Bertrand*: I don't like the idea of using a single reducer. A better
>> way for me is to write the output of all the reducers to the same
>> directory, and then distribute all the files.
>> I know about Mahout of course, but I want to implement it myself. I will
>> look at the documentation though.
>> *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll
>> read the stuff you linked.
>>
>>
>> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[email protected]> wrote:
>>
>>> If you're also a fan of doing things the better way, you can also
>>> check out some Apache Crunch (http://crunch.apache.org) ways of doing
>>> this via https://github.com/cloudera/ml (blog post:
>>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>>
>>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[email protected]>
>>> wrote:
>>> > Hi,
>>> > I'd like to implement k-means by myself, in the following naive way.
>>> > Given a large set of vectors:
>>> >
>>> > 1. Generate k random centers from the set.
>>> > 2. Each mapper reads all the centers and a split of the vector set, and
>>> >    emits for each vector the closest center as the key.
>>> > 3. Each reducer calculates a new center and writes it.
>>> > 4. Go to step 2 until there is no change in the centers.
>>> >
>>> > My question is very basic: how do I distribute all the new centers
>>> > (produced by the reducers) to all the mappers? I can't use the
>>> > distributed cache since it's read-only. I can't use context.write since
>>> > it will create a file for each reduce task, and I need a single file.
>>> > The more general issue here is: how do I distribute data produced by
>>> > the reducers to all the mappers?
>>> >
>>> > Thanks.
>>>
>>>
>>> --
>>> Harsh J
>>
>
>
> --
> Bertrand Dechoux
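For the plain-MapReduce route discussed above, here is a minimal, hypothetical sketch of one way to close the loop, assuming the Hadoop 2 mapreduce API: each iteration's reducers write the new centers into a fresh HDFS directory (workDir/centers-i), the driver passes that directory name to the next job through the configuration, and every mapper re-reads the small centers files in setup(). The averaging reducer and the convergence check are omitted, and the one-center-per-line, space-separated file format is an assumption, not something specified in the thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {

  public static class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<double[]> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Load the centers written by the previous iteration's reducers. Reading
      // several part-r-* files is as easy as reading one, so neither a single
      // reducer nor a writable distributed cache is required.
      Configuration conf = context.getConfiguration();
      Path centersDir = new Path(conf.get("kmeans.centers.dir"));
      FileSystem fs = centersDir.getFileSystem(conf);
      for (FileStatus status : fs.listStatus(centersDir)) {
        if (!status.getPath().getName().startsWith("part-")) continue; // skip _SUCCESS etc.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(status.getPath())))) {
          String line;
          while ((line = in.readLine()) != null) {
            centers.add(parse(line));   // assumed: one center per line, space-separated
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (id of the closest center, the vector); the reducer (omitted here)
      // would average the vectors per center and write each new center out.
      double[] p = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.size(); c++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) {
          double diff = p[j] - centers.get(c)[j];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      context.write(new Text(Integer.toString(best)), value);
    }

    private static double[] parse(String line) {
      String[] parts = line.trim().split("\\s+");
      double[] v = new double[parts.length];
      for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
      return v;
    }
  }

  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);    // the vectors
    Path workDir = new Path(args[1]);  // workDir/centers-0 holds the k seed centers
    for (int i = 0; i < 20; i++) {     // fixed cap; a real driver also checks convergence
      Configuration conf = new Configuration();
      conf.set("kmeans.centers.dir", new Path(workDir, "centers-" + i).toString());

      Job job = Job.getInstance(conf, "k-means iteration " + (i + 1));
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      // setReducerClass(...) for the averaging reducer would go here.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, new Path(workDir, "centers-" + (i + 1)));
      if (!job.waitForCompletion(true)) System.exit(1);
      // Here the driver would compare centers-i with centers-(i + 1) and stop
      // once nothing moved (step 4 above).
    }
  }
}

Because every iteration is a separate job, the read-only distributed cache never gets in the way; the centers simply live in HDFS and are re-read at the start of each job, which is cheap since there are only k of them and multiple part-r-* files are no harder to read than one.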
