Thanks! *Bertrand*: I don't like the idea of using a single reducer. A better approach, to my mind, is to have all the reducers write their output to the same directory and then distribute all the files (a rough sketch of what I mean is below). I know about Mahout, of course, but I want to implement it myself; I will look at the documentation, though. *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll read the material you linked.
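For what it's worth, here is a minimal sketch of the mapper side of that idea, assuming one comma-separated vector per text line and a job property "kmeans.centers.dir" pointing at the directory the previous iteration's reducers wrote to. Both the property name and the record format are my own placeholders, not anything standard:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    // Load every center file the previous iteration's reducers wrote into
    // the shared directory. "kmeans.centers.dir" is an assumed job property.
    Path dir = new Path(context.getConfiguration().get("kmeans.centers.dir"));
    FileSystem fs = dir.getFileSystem(context.getConfiguration());
    for (FileStatus status : fs.listStatus(dir)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS and other non-data files
      }
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          centers.add(parse(line));
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit the index of the nearest center as the key, the vector as value.
    double[] vector = parse(value.toString());
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = 0;
      for (int j = 0; j < vector.length; j++) {
        double diff = vector[j] - centers.get(i)[j];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
        nearest = i;
      }
    }
    context.write(new IntWritable(nearest), value);
  }

  // Assumed record format: one comma-separated vector per line.
  private static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }
}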
On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[email protected]> wrote:
> If you're also a fan of doing things the better way, you can also
> check out some Apache Crunch (http://crunch.apache.org) ways of doing
> this via https://github.com/cloudera/ml (blog post:
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>
> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[email protected]> wrote:
> > Hi,
> > I'd like to implement k-means myself, in the following naive way.
> > Given a large set of vectors:
> >
> > 1. Generate k random centers from the set.
> > 2. Mapper reads all the centers and a split of the vector set, and
> >    emits the closest center as a key for each vector.
> > 3. Reducer calculates a new center and writes it.
> > 4. Go to step 2 until there is no change in the centers.
> >
> > My question is very basic: how do I distribute all the new centers
> > (produced by the reducers) to all the mappers? I can't use the
> > distributed cache since it's read-only. I can't use context.write
> > since it will create a file for each reduce task, and I need a single
> > file. The more general issue here is: how do I distribute data
> > produced by a reducer to all the mappers?
> >
> > Thanks.
>
> --
> Harsh J
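P.S. And the matching reducer sketch for step 3 of the quoted algorithm, under the same assumed text format: each reduce task averages the vectors assigned to one center and writes the result into the shared output directory, which the driver then passes to the next iteration's mappers as "kmeans.centers.dir".

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
    extends Reducer<IntWritable, Text, NullWritable, Text> {

  @Override
  protected void reduce(IntWritable centerIndex, Iterable<Text> vectors,
      Context context) throws IOException, InterruptedException {
    double[] sum = null;
    long count = 0;
    for (Text value : vectors) {
      String[] parts = value.toString().split(",");
      if (sum == null) {
        sum = new double[parts.length];
      }
      for (int i = 0; i < parts.length; i++) {
        sum[i] += Double.parseDouble(parts[i]);
      }
      count++;
    }
    // The new center is the mean of the vectors assigned to this key; each
    // reduce task writes its share of centers into the job output directory,
    // which becomes the next iteration's "kmeans.centers.dir".
    StringBuilder center = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) {
        center.append(',');
      }
      center.append(sum[i] / count);
    }
    context.write(NullWritable.get(), new Text(center.toString()));
  }
}

The driver loop would then compare the old and new center files after each job and stop once nothing has changed.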
