Spark would be an excellent choice for an iterative algorithm like k-means. It could be good for sketch-based algorithms as well, but the difference would be much less pronounced.
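To make that concrete, below is a minimal, hypothetical sketch of the iterative loop over a cached data set, written against a newer Spark Java API than the 0.7-era one current when this thread was written. The class name, the space-separated input format (one vector per line, all of the same dimension), the takeSample seeding, and the fixed iteration cap are illustrative assumptions, not code from the thread; only the small list of centers is shipped to the executors on each pass, while the parsed vectors stay cached in memory.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class KMeansOnSpark {

  public static void main(String[] args) {
    int k = Integer.parseInt(args[1]);
    // Master URL is supplied by spark-submit.
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("kmeans-sketch"));

    // Parse the vectors once and keep them in memory; every iteration below
    // reuses this cached RDD instead of re-reading the input from disk.
    JavaRDD<double[]> points = sc.textFile(args[0])
        .map(line -> Arrays.stream(line.trim().split("\\s+"))
                           .mapToDouble(Double::parseDouble).toArray())
        .cache();

    // Seed with k random points from the data set.
    List<double[]> centers = new ArrayList<>(points.takeSample(false, k));

    for (int iter = 0; iter < 20; iter++) {  // fixed cap; a real loop tests convergence
      List<double[]> current = centers;
      // closest-center id -> (sum of assigned vectors, count), then average.
      Map<Integer, Tuple2<double[], Long>> stats = points
          .mapToPair(p -> new Tuple2<Integer, Tuple2<double[], Long>>(
              closest(current, p), new Tuple2<>(p, 1L)))
          .reduceByKey((a, b) -> new Tuple2<>(add(a._1(), b._1()), a._2() + b._2()))
          .collectAsMap();
      stats.forEach((id, s) -> centers.set(id, scale(s._1(), 1.0 / s._2())));
    }

    centers.forEach(c -> System.out.println(Arrays.toString(c)));
    sc.stop();
  }

  // Index of the nearest center by squared Euclidean distance.
  private static int closest(List<double[]> centers, double[] p) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = 0;
      for (int j = 0; j < p.length; j++) {
        double diff = p[j] - centers.get(i)[j];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }

  private static double[] add(double[] a, double[] b) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
    return out;
  }

  private static double[] scale(double[] a, double f) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) out[i] = a[i] * f;
    return out;
  }
}

That reuse of the cached RDD is exactly what a chain of MapReduce jobs gives up by re-reading and re-parsing the input on every iteration.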
On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl <[email protected]> wrote:
> I would think also that starting with the centers in some in-memory Hadoop
> platform like Spark would also be a valid approach.
> I think the Spark demo assumes that the data set is cached, not just the centers.
> C
>
> On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux <[email protected]> wrote:
>
> And there is also Cascading ;) : http://www.cascading.org/
> But like Crunch, this is Hadoop. Both are 'only' higher APIs for MapReduce.
>
> As for the number of reducers, you will have to do the math yourself, but
> I highly doubt that more than one reducer is needed (imho). But you can
> indeed distribute the work by the center identifier.
>
> Bertrand
>
>
> On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <[email protected]> wrote:
>
>> Thanks!
>> *Bertrand*: I don't like the idea of using a single reducer. A better
>> way for me is to write the output of all the reducers to the same
>> directory, and then distribute all the files.
>> I know about Mahout of course, but I want to implement it myself. I will
>> look at the documentation though.
>> *Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll
>> read the stuff you linked.
>>
>>
>> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <[email protected]> wrote:
>>
>>> If you're also a fan of doing things the better way, you can also
>>> check out some Apache Crunch (http://crunch.apache.org) ways of doing
>>> this via https://github.com/cloudera/ml (blog post:
>>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>>
>>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <[email protected]>
>>> wrote:
>>> > Hi,
>>> > I'd like to implement k-means by myself, in the following naive way.
>>> > Given a large set of vectors:
>>> >
>>> > 1. Generate k random centers from the set.
>>> > 2. Each mapper reads all the centers and a split of the vector set, and
>>> >    emits for each vector the closest center as the key.
>>> > 3. Each reducer calculates a new center and writes it.
>>> > 4. Go to step 2 until there is no change in the centers.
>>> >
>>> > My question is very basic: how do I distribute all the new centers
>>> > (produced by the reducers) to all the mappers? I can't use the
>>> > distributed cache since it's read-only. I can't use context.write since
>>> > it will create a file for each reduce task, and I need a single file.
>>> > The more general issue here is: how do I distribute data produced by
>>> > the reducers to all the mappers?
>>> >
>>> > Thanks.
>>>
>>>
>>> --
>>> Harsh J
>>
>
>
> --
> Bertrand Dechoux
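For the plain-MapReduce route discussed above, here is a minimal, hypothetical sketch of one way to close the loop, assuming the Hadoop 2 mapreduce API: each iteration's reducers write the new centers into a fresh HDFS directory (workDir/centers-i), the driver passes that directory name to the next job through the configuration, and every mapper re-reads the small centers files in setup(). The averaging reducer and the convergence check are omitted, and the one-center-per-line, space-separated file format is an assumption, not something specified in the thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {

  public static class KMeansMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<double[]> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Load the centers written by the previous iteration's reducers. Reading
      // several part-r-* files is as easy as reading one, so neither a single
      // reducer nor a writable distributed cache is required.
      Configuration conf = context.getConfiguration();
      Path centersDir = new Path(conf.get("kmeans.centers.dir"));
      FileSystem fs = centersDir.getFileSystem(conf);
      for (FileStatus status : fs.listStatus(centersDir)) {
        if (!status.getPath().getName().startsWith("part-")) continue; // skip _SUCCESS etc.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(status.getPath())))) {
          String line;
          while ((line = in.readLine()) != null) {
            centers.add(parse(line));   // assumed: one center per line, space-separated
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (id of the closest center, the vector); the reducer (omitted here)
      // would average the vectors per center and write each new center out.
      double[] p = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.size(); c++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) {
          double diff = p[j] - centers.get(c)[j];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      context.write(new Text(Integer.toString(best)), value);
    }

    private static double[] parse(String line) {
      String[] parts = line.trim().split("\\s+");
      double[] v = new double[parts.length];
      for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
      return v;
    }
  }

  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);    // the vectors
    Path workDir = new Path(args[1]);  // workDir/centers-0 holds the k seed centers
    for (int i = 0; i < 20; i++) {     // fixed cap; a real driver also checks convergence
      Configuration conf = new Configuration();
      conf.set("kmeans.centers.dir", new Path(workDir, "centers-" + i).toString());

      Job job = Job.getInstance(conf, "k-means iteration " + (i + 1));
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      // setReducerClass(...) for the averaging reducer would go here.
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, new Path(workDir, "centers-" + (i + 1)));
      if (!job.waitForCompletion(true)) System.exit(1);
      // Here the driver would compare centers-i with centers-(i + 1) and stop
      // once nothing moved (step 4 above).
    }
  }
}

Because every iteration is a separate job, the read-only distributed cache never gets in the way; the centers simply live in HDFS and are re-read at the start of each job, which is cheap since there are only k of them and multiple part-r-* files are no harder to read than one.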
