Thanks for the useful links.

Cheers,

Julien



2014-08-21 11:47 GMT+02:00 Yanbo Liang <yanboha...@gmail.com>:

> In Spark/MLlib, shipping data such as the k-means cluster centers through
> task serialization was replaced by broadcast variables for performance
> reasons.
> You can refer to this PR: https://github.com/apache/spark/pull/1427
> The current k-means implementation in MLlib also benefits from sparse
> vector computation:
> http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
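>
> As a rough sketch of that pattern (purely illustrative, not the actual
> MLlib code; it assumes a SparkContext sc, an RDD of points as
> Array[Double], plus hypothetical initialCenters, maxIterations and
> recomputeCenters), the centers are re-broadcast at each iteration instead
> of being captured in every task closure:
>
>   var centers: Array[Array[Double]] = initialCenters
>   for (iter <- 1 to maxIterations) {
>     // Ship the current centers once per executor instead of once per task.
>     val bcCenters = sc.broadcast(centers)
>     val assignments = points.map { p =>
>       // Squared Euclidean distance, computed locally on the executor.
>       def sqdist(a: Array[Double], b: Array[Double]): Double =
>         a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
>       // Index of the closest center, read from the broadcast value.
>       val best = bcCenters.value.zipWithIndex
>         .minBy { case (c, _) => sqdist(c, p) }._2
>       (best, p)
>     }
>     centers = recomputeCenters(assignments) // hypothetical helper
>     bcCenters.unpersist()                   // release the previous broadcast
>   }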
>
>
>
> 2014-08-21 15:40 GMT+08:00 Julien Naour <julna...@gmail.com>:
>
>> My arrays are in fact Array[Array[Long]], roughly 17 x 150 000 (17 centers
>> with 150 000 modalities; I'm working on qualitative variables), so they are
>> pretty large. I'm working on making them smaller; it's mostly a sparse
>> matrix.
>> Good things to know nevertheless.
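>>
>> For what it's worth, a minimal sketch of how one of those centers could be
>> stored sparsely with MLlib vectors (illustrative only; the indices and
>> values are made up, and counts have to be stored as Double):
>>
>>   import org.apache.spark.mllib.linalg.Vectors
>>
>>   // Keep only the non-zero modality counts of a single center.
>>   val indices = Array(3, 42, 117)      // positions of the non-zero modalities
>>   val values  = Array(12.0, 7.0, 1.0)  // their counts, as Double
>>   val sparseCenter = Vectors.sparse(150000, indices, values)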
>>
>> Thanks,
>>
>> Julien Naour
>>
>>
>> 2014-08-20 23:27 GMT+02:00 Patrick Wendell <pwend...@gmail.com>:
>>
>>> For large objects, it will be more efficient to broadcast them. If your
>>> array is small, it won't really matter. How many centers do you have? Unless
>>> you find that you have very large tasks (Spark will print a
>>> warning about this), it can be okay to just reference the array directly.
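>>>
>>> To make the trade-off concrete, a minimal sketch of the two options (sc,
>>> points, centers and a hypothetical findClosest helper are assumed):
>>>
>>>   // Option 1: reference the array directly; it is serialized into every task.
>>>   val assigned = points.map(p => findClosest(centers, p))
>>>
>>>   // Option 2: broadcast it once per executor; tasks only carry a small handle.
>>>   val bcCenters = sc.broadcast(centers)
>>>   val assignedViaBroadcast = points.map(p => findClosest(bcCenters.value, p))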
>>>
>>>
>>> On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour <julna...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question about broadcast. I'm working on a clustering
>>>> algorithm close to KMeans. It seems that KMeans broadcasts the cluster
>>>> centers at each step. For the moment I just keep my centers in an Array
>>>> that I reference directly in my map at each step. Could it be more
>>>> efficient to use a broadcast variable instead of a simple variable?
>>>>
>>>> Cheers,
>>>>
>>>> Julien Naour
>>>>
>>>
>>>
>>
>
