Thanks for the useful links. Cheers,
Julien

2014-08-21 11:47 GMT+02:00 Yanbo Liang <yanboha...@gmail.com>:

> In Spark/MLlib, task serialization of things like the k-means cluster centers
> was replaced by broadcast variables for performance reasons.
> You can refer to this PR: https://github.com/apache/spark/pull/1427
> Also, the current k-means implementation in MLlib benefits from sparse
> vector computing.
> http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
>
>
> 2014-08-21 15:40 GMT+08:00 Julien Naour <julna...@gmail.com>:
>
>> My arrays are in fact Array[Array[Long]], roughly 17x150000 (17 centers
>> with 150 000 modalities; I'm working on qualitative variables), so they are
>> pretty large. I'm working on making them smaller; it's mostly a sparse
>> matrix.
>> Good things to know nevertheless.
>>
>> Thanks,
>>
>> Julien Naour
>>
>>
>> 2014-08-20 23:27 GMT+02:00 Patrick Wendell <pwend...@gmail.com>:
>>
>>> For large objects, it will be more efficient to broadcast them. If your
>>> array is small it won't really matter. How many centers do you have? Unless
>>> you are finding that you have very large tasks (and Spark will print a
>>> warning about this), it could be okay to just reference it directly.
>>>
>>>
>>> On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour <julna...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question about broadcast. I'm working on a clustering
>>>> algorithm close to KMeans. It seems that KMeans broadcasts its cluster
>>>> centers at each step. For the moment I just use my centers as an Array
>>>> that I reference directly in my map at each step. Could it be more
>>>> efficient to use a broadcast variable instead of a plain variable?
>>>>
>>>> Cheers,
>>>>
>>>> Julien Naour
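
A minimal, self-contained sketch of the two approaches Patrick describes above
(capturing the centers directly in the task closure vs. broadcasting them),
written against the Spark 1.x Scala API. The object name, the toy data and the
sqDist/closest helpers are illustrative assumptions only, not code from this
thread or from MLlib:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCentersSketch {
  // Squared distance between a point and a center (dense Long vectors,
  // purely for illustration).
  def sqDist(a: Array[Long], b: Array[Long]): Long =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest center to point p.
  def closest(centers: Array[Array[Long]], p: Array[Long]): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    val points = sc.parallelize(Seq(
      Array(1L, 2L, 3L), Array(10L, 0L, 5L), Array(2L, 2L, 2L)))

    val centers: Array[Array[Long]] =
      Array(Array(0L, 0L, 0L), Array(10L, 10L, 10L))

    // Option 1: reference `centers` directly. The array is captured in the
    // task closure and re-serialized with every task, at every iteration.
    val direct = points.map(p => closest(centers, p)).collect()

    // Option 2: broadcast the centers. They are shipped to each executor
    // once per broadcast and reused by all tasks running on that executor.
    val bcCenters = sc.broadcast(centers)
    val viaBroadcast = points.map(p => closest(bcCenters.value, p)).collect()
    bcCenters.unpersist()  // release executor copies once the step is done

    println(direct.toSeq + " vs " + viaBroadcast.toSeq)
    sc.stop()
  }
}

In an iterative algorithm the broadcast would typically be recreated from the
updated centers at each step, which is the pattern the MLlib k-means PR linked
above moved to.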