Calling collect() on an RDD is almost always a bad idea, because it pulls
the entire dataset onto the driver. The only exception is when you want to
hand that data off to another system and never see it again :).
I would suggest implementing the outlier detection on the RDD and
processing it in Spark itself rather than calling collect() on it.
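
A rough sketch of what "process it in Spark itself" could look like for the
box plot rule. This assumes PySpark and a DStream of (key, float) pairs per
column; the thread does not say which API is in use, so treat the function
names, the nearest-rank quartile shortcut, and the print-based sink as
illustrative assumptions rather than a definitive implementation.

```python
# Sketch only: assumes PySpark and a DStream of (key, float) pairs per column.
# All function names and the print-based sink are hypothetical.

def fences(q1, q3, k=1.5):
    """Box plot fences: values outside [q1 - k*IQR, q3 + k*IQR] are outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, lower, upper):
    return value < lower or value > upper


def outliers_in_rdd(rdd, k=1.5):
    """Return an RDD holding only the outlier pairs of a (key, value) RDD."""
    n = rdd.count()
    if n == 0:
        return rdd
    # Sort the values and key them by rank so quartiles can be looked up.
    by_rank = (rdd.values()
                  .sortBy(lambda v: v)
                  .zipWithIndex()
                  .map(lambda pair: (pair[1], pair[0]))
                  .cache())
    # Nearest-rank quartiles: an approximation, not an interpolated quantile.
    q1 = by_rank.lookup(n // 4)[0]
    q3 = by_rank.lookup((3 * n) // 4)[0]
    by_rank.unpersist()
    lower, upper = fences(q1, q3, k)
    # Only the count and two quartiles reach the driver; the filter itself
    # runs on the executors.
    return rdd.filter(lambda kv: is_outlier(kv[1], lower, upper))


def report(partition):
    # Hypothetical sink; replace with a write to your store of choice.
    for key, value in partition:
        print(key, value)


def attach_outlier_detection(pair_dstream):
    # One-argument lambdas: transform/foreachRDD also accept (time, rdd)
    # functions, so the wrapper keeps the call unambiguous.
    flagged = pair_dstream.transform(lambda rdd: outliers_in_rdd(rdd))
    flagged.foreachRDD(lambda rdd: rdd.foreachPartition(report))
```

With this shape the heavy work stays on the executors, so no separate
thread on the driver is needed; calling attach_outlier_detection once per
column DStream would be enough.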

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo <eko.harmawan.sus...@gmail.com>
wrote:

> Hi All,
>
> I have a problem with Spark Streaming that I would like to ask about.
>
> I have a Spark Streaming application that parses a file (which grows
> over time). The file contains several columns of numbers, and the
> parsing is divided into windows of 1 minute each. Each column
> represents a different attribute (for example, the first column is
> temperature, the second is humidity, etc.), while each row within a
> column holds a value of that attribute. I use a PairDStream for each
> column.
>
> Afterwards, I need to run a time-consuming algorithm (outlier detection;
> for now I use the box plot algorithm) on each RDD of each PairDStream.
>
> To run the outlier detection, I am currently thinking of calling collect
> on each RDD of the PairDStream from foreachRDD to get the list of items,
> and then passing each list to a thread. Each thread runs the outlier
> detection algorithm and processes the result.
>
> I run the outlier detection in a separate thread so as not to put too
> much burden on the Spark Streaming task. So I would like to ask whether
> this model carries any risk, or whether the framework provides an
> alternative so that I don't have to run a separate thread for this.
>
> Thank you for your attention.
>
>
>
> --
> Best Regards,
> Eko Susilo
>
