The second choice is better. Once you call collect(), you pull all of
the data onto a single node (the driver). You want to do most of the
processing in parallel on the cluster, which is what map() does.
Ideally you'd summarize or reduce the data before calling collect().
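For example, here's a minimal sketch along those lines, assuming
process(event) is your existing per-event function and a count is a
good enough stand-in for whatever summary you actually need:

    kafkaStream.foreachRDD(rdd => {
      // process() runs in parallel on the executors
      val processed = rdd.map(event => process(event))
      // reduce on the cluster, so only a small summary
      // ever reaches the driver
      val count = processed.count()
      println("processed " + count + " events in this batch")
    })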

On Fri, Dec 5, 2014 at 5:26 AM, david <david...@free.fr> wrote:
> hi,
>
>   What is the best way to process a batch window in Spark Streaming:
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.collect().foreach(event => {
>         // process the event
>         process(event)
>       })
>     })
>
>
> Or
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.map(event => {
>         // process the event
>         process(event)
>       }).collect()
>     })
>
>
> thanks
