Re: Spark Streaming Kafka best practices?
Hi,

On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell wrote:
> On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> > I was wondering why one would choose rdd.map vs rdd.foreach to
> > execute a side-effecting function on an RDD.

Personally, I like to get a count of the processed items, so I do something like

    rdd.map(item => processItem(item)).count()

instead of

    rdd.foreach(item => processItem(item))

but I would be happy to learn about a better way.

Tobias
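Tobias's counting pattern might look like the following inside a streaming job. This is only a sketch: `processItem` is a hypothetical per-event handler, and the code assumes a running Spark Streaming context with a `kafkaStream` DStream as in the original question.

```scala
// Sketch of the map + count pattern; `processItem` is hypothetical.
kafkaStream.foreachRDD { rdd =>
  // map() is lazy, so count() is the action that actually forces
  // processItem to run on the executors; its return value is the
  // number of items processed in this batch.
  val processed = rdd.map(item => processItem(item)).count()
  println(s"Processed $processed events in this batch")
}
```

Note that relying on side effects inside map() works, but it hides the intent; the count is the only reason to prefer it over foreach here.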
Re: Spark Streaming Kafka best practices?
foreach is slightly more efficient because Spark doesn't try to collect results from each task, since it's understood there will be no return value. I think the difference is very marginal, though; it's mostly stylistic. Typically you use foreach for something that is intended to produce a side effect, and map for something that will return a new dataset.

On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> Patrick,
>
> I was wondering why one would choose rdd.map vs rdd.foreach to
> execute a side-effecting function on an RDD.
>
> -kr, Gerard.
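The stylistic distinction Patrick describes can be sketched as follows. This assumes hypothetical `process` (side-effecting) and `transform` (value-returning) functions and an existing `rdd`:

```scala
// Side effect only: foreach is an action. It runs immediately on the
// executors, returns Unit, and ships nothing back to the driver.
rdd.foreach(event => process(event))

// Transformation: map is lazy and returns a new RDD. Nothing runs until
// an action (count, collect, saveAsTextFile, ...) forces evaluation.
val transformed = rdd.map(event => transform(event))
transformed.count()  // this action triggers the map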
Re: Spark Streaming Kafka best practices?
Patrick,

I was wondering why one would choose rdd.map vs rdd.foreach to execute a side-effecting function on an RDD.

-kr, Gerard.

On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell wrote:
> The second choice is better. Once you call collect() you are pulling
> all of the data onto a single node; you want to do most of the
> processing in parallel on the cluster, which is what map() will do.
> Ideally you'd try to summarize or reduce the data before calling
> collect().
Re: Spark Streaming Kafka best practices?
The second choice is better. Once you call collect() you are pulling all of the data onto a single node; you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize or reduce the data before calling collect().

On Fri, Dec 5, 2014 at 5:26 AM, david wrote:
> hi,
>
> What is the best way to process a batch window in Spark Streaming:
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.collect().foreach(event => {
>         // process the event
>         process(event)
>       })
>     })
>
> or
>
>     kafkaStream.foreachRDD(rdd => {
>       rdd.map(event => {
>         // process the event
>         process(event)
>       }).collect()
>     })
>
> thanks
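Patrick's "summarize or reduce before collect()" advice could look like this sketch, which aggregates on the cluster and pulls back only a small summary. The error-log example and `parseErrorCode` helper are hypothetical:

```scala
// Instead of collect()-ing raw events to the driver, aggregate on the
// executors first; only the per-key totals cross the network.
val errorCounts = rdd
  .filter(line => line.contains("ERROR"))
  .map(line => (parseErrorCode(line), 1L))  // parseErrorCode is hypothetical
  .reduceByKey(_ + _)                       // runs in parallel on the cluster
  .collect()                                // small: one entry per error code
```

The key point is that reduceByKey runs distributed, so collect() only ever sees the already-summarized result.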
Spark Streaming Kafka best practices?
hi,

What is the best way to process a batch window in Spark Streaming:

    kafkaStream.foreachRDD(rdd => {
      rdd.collect().foreach(event => {
        // process the event
        process(event)
      })
    })

or

    kafkaStream.foreachRDD(rdd => {
      rdd.map(event => {
        // process the event
        process(event)
      }).collect()
    })

thanks
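Combining the advice in the replies above, a third option avoids collect() entirely and processes events on the executors. This is a sketch, not a definitive implementation: `process` and `createConnection` are hypothetical, and it assumes `process` can run on worker nodes (i.e., it and anything it closes over are serializable).

```scala
// Process events in parallel on the executors; never pull raw events
// to the driver with collect().
kafkaStream.foreachRDD { rdd =>
  rdd.foreach(event => process(event))
}

// If processing needs a heavyweight resource (e.g. a database
// connection), set it up once per partition rather than once per event:
kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    val conn = createConnection()  // hypothetical connection factory
    events.foreach(event => process(conn, event))
    conn.close()
  }
}
```

foreachPartition amortizes per-record setup cost and is the shape usually recommended for pushing stream output to external systems.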