"These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives."

Sean, different operations though they are, they can certainly be applied to
the same data set.  In that sense, they are alternatives: code written with
either one can achieve the same effect, though likely at a different
efficiency cost.

The question is, what are the effects of applying one vs. the other?

My specific scenario is this: I'm streaming data out of Kafka.  I want to
perform a few transformations and then apply an action that results in e.g.
writing this data to Solr.  According to Evo, my best bet is
foreachPartition because of the increased parallelism (which I'd need to
grok to understand in detail).
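To make the parallelism point concrete, here is a framework-free sketch of the pattern I understand is being suggested, i.e. dstream.foreachRDD with rdd.foreachPartition inside it. A micro-batch is modeled as a plain list of partitions, and MockSolrClient is a hypothetical stand-in for a real Solr client; the point is that each partition creates one connection that is amortized over all of its records:

```python
# Framework-free sketch of the foreachRDD + foreachPartition pattern.
# In Spark Streaming this would roughly be:
#   dstream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))
# MockSolrClient is a hypothetical stand-in for a real Solr client.

clients_opened = 0

class MockSolrClient:
    def __init__(self):
        global clients_opened
        clients_opened += 1  # track how many connections get created

    def add(self, doc):
        pass  # stand-in for an index call

    def close(self):
        pass

def write_partition(records):
    # One client per partition amortizes connection setup
    # over all records in that partition.
    client = MockSolrClient()
    for doc in records:
        client.add(doc)
    client.close()

# A micro-batch modeled as three partitions holding six records total.
micro_batch = [["a", "b"], ["c"], ["d", "e", "f"]]
for partition in micro_batch:
    write_partition(iter(partition))

print(clients_opened)  # 3 clients for 3 partitions, not 6 for 6 records
```

If I have this right, the partitions are processed in parallel across executors, and the connection cost scales with the number of partitions rather than the number of records.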

Another scenario: I've done a few transformations and want to send a result
somewhere, e.g. write a message into a socket.  Say I have one socket per
client of my streaming app, I receive the host:port of that socket as part
of the message, and I want to send the response via that socket.  Is
foreachPartition still the better choice?
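If it is, I imagine the per-partition work would look something like the sketch below: group the records in a partition by their host:port destination, then open one connection per destination per partition instead of one per record. FakeSocket and the tuple layout are hypothetical stand-ins, not anything from a real API:

```python
# Sketch for the per-client-socket scenario: each record carries its own
# host:port destination. Within a partition, group records by destination
# so each target gets one connection per partition rather than one per
# record. FakeSocket is a hypothetical stand-in for a real TCP socket.
from collections import defaultdict

connections_opened = 0

class FakeSocket:
    def __init__(self, host, port):
        global connections_opened
        connections_opened += 1  # track how many sockets get opened

    def send(self, payload):
        pass  # stand-in for a real write to the socket

    def close(self):
        pass

def handle_partition(records):
    # records are (host, port, payload) tuples, as foreachPartition
    # would hand them to us.
    by_target = defaultdict(list)
    for host, port, payload in records:
        by_target[(host, port)].append(payload)
    for (host, port), payloads in by_target.items():
        sock = FakeSocket(host, port)  # one socket per destination
        for p in payloads:
            sock.send(p)
        sock.close()

partition = [("h1", 9000, "a"), ("h1", 9000, "b"), ("h2", 9001, "c")]
handle_partition(iter(partition))

print(connections_opened)  # 2 destinations -> 2 sockets, not 3
```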

On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:

> These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives.
>
> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
>
>> Is there a set of best practices for when to use foreachPartition vs.
>> foreachRDD?
>>
>> Is it generally true that using foreachPartition avoids some of the
>> over-network data shuffling overhead?
>>
>> When would I definitely want to use one method vs. the other?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
