@Evo There is no foreachRDD operation on RDDs; it is a method of
DStream, and it hands you each RDD in the stream. RDD has foreach and
foreachPartition, which hand you the elements of an RDD. What do you
mean when you say it 'works' to call foreachRDD on an RDD?

@Dmitry Are you asking about foreach vs. foreachPartition? That's quite
a different question. foreachPartition does not give you more parallelism,
but it lets you operate on a whole partition of data at once, which is
nice if you need to allocate some expensive resource to do the processing.
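
To make the distinction concrete, here is a rough sketch of how the two calls usually nest, assuming the PySpark API (DStream.foreachRDD and RDD.foreachPartition). BatchingClient, write_partition and attach_sink are hypothetical names made up for illustration; the client stands in for whatever expensive resource you need, such as a Solr or socket connection.

```python
from typing import Iterator, List


class BatchingClient:
    """Hypothetical stand-in for an expensive resource (e.g. a Solr or socket client)."""

    opened = 0  # local illustration only: counts how many clients were created

    def __init__(self) -> None:
        BatchingClient.opened += 1
        self.sent: List[str] = []
        self.closed = False

    def send(self, record: str) -> None:
        self.sent.append(record)

    def close(self) -> None:
        self.closed = True


def write_partition(records: Iterator[str]) -> None:
    """Called once per partition: a single client serves every record in it."""
    client = BatchingClient()
    try:
        for record in records:
            client.send(record)
    finally:
        client.close()


def attach_sink(stream) -> None:
    """`stream` is assumed to be a DStream.

    foreachRDD hands over the RDD of each batch; foreachPartition then
    hands write_partition an iterator over each partition of that RDD.
    """
    stream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))
```

With plain foreach, the same function would have to create a client per element instead of per partition. The `opened` counter only makes sense when everything runs in one process; on a real cluster each executor would have its own copy.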

On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
<dgoldenberg...@gmail.com> wrote:
> "These are quite different operations. One operates on RDDs in a DStream and
> one operates on partitions of an RDD. They are not alternatives."
>
> Sean, different operations as they are, they can certainly be used on the
> same data set.  In that sense, they are alternatives. Code can be written
> using one or the other which reaches the same effect - likely at a different
> efficiency cost.
>
> The question is, what are the effects of applying one vs. the other?
>
> My specific scenario is, I'm streaming data out of Kafka.  I want to perform
> a few transformations then apply an action which results in e.g. writing
> this data to Solr.  According to Evo, my best bet is foreachPartition
> because of increased parallelism (which I'd need to grok to understand the
> details of what that means).
>
> Another scenario is, I've done a few transformations and want to send a
> result somewhere, e.g. I write a message into a socket.  Let's say I have one
> socket per client of my streaming app and I get a host:port of that socket
> as part of the message and want to send the response via that socket.  Is
> foreachPartition still a better choice?
>
> On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> These are quite different operations. One operates on RDDs in a DStream and
>> one operates on partitions of an RDD. They are not alternatives.
>>
>>
>> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
>>>
>>> Is there a set of best practices for when to use foreachPartition vs.
>>> foreachRDD?
>>>
>>> Is it generally true that using foreachPartition avoids some of the
>>> over-network data shuffling overhead?
>>>
>>> When would I definitely want to use one method vs. the other?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
