@Evo There is no foreachRDD operation on RDDs; it is a method of DStream, and it hands you each RDD in the stream. RDD has foreach and foreachPartition. These give you the elements of an RDD, either one at a time or one partition's worth at a time. What do you mean when you say it 'works' to call foreachRDD on an RDD?
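
To make the levels concrete, here is a rough sketch in plain Python. It is not the Spark API: it only models a stream as a list of RDDs and an RDD as a list of partitions, and the function names are made up to mirror the Spark ones.

```python
# Plain-Python model of the nesting only; this is NOT the Spark API.
# A "stream" is a list of RDDs (one per batch), an "RDD" is a list of
# partitions, and a partition is a list of elements.
# All function names below are hypothetical stand-ins.

def foreach_rdd(stream, f):
    """Stream level: call f once per RDD in the stream."""
    for rdd in stream:
        f(rdd)

def foreach_partition(rdd, f):
    """RDD level: call f once per partition, with an iterator over its elements."""
    for partition in rdd:
        f(iter(partition))

def foreach(rdd, f):
    """RDD level: call f once per element."""
    for partition in rdd:
        for element in partition:
            f(element)

stream = [
    [[1, 2], [3]],      # first RDD: two partitions
    [[4], [5, 6, 7]],   # second RDD: two partitions
]

rdd_calls, partition_calls, element_calls = [], [], []

def handle_rdd(rdd):
    rdd_calls.append(rdd)
    foreach_partition(rdd, lambda it: partition_calls.append(list(it)))
    foreach(rdd, element_calls.append)

foreach_rdd(stream, handle_rdd)

print(len(rdd_calls), len(partition_calls), len(element_calls))  # prints: 2 4 7
```

In this model foreach_rdd only makes sense on the stream, while foreach and foreach_partition only make sense on a single RDD, which is why the two are typically nested rather than swapped for one another.
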
@Dmitry Are you asking about foreach vs. foreachPartition? That's quite a different question. foreachPartition does not give you more parallelism, but it lets you operate on a whole partition of data at once, which is nice if you need to allocate some expensive resource to do the processing.

On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
> "These are quite different operations. One operates on RDDs in DStream and
> one operates on partitions of an RDD. They are not alternatives."
>
> Sean, different operations as they are, they can certainly be used on the
> same data set. In that sense, they are alternatives. Code can be written
> using one or the other which reaches the same effect - likely at a different
> efficiency cost.
>
> The question is, what are the effects of applying one vs. the other?
>
> My specific scenario is, I'm streaming data out of Kafka. I want to perform
> a few transformations then apply an action which results in e.g. writing
> this data to Solr. According to Evo, my best bet is foreachPartition
> because of increased parallelism (which I'd need to grok to understand the
> details of what that means).
>
> Another scenario is, I've done a few transformations and send a result
> somewhere, e.g. I write a message into a socket. Let's say I have one
> socket per a client of my streaming app and I get a host:port of that socket
> as part of the message and want to send the response via that socket. Is
> foreachPartition still a better choice?
>
> On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> These are quite different operations. One operates on RDDs in DStream and
>> one operates on partitions of an RDD. They are not alternatives.
>>
>>
>> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
>>>
>>> Is there a set of best practices for when to use foreachPartition vs.
>>> foreachRDD?
>>>
>>> Is it generally true that using foreachPartition avoids some of the
>>> over-network data shuffling overhead?
>>>
>>> When would I definitely want to use one method vs. the other?
>>>
>>> Thanks.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
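
To illustrate the point about foreachPartition and expensive resources, here is a small sketch in plain Python rather than Spark. FakeConnection is a hypothetical stand-in for something like a Solr client or a socket, and an RDD is modelled as a plain list of partitions, so treat this as an illustration of the allocation pattern only.

```python
# Plain-Python illustration of the allocation pattern; NOT the Spark API.
# FakeConnection is hypothetical: it stands in for an expensive client and
# only counts how many times it has been opened.

class FakeConnection:
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)

    def close(self):
        pass

def write_per_element(rdd):
    """foreach-style: a new connection for every element."""
    for partition in rdd:
        for record in partition:
            conn = FakeConnection()
            conn.send(record)
            conn.close()

def write_per_partition(rdd):
    """foreachPartition-style: one connection per partition, reused for its records."""
    for partition in rdd:
        conn = FakeConnection()
        try:
            for record in partition:
                conn.send(record)
        finally:
            conn.close()

rdd = [["a", "b", "c"], ["d", "e"]]  # one RDD modelled as two partitions

FakeConnection.opened = 0
write_per_element(rdd)
per_element = FakeConnection.opened    # 5 connections, one per record

FakeConnection.opened = 0
write_per_partition(rdd)
per_partition = FakeConnection.opened  # 2 connections, one per partition
```

The sketch runs sequentially, so it says nothing about parallelism. The only difference it shows is how often the expensive resource is created.
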