"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives."
Sean, different operations as they are, they can certainly be applied to the same data set, and in that sense they are alternatives: code written with one or the other can achieve the same effect, likely at a different efficiency cost. The question is, what are the effects of applying one vs. the other?

My specific scenario is this: I'm streaming data out of Kafka, and I want to perform a few transformations and then apply an action that results in, e.g., writing the data to Solr. According to Evo, my best bet is foreachPartition because of increased parallelism (which I'd need to grok to understand in detail).

Another scenario: I've done a few transformations and want to send a result somewhere, e.g. write a message into a socket. Say I have one socket per client of my streaming app, I receive the host:port of that socket as part of the message, and I want to send the response via that socket. Is foreachPartition still the better choice? (Rough sketches of what I have in mind are below the quoted thread.)

On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:

> These are quite different operations. One operates on RDDs in DStream and
> one operates on partitions of an RDD. They are not alternatives.
>
> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg...@gmail.com> wrote:
>
>> Is there a set of best practices for when to use foreachPartition vs.
>> foreachRDD?
>>
>> Is it generally true that using foreachPartition avoids some of the
>> over-network data shuffling overhead?
>>
>> When would I definitely want to use one method vs. the other?
>>
>> Thanks.
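For concreteness, here's how I currently picture the two composing rather than competing, Kafka in, Solr out. This is only a sketch of my understanding; SolrSink and toSolrDoc are placeholders I'm inventing, not a real API:

import org.apache.spark.streaming.dstream.DStream

// Rough sketch; SolrSink and toSolrDoc are placeholders, not a real API.
def indexToSolr(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // this closure runs on the driver, once per batch interval
    rdd.foreachPartition { records =>
      // this closure runs on the executors, once per RDD partition,
      // so connection setup is paid per partition rather than per record
      val client = SolrSink.getClient()  // placeholder, e.g. a pooled client
      records.foreach(r => client.add(toSolrDoc(r)))
      client.commit()
    }
  }

And for the per-client socket case, I imagine still using foreachPartition but grouping the records by their host:port within each partition, so each endpoint gets one connection per partition rather than one per message. Again just a sketch; extractHostPort and renderResponse are placeholders I'm inventing:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.streaming.dstream.DStream

def replyPerClient(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // note: toSeq buffers the whole partition in memory to group it
      records.toSeq.groupBy(extractHostPort).foreach {
        case ((host, port), msgs) =>
          val socket = new Socket(host, port)
          val out = new PrintWriter(socket.getOutputStream, true)
          msgs.foreach(m => out.println(renderResponse(m)))
          out.close()
          socket.close()
      }
    }
  }

Does foreachPartition still buy much here, given the writes end up being per-client rather than per-partition?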