Hi everybody,

There are cases in which I can obtain the same result by using either mapPartitions or foreach.
For example, in a typical MapReduce approach one would call reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of (key, value) tuples. I think it is possible to achieve the same result with, for instance, an array of accumulators, where each index corresponds to a key and every executor adds its partial sum at that index.

Since reduceByKey performs a shuffle (with the shuffle data written to disk), I think that, where it is applicable, the foreach approach should perform better, even though foreach relies on the side effect of adding values to an accumulator.

I am writing to check whether my reasoning is correct. I hope I was clear.

Thank you,
Beniamino
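P.S. To make this concrete, here is a rough sketch of the two approaches I have in mind. All names here are made up for illustration, and I am assuming Spark 2.x's longAccumulator API and integer keys drawn from a small, known range:

  import org.apache.spark.{SparkConf, SparkContext}

  object ForeachVsReduceByKey {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("foreach-vs-reduceByKey").setMaster("local[*]"))

      // Hypothetical input: integer keys in the range [0, numKeys).
      val numKeys = 4
      val events  = sc.parallelize(Seq(0, 1, 2, 1, 3, 0, 1), 2)

      // Approach 1: mapPartitions + reduceByKey (incurs a shuffle).
      val viaShuffle = events
        .mapPartitions(iter => iter.map(k => (k, 1L)))
        .reduceByKey(_ + _)
        .collectAsMap()

      // Approach 2: foreach + one accumulator per key (no shuffle;
      // partial sums travel back to the driver as accumulator updates).
      val accs = Array.tabulate(numKeys)(i => sc.longAccumulator(s"key-$i"))
      events.foreach(k => accs(k).add(1L))
      val viaAccumulators =
        accs.zipWithIndex.map { case (a, i) => i -> a.value.longValue }.toMap

      println(s"reduceByKey:  $viaShuffle")
      println(s"accumulators: $viaAccumulators")
      sc.stop()
    }
  }

One thing that makes me think foreach is safe here: accumulator updates performed inside an action such as foreach are applied exactly once per task even if a task is retried, unlike updates made inside transformations. The obvious limitation of the second approach is that the set of keys must be small and known up front.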