Hi everybody,

There are cases in which I can obtain the same result by using either mapPartitions or foreach.
For example, in a typical MapReduce approach one would call reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of (key, value) tuples. I think it is possible to achieve the same result with, for instance, an array of accumulators, where each index corresponds to a key and every executor adds its partial sum at that index.

Since reduceByKey performs a shuffle (with the shuffle data written to disk), I think that, where it is applicable, the foreach approach should perform better, even though foreach relies on the side effect of adding values to an accumulator.

I am writing to check whether my reasoning is correct. I hope I was clear.

Thank you,
Beniamino
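P.S. To make this concrete, here is a rough sketch of the two approaches I have in mind. All names here are made up for illustration, and I am assuming Spark 2.x's longAccumulator API and integer keys drawn from a small, known range:

  import org.apache.spark.{SparkConf, SparkContext}

  object ForeachVsReduceByKey {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("foreach-vs-reduceByKey").setMaster("local[*]"))

      // Hypothetical input: integer keys in the range [0, numKeys).
      val numKeys = 4
      val events  = sc.parallelize(Seq(0, 1, 2, 1, 3, 0, 1), 2)

      // Approach 1: mapPartitions + reduceByKey (incurs a shuffle).
      val viaShuffle = events
        .mapPartitions(iter => iter.map(k => (k, 1L)))
        .reduceByKey(_ + _)
        .collectAsMap()

      // Approach 2: foreach + one accumulator per key (no shuffle;
      // partial sums travel back to the driver as accumulator updates).
      val accs = Array.tabulate(numKeys)(i => sc.longAccumulator(s"key-$i"))
      events.foreach(k => accs(k).add(1L))
      val viaAccumulators =
        accs.zipWithIndex.map { case (a, i) => i -> a.value.longValue }.toMap

      println(s"reduceByKey:  $viaShuffle")
      println(s"accumulators: $viaAccumulators")
      sc.stop()
    }
  }

One thing that makes me think foreach is safe here: accumulator updates performed inside an action such as foreach are applied exactly once per task even if a task is retried, unlike updates made inside transformations. The obvious limitation of the second approach is that the set of keys must be small and known up front.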