Re: How does the Spark Accumulator work under the covers?

2014-10-10 Thread HARIPRIYA AYYALASOMAYAJULA
If you use parallelize, the data is distributed across the available nodes,
a partial sum is computed within each partition, and the partial sums are
later merged. The driver manages the entire process. Is my understanding
correct? Can someone please correct me if I am wrong?
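
Concretely, something like this is what I have in mind (a minimal sketch
against the Spark 1.x API; the app name, master, and numbers are made up):

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("acc-sum").setMaster("local[2]"))

    val sum = sc.accumulator(0)            // created on the driver
    val rdd = sc.parallelize(1 to 100, 4)  // 4 partitions across the nodes

    // foreach is an action, so the update actually runs (a bare map would
    // stay lazy). Each task adds its partition's elements to a local copy,
    // and the per-task results are merged back on the driver.
    rdd.foreach(x => sum += x)

    println(sum.value)                     // 5050, readable only on the driver
    sc.stop()
  }
}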

On Fri, Oct 10, 2014 at 9:37 AM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) 
abaghdasa...@bloomberg.net wrote:

 Hello,
 I was wondering what the Spark accumulator does under the covers.
 I’ve implemented my own associative addInPlace function for the
 accumulator; where is this function run? Let’s say you call something
 like myRdd.map(x => sum += x). Is “sum” being accumulated locally in any
 way, per element, partition, or node? Is “sum” a broadcast variable?
 Or does it only exist on the driver node? How does the driver node get
 access to “sum”?
 Thanks,
 Areg




-- 
Regards,
Haripriya Ayyalasomayajula


Re: How does the Spark Accumulator work under the covers?

2014-10-10 Thread Jayant Shekhar
Hi Areg,

Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators

val sum = sc.accumulator(0)  // accumulator created from an initial value in the driver

The accumulator variable is created in the driver. Tasks running on the
cluster can then add to it. However, they cannot read its value. Only the
driver program can read the accumulator’s value, using its value method.

sum.value  // in the driver

 myRdd.map(x => sum += x)
 where is this function run?
This is run by the tasks on the workers.

The driver collects the accumulator updates from the various workers and
merges them to get the final result, as Haripriya mentioned.
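
To make that concrete, here is a sketch of where your custom addInPlace
fits in (assuming the Spark 1.x AccumulatorParam API; the vector element
type and the usage names below are just for illustration):

import org.apache.spark.AccumulatorParam

// A custom accumulator over vectors of doubles (Spark 1.x style).
object VectorAccumulatorParam extends AccumulatorParam[Vector[Double]] {
  // Runs in the tasks to fold each update into the per-task accumulator,
  // and on the driver to merge the per-task results together.
  def addInPlace(v1: Vector[Double], v2: Vector[Double]): Vector[Double] =
    v1.zip(v2).map { case (a, b) => a + b }

  // Identity element used to initialize each task's local accumulator.
  def zero(initial: Vector[Double]): Vector[Double] =
    initial.map(_ => 0.0)
}

// On the driver:
//   val vecSum = sc.accumulator(Vector.fill(3)(0.0))(VectorAccumulatorParam)
//   myRdd.foreach(v => vecSum += v)  // += runs in the tasks
//   vecSum.value                     // readable only on the driver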

Thanks,
Jayant

