And Lord Joe, you were right: future versions did protect accumulators in actions. I wonder if anyone has a "modern" take on the accumulator vs. aggregate question. It seems like if I need to do it by key, or to control partitioning, I would use aggregate (a quick sketch of the by-key case below).
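A minimal sketch of what I mean by the by-key case, assuming the results can be keyed somehow; addResult() and merge() are placeholder names for whatever DropEvaluation would actually expose:

    import org.apache.spark.api.java.JavaPairRDD;

    JavaPairRDD<String, DropResult> keyed = /* however the results get keyed */ null;

    // one DropEvaluation per key; Spark hands each new key a fresh copy of the
    // zero value, so mutating it inside the lambdas is safe
    JavaPairRDD<String, DropEvaluation> perKey = keyed.aggregateByKey(
            new DropEvaluation(),
            (eval, result) -> { eval.addResult(result); return eval; },  // fold one result in
            (a, b) -> { a.merge(b); return a; });                        // merge partial results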
Bottom line question / reason for post: I wonder if anyone has more ideas about using aggregate instead. Am I right to think an accumulable's value is always present on the driver, whereas an aggregate's result needs to be pulled back to the driver manually?

Details: both give me a way to write custom add and merge operations. For example, here is the class I am stubbing out:

    import org.apache.spark.AccumulableParam;

    class DropEvalAccumulableParam implements AccumulableParam<DropEvaluation, DropResult> {

        // Add additional data to the accumulator value. Is allowed to modify and
        // return r (the first value) for efficiency, to avoid allocating objects.
        @Override
        public DropEvaluation addAccumulator(DropEvaluation dropEvaluation, DropResult dropResult) {
            return null; // TODO: fold dropResult into dropEvaluation and return it
        }

        // Merge two accumulated values together. Is allowed to modify and return
        // the first value for efficiency, to avoid allocating objects.
        @Override
        public DropEvaluation addInPlace(DropEvaluation masterDropEval, DropEvaluation r1) {
            return null; // TODO: merge r1 into masterDropEval and return it
        }

        // Return the "zero" (identity) value for an accumulator type, given its
        // initial value. For example, if R were a vector of N dimensions, this
        // would return a vector of N zeroes.
        @Override
        public DropEvaluation zero(DropEvaluation dropEvaluation) {
            // technically the "additive identity" of a DropEvaluation would be
            // an empty DropEvaluation
            return dropEvaluation;
        }
    }
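To make the comparison concrete, here is a minimal sketch of both call sites as I currently understand them, assuming a JavaSparkContext and an RDD of DropResult are in scope, and again using the placeholder addResult()/merge() helpers on DropEvaluation:

    import org.apache.spark.Accumulable;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = /* existing context */ null;
    JavaRDD<DropResult> results = /* however the results are produced */ null;

    // Accumulable: created on (and owned by) the driver; tasks ship deltas
    // back, and the current value is only readable on the driver.
    Accumulable<DropEvaluation, DropResult> acc =
            sc.accumulable(new DropEvaluation(), new DropEvalAccumulableParam());
    results.foreach(acc::add);  // foreach is an action, so each task's update counts once
    DropEvaluation viaAccumulable = acc.value();

    // aggregate: also an action; the merged value is simply the return value,
    // so "pulling it back to the driver" is just making the call.
    DropEvaluation viaAggregate = results.aggregate(
            new DropEvaluation(),
            (eval, result) -> { eval.addResult(result); return eval; },  // within a partition
            (a, b) -> { a.merge(b); return a; });                        // across partitions

If that is right, the accumulable mainly buys visibility on the driver while the job runs, and aggregate mainly buys the per-key / partitioning control mentioned above. Corrections welcome.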