And Lord, Joe, you were right: future versions did protect accumulators in
actions. I wonder if anyone has a "modern" take on the accumulator-vs.-
aggregate question. It seems that if I need to do the work by key, or
control partitioning, I would use aggregate.
Bottom-line question / reason for this post: does anyone have more ideas
about using aggregate instead? Am I right to think that an accumulable's
value is always present on the driver, whereas an aggregate's result needs
to be pulled back to the driver manually?
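To make the distinction I am asking about concrete, here is a minimal plain-Java sketch (no Spark; the class and method names are illustrative, not Spark API). The accumulator shape keeps a mutable holder that the driver owns and tasks add into, so the merged value is simply sitting on the driver afterwards; the aggregate shape folds each partition with a seqOp, merges partials with a combOp, and *returns* the result, which the caller must capture:

```java
import java.util.Arrays;
import java.util.List;

public class AggregateVsAccumulator {

    // Accumulator-style: the driver owns this holder; each task adds
    // into it, and the total is simply present on the driver afterwards.
    static final class IntAccumulator {
        int value;
        void add(int x) { value += x; }
        void merge(IntAccumulator other) { value += other.value; }
    }

    // Aggregate-style: fold each "partition" with a seqOp, merge the
    // partial results with a combOp, and return the final value. The
    // caller has to capture the return value; nothing is stored for it.
    static int aggregate(List<List<Integer>> partitions, int zero) {
        int merged = zero;
        for (List<Integer> partition : partitions) {
            int partial = zero;
            for (int x : partition) {
                partial = partial + x;  // seqOp: fold element into partial
            }
            merged = merged + partial;  // combOp: merge partial into total
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions =
                Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5));
        System.out.println(aggregate(partitions, 0));
    }
}
```

In real Spark both shapes end up on the driver either way; the difference is whether the framework delivers the merged value to you (accumulable) or you receive it as an ordinary return value (aggregate).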
Details:
Both give me the option to write custom add and merge operations. For
example, here is the class I am stubbing out:
class DropEvalAccumulableParam implements
        AccumulableParam<DropEvaluation, DropResult> {

    // Add additional data to the accumulator value. Allowed to modify
    // and return r for efficiency (to avoid allocating objects).
    // r is the first value.
    @Override
    public DropEvaluation addAccumulator(DropEvaluation dropEvaluation,
            DropResult dropResult) {
        return null; // TODO: fold dropResult into dropEvaluation
    }

    // Merge two accumulated values together. Allowed to modify and
    // return the first value for efficiency (to avoid allocating objects).
    @Override
    public DropEvaluation addInPlace(DropEvaluation masterDropEval,
            DropEvaluation r1) {
        return null; // TODO: merge r1 into masterDropEval
    }

    // Return the "zero" (identity) value for an accumulator type, given
    // its initial value. For example, if R was a vector of N dimensions,
    // this would return a vector of N zeroes.
    @Override
    public DropEvaluation zero(DropEvaluation dropEvaluation) {
        // technically the "additive identity" of a DropEvaluation would be
        return dropEvaluation;
    }
}
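For comparison, the same three operations map directly onto aggregate's (zeroValue, seqOp, combOp) triple. Below is a sketch of that mapping, simulated over plain lists standing in for partitions; the DropEvaluation/DropResult classes here are hypothetical minimal stand-ins (the real ones presumably carry more state), and the method names are mine, not Spark's:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal stand-ins for the classes in the post.
class DropResult {
    final boolean dropped;
    DropResult(boolean dropped) { this.dropped = dropped; }
}

class DropEvaluation {
    int total;
    int drops;
}

public class DropEvalAggregateSketch {

    // zero: the additive identity, analogous to AccumulableParam.zero().
    static DropEvaluation zero() { return new DropEvaluation(); }

    // seqOp: fold one DropResult into a partial DropEvaluation,
    // analogous to addAccumulator(r, t).
    static DropEvaluation seqOp(DropEvaluation eval, DropResult r) {
        eval.total += 1;
        if (r.dropped) eval.drops += 1;
        return eval; // mutate-and-return, as the javadoc permits
    }

    // combOp: merge two partial evaluations, analogous to addInPlace(r1, r2).
    static DropEvaluation combOp(DropEvaluation a, DropEvaluation b) {
        a.total += b.total;
        a.drops += b.drops;
        return a;
    }

    // What rdd.aggregate(zero, seqOp, combOp) would compute, simulated
    // over plain lists standing in for partitions.
    static DropEvaluation aggregate(List<List<DropResult>> partitions) {
        DropEvaluation merged = zero();
        for (List<DropResult> part : partitions) {
            DropEvaluation partial = zero();
            for (DropResult r : part) partial = seqOp(partial, r);
            merged = combOp(merged, partial);
        }
        return merged; // the caller receives this value on the driver
    }
}
```

So the custom add/merge logic is identical either way; the choice seems to come down to delivery (accumulable value maintained for you vs. aggregate result returned to you) and whether key-level or partition-level control is needed.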
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p26456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.