You should *never* use accumulators for this purpose because you may get incorrect answers. When a task is re-executed or an RDD is recomputed, its accumulator updates can be applied again, so the same records get counted multiple times and you cannot rely upon the correctness of the values they compute. See SPARK-732 <https://issues.apache.org/jira/browse/SPARK-732> for more info.
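To make the difference concrete, here is a rough sketch in plain Python, not Spark code. It only imitates the shape of an aggregate call (a zero value, a per-partition seqOp, and a combOp that merges partials) and simulates a re-run task by hand. The names aggregate, run_task, side_effect_counter and the sample partitions are all made up for illustration.

```python
from functools import reduce

# Running statistics carried as (count, sum, sum of squares).
ZERO = (0, 0.0, 0.0)

def seq_op(acc, x):
    """Fold one element into a partition-local partial result."""
    n, s, sq = acc
    return (n + 1, s + x, sq + x * x)

def comb_op(a, b):
    """Merge the partial results of two partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def aggregate(partitions, zero, seq, comb):
    """Hypothetical stand-in for an aggregate over partitioned data."""
    partials = [reduce(seq, part, zero) for part in partitions]
    return reduce(comb, partials, zero)

# Hypothetical data split into two partitions.
partitions = [[1.0, 2.0], [3.0, 4.0]]

n, s, sq = aggregate(partitions, ZERO, seq_op, comb_op)
mean = s / n
variance = sq / n - mean * mean  # population variance

# Simulated accumulator-style side effect: every task attempt bumps a
# shared counter, including attempts whose results are thrown away.
side_effect_counter = 0

def run_task(part):
    global side_effect_counter
    side_effect_counter += len(part)
    return reduce(seq_op, part, ZERO)

run_task(partitions[0])  # first attempt; pretend its result was lost
partials = [run_task(p) for p in partitions]  # attempts that are kept
retried_n = reduce(comb_op, partials, ZERO)[0]

print(n, mean, variance)               # aggregate-style result
print(retried_n, side_effect_counter)  # returned count vs. side-effect count
```

The returned partials are merged exactly once per partition, so the count stays at the number of elements even with the extra attempt. The side-effect counter also includes the discarded attempt, which is the kind of double counting I mean.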
On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L <[email protected]> wrote:

> Hi All.
>
> I am trying to get my head around why using accumulators and accumulables
> seems to be the most recommended method for accumulating running sums,
> averages, variances and the like, whereas the aggregate method seems to me
> to be the right one. I have no performance measurements as of yet, but it
> seems that aggregate is simpler and more intuitive (And it does what one
> might expect an accumulator to do) whereas the accumulators and
> accumulables seem to have some extra complications and overhead.
>
> So…
>
> What’s the real difference between an accumulator/accumulable and
> aggregating an RDD? When is one method of aggregation preferred over the
> other?
>
> Thanks,
>
> Nate

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: [email protected] W: www.velos.io
