Enrico Minack created SPARK-30666:
-------------------------------------

             Summary: Reliable single-stage accumulators
                 Key: SPARK-30666
                 URL: https://issues.apache.org/jira/browse/SPARK-30666
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Enrico Minack


This proposes a pragmatic improvement that allows for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / RDD 
produces identical results on each run, code incrementing accumulators, even 
non-deterministic code, also produces identical accumulator increments on 
success. Rerunning a partition for any reason should therefore always produce 
the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared against earlier increments. Depending on the strategy by which a new 
increment supersedes an earlier increment from the same partition, different 
accumulator semantics (here called accumulator modes) can be implemented:
 - SUM over all increments of each partition: this represents the current 
accumulator implementation.
 - MAX over all increments of each partition: assuming accumulators only ever 
increase while a partition is processed, a successful task provides an 
accumulator value that is at least as large as that of any failed task, so it 
supersedes any failed task's value. This produces reliable accumulator values. 
This mode should only be used within a single stage.
 - LAST increment: retrieves only the latest increment for each partition.
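
The three modes above can be sketched as per-partition merge strategies. The 
following is an illustrative model only, with hypothetical class and method 
names; it is not Spark's AccumulatorV2 API:

```python
from enum import Enum

class AccumulatorMode(Enum):
    SUM = "sum"    # current behavior: add every task attempt's increment
    MAX = "max"    # keep the largest increment seen per partition
    LAST = "last"  # keep only the most recent increment per partition

class PartitionedAccumulator:
    """Keeps one increment slot per partition so task reruns can be reconciled.

    Note: for MAX and LAST this per-partition map is the extra memory that
    scales with the number of partitions; plain SUM would not need it.
    """
    def __init__(self, mode):
        self.mode = mode
        self.increments = {}  # partition id -> merged increment

    def update(self, partition, increment):
        if self.mode is AccumulatorMode.SUM:
            self.increments[partition] = self.increments.get(partition, 0) + increment
        elif self.mode is AccumulatorMode.MAX:
            self.increments[partition] = max(self.increments.get(partition, 0), increment)
        else:  # LAST: a rerun's increment simply replaces the earlier one
            self.increments[partition] = increment

    @property
    def value(self):
        return sum(self.increments.values())

# A failed attempt counted partway (3), then the successful rerun (5):
acc = PartitionedAccumulator(AccumulatorMode.MAX)
acc.update(partition=0, increment=3)  # failed task's partial increment
acc.update(partition=0, increment=5)  # successful rerun supersedes it
# acc.value is 5 under MAX; the same sequence under SUM would double-count to 8.
```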

The implementations for MAX and LAST require extra memory that scales with the 
number of partitions. The current SUM implementation does not require extra 
memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
