[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-26 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / RDD 
produces identical results, even non-deterministic code produces identical 
accumulator increments on success. Rerunning a partition for any reason should 
therefore always produce the same increment for that partition on success.
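
As a quick illustration of why recomputation matters here (a minimal 
spark-shell sketch, not part of the proposed change): with plain accumulators, 
increments made inside a transformation are applied again whenever a partition 
is recomputed, e.g. when an uncached RDD is evaluated by a second action or a 
task is retried.

{code:scala}
// spark-shell, where sc is the predefined SparkContext
val acc = sc.longAccumulator("rows seen")
val rdd = sc.parallelize(1 to 100, 4).map { x => acc.add(1); x }

rdd.count()   // acc.value == 100
rdd.count()   // uncached partitions are recomputed -> acc.value == 200, not 100
{code}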

With this pragmatic approach, increments from individual partitions / tasks are 
merged into the accumulator on the driver side only the first time they arrive 
for a given partition (see the sketch below). This is useful for accumulators 
registered with {{countFailedValues == false}}. Hence, the accumulator 
aggregates each successful partition exactly once.

The implementations require extra memory that scales with the number of 
partitions.
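
A user-level approximation of this once-per-partition merge can be sketched on 
top of the existing {{AccumulatorV2}} API by keying increments on the partition 
id; the class below is hypothetical and only illustrates the intended 
semantics, whereas the proposal would do this bookkeeping on the driver side in 
Spark Core itself.

{code:scala}
import scala.collection.mutable
import org.apache.spark.TaskContext
import org.apache.spark.util.AccumulatorV2

// Hypothetical sketch of the once-per-partition semantics: each partition
// contributes at most one value, so recomputed or speculative tasks cannot
// inflate the total. Register on the driver via sc.register(acc, "name").
class OncePerPartitionLongAccumulator extends AccumulatorV2[Long, Long] {
  // partition id -> value accumulated by one successful run of that partition
  private val perPartition = mutable.Map.empty[Int, Long]

  override def isZero: Boolean = perPartition.isEmpty

  override def copy(): OncePerPartitionLongAccumulator = {
    val c = new OncePerPartitionLongAccumulator
    c.perPartition ++= perPartition
    c
  }

  override def reset(): Unit = perPartition.clear()

  // Executor side: attribute the increment to the partition being processed.
  override def add(v: Long): Unit = {
    val pid = TaskContext.getPartitionId()
    perPartition(pid) = perPartition.getOrElse(pid, 0L) + v
  }

  // Driver side: keep only the first value seen for each partition.
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: OncePerPartitionLongAccumulator =>
      o.perPartition.foreach { case (pid, v) =>
        if (!perPartition.contains(pid)) perPartition(pid) = v
      }
    case _ =>
      throw new UnsupportedOperationException("cannot merge incompatible accumulator")
  }

  // Each successful partition is counted exactly once.
  override def value: Long = perPartition.values.sum
}
{code}

The map holds one entry per partition, which is exactly the extra memory cost 
mentioned above.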

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is useful for accumulators registered with 
{{countFailedValues == false}}.
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments per partition on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are merged into the accumulator on the driver side only the first time they 
> arrive for a given partition. This is useful for accumulators registered with 
> {{countFailedValues == false}}. Hence, the accumulator aggregates each 
> successful partition exactly once.
> The implementations require extra memory that scales with the number of 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-18 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented (see the sketch below):
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is useful for accumulators registered with 
{{countFailedValues == false}}.
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.
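
To make the three modes concrete, the sketch below (hypothetical names, not 
proposed API) shows how the driver-side merge of a new per-partition update 
could differ per mode; the driver keeps one such slot per partition, which is 
the extra memory mentioned above.

{code:scala}
// Illustrative only: how ALL / FIRST / LARGEST could treat a new update for a
// partition that has already reported an earlier update.
sealed trait AccumulatorMode
case object ALL extends AccumulatorMode      // current behaviour: keep summing
case object FIRST extends AccumulatorMode    // keep the first update per partition
case object LARGEST extends AccumulatorMode  // keep the update with the largest cardinality

final case class Update(sum: Long, cardinality: Long)

def mergePartitionUpdate(mode: AccumulatorMode,
                         previous: Option[Update],
                         incoming: Update): Update =
  (mode, previous) match {
    case (_, None)          => incoming  // first update seen for this partition
    case (ALL, Some(p))     => Update(p.sum + incoming.sum,
                                      p.cardinality + incoming.cardinality)
    case (FIRST, Some(p))   => p         // later (re-)attempts are ignored
    case (LARGEST, Some(p)) => if (incoming.cardinality > p.cardinality) incoming else p
  }
{code}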

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is only useful for accumulators registered with 
{{countFailedValues == false}}.
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - {{ALL}} sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - {{FIRST}} increment: allows retrieving only the first accumulator value 
> for each partition. This is useful for accumulators registered with 
> {{countFailedValues == false}}.
>  - {{LARGEST}} over all increments of each partition: accumulators aggregate 
> multiple increments while a partition is processed, a 

[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-18 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is only useful for accumulators registered with 
{{countFailedValues == false}}.
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is only useful for accumulators registered with 
{{countFailedValues == false}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - {{ALL}} sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - {{FIRST}} increment: allows retrieving only the first accumulator value 
> for each partition. This is only useful for accumulators registered with 
> {{countFailedValues == false}}.
>  - {{LARGEST}} over all increments of each partition: accumulators aggregate 
> multiple increments while a partition is 

[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-18 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code produces identical 
accumulator increments on success. Rerunning partitions for any reason should 
always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - {{ALL}} sums over all increments of each partition: this represents the 
current implementation of accumulators
 - {{LARGEST}} over all increments of each partition: accumulators aggregate 
multiple increments while a partition is processed; a successful task provides 
the most accumulated value, which always has a larger cardinality than any 
accumulated value from a failed task, hence it supersedes any failed task's 
value. This produces reliable accumulator values. This does not require 
{{countFailedValues == false}}. This should only be used within a single stage. 
The name may be confused with {{MAX}}.
 - {{FIRST}} increment: allows retrieving only the first accumulator value for 
each partition. This is only useful for accumulators registered with 
{{countFailedValues == false}}.

The implementations for {{LARGEST}} and {{FIRST}} require extra memory that 
scales with the number of partitions. The current {{ALL}} implementation does 
not require extra memory.

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - ALL sums over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increase while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value from a failed task, 
hence it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used within a single stage.
 - LAST increment: allows retrieving only the latest increment for each 
partition.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current ALL implementation does not require extra 
memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code produces identical 
> accumulator increments on success. Rerunning partitions for any reason should 
> always produce the same increments on success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - {{ALL}} sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - {{LARGEST}} over all increments of each partition: accumulators aggregate 
> multiple increments while a partition is processed; a successful task 
> provides the most accumulated value, which always has a larger cardinality 
> than any accumulated value from a failed task, hence it supersedes any failed 
> task's value. This produces reliable accumulator values. This does not 
> require {{countFailedValues == false}}. This should only be used within a 
> single stage. The name may be confused with {{MAX}}.
>  - {{FIRST}} increment: allows to 

[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30666:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increase while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value from a failed task, 
> hence it supersedes any failed task's value. This produces reliable 
> accumulator values. This should only be used within a single stage.
>  - LAST increment: allows retrieving only the latest increment for each 
> partition.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.






[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Component/s: (was: SQL)
 Spark Core

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increase while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value from a failed task, 
> hence it supersedes any failed task's value. This produces reliable 
> accumulator values. This should only be used within a single stage.
>  - LAST increment: allows retrieving only the latest increment for each 
> partition.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.






[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-01-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30666:
--
Description: 
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented (see the sketch below):
 - ALL sums over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increase while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value from a failed task, 
hence it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used within a single stage.
 - LAST increment: allows retrieving only the latest increment for each 
partition.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current ALL implementation does not require extra 
memory.
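
For illustration of the MAX and LAST semantics above (the names below are 
hypothetical, not Spark API): the driver would keep one slot per partition and 
derive the accumulator value from those slots.

{code:scala}
// How a new update for an already-seen partition could be folded into its slot.
def mergeSlot(mode: String, previous: Option[Long], incoming: Long): Long = mode match {
  case "ALL"  => previous.getOrElse(0L) + incoming                      // current behaviour
  case "MAX"  => math.max(previous.getOrElse(Long.MinValue), incoming)  // largest attempt wins
  case "LAST" => incoming                                               // latest attempt wins
  case other  => sys.error(s"unknown accumulator mode: $other")
}

// The externally visible value combines the per-partition slots (here: a sum).
def accumulatorValue(slots: Map[Int, Long]): Long = slots.values.sum
{code}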

  was:
This proposes a pragmatic improvement to allow for reliable single-stage 
accumulators. Under the assumption that a given stage / partition / rdd 
produces identical results, non-deterministic code incrementing accumulators 
also produces identical accumulator increments on success. Rerunning partitions 
for any reason should always produce the same increments on success.

With this pragmatic approach, increments from individual partitions / tasks are 
compared to earlier increments. Depending on the strategy by which a new 
increment replaces or combines with an earlier increment from the same 
partition, different accumulator semantics (here called accumulator modes) can 
be implemented:
 - SUM over all increments of each partition: this represents the current 
implementation of accumulators
 - MAX over all increments of each partition: assuming accumulators only 
increase while a partition is processed, a successful task provides an 
accumulator value that is always larger than any value from a failed task, 
hence it supersedes any failed task's value. This produces reliable accumulator 
values. This should only be used within a single stage.
 - LAST increment: allows retrieving only the latest increment for each 
partition.

The implementation for MAX and LAST requires extra memory that scales with the 
number of partitions. The current SUM implementation does not require extra 
memory.


> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy by which a new 
> increment replaces or combines with an earlier increment from the same 
> partition, different accumulator semantics (here called accumulator modes) 
> can be implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increase while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value from a failed task, 
> hence it supersedes any failed task's value. This produces reliable 
> accumulator values. This should only be used within a single stage.
>  - LAST increment: allows retrieving only the latest increment for each 
> partition.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.


