[jira] [Commented] (SPARK-38973) When push-based shuffle is enabled, a stage may not complete when retried

2023-03-20 Thread Li Ying (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702556#comment-17702556
 ] 

Li Ying commented on SPARK-38973:
-

[~csingh] Should this bugfix be merged into 3.2.x branches?

> When push-based shuffle is enabled, a stage may not complete when retried
> -
>
> Key: SPARK-38973
> URL: https://issues.apache.org/jira/browse/SPARK-38973
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 3.2.0
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
> Fix For: 3.3.0
>
>
> With push-based shuffle enabled and adaptive merge finalization, there are 
> scenarios where a re-attempt of ShuffleMapStage may not complete. 
> With adaptive merge finalization, a stage may be triggered for finalization 
> when it is in the following state:
>  # The stage is *not* running (*not* in the _running_ set of the 
> DAGScheduler) because it has failed, been canceled, or is waiting, and
>  # The stage has no pending partitions (all the tasks completed at least 
> once).
> For such a stage, when the finalization completes, the stage will still not be 
> marked as _mergeFinalized_. 
> The state of the stage will be: 
>  * _stage.shuffleDependency.mergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask = finalizeTask_
>  * Merged statuses of the stage are unregistered
>  
> When the stage is resubmitted, the newer attempt of the stage will never 
> complete even though its tasks may be completed. This is because the newer 
> attempt of the stage will have _shuffleMergeEnabled = true_, since with 
> the previous attempt the stage was never marked as _mergeFinalized_, and 
> the _finalizeTask_ is present (from the finalization attempt for the previous 
> stage attempt).
>  
> So, when all the tasks of the newer attempt complete, these conditions 
> will be true:
>  * The stage will be running
>  * There will be no pending partitions since all the tasks completed
>  * _stage.shuffleDependency.shuffleMergeEnabled = true_
>  * _stage.shuffleDependency.shuffleMergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask_ is not empty
> These conditions lead the DAGScheduler to try scheduling finalization instead 
> of triggering the completion of the stage. However, because of the last 
> condition, it never even schedules the finalization, and the stage never 
> completes.
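
The sequence described in the issue can be traced with a small toy model. This is only a sketch in Python, not Spark code: the classes `ShuffleDep` and `Stage`, the functions `finalize_merge`, `on_all_tasks_done`, and `reproduce`, and the returned action strings are all hypothetical names invented for illustration. Only the three flags mirror the fields named in the description (_shuffleMergeEnabled_, _shuffleMergeFinalized_, _getFinalizeTask_), and the branching is an assumption reconstructed from the conditions listed above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class ShuffleDep:
    """Toy stand-in for the merge-related state of a shuffle dependency."""
    shuffle_merge_enabled: bool = True
    shuffle_merge_finalized: bool = False
    finalize_task: Optional[str] = None  # handle left by a scheduled finalization


@dataclass
class Stage:
    dep: ShuffleDep = field(default_factory=ShuffleDep)
    pending_partitions: Set[int] = field(default_factory=set)
    running: bool = False
    completed: bool = False


def finalize_merge(stage: Stage, attempt: int) -> None:
    """Adaptive finalization; its result only sticks if the stage is running."""
    stage.dep.finalize_task = f"finalize-attempt-{attempt}"
    if stage.running:
        stage.dep.shuffle_merge_finalized = True
    # For a non-running stage the flag stays False (merged statuses are
    # unregistered), but the finalize task handle is left in place.


def on_all_tasks_done(stage: Stage) -> str:
    """Decide what happens once a running stage has no pending partitions."""
    assert stage.running and not stage.pending_partitions
    dep = stage.dep
    if dep.shuffle_merge_enabled and not dep.shuffle_merge_finalized:
        if dep.finalize_task is None:
            return "schedule_finalization"
        # A finalize task already exists, so no new one is scheduled,
        # and completion is never triggered either.
        return "wait"
    stage.completed = True
    return "complete"


def reproduce() -> Tuple[Stage, str]:
    stage = Stage()                      # attempt 0: not running, nothing pending
    finalize_merge(stage, attempt=0)     # finalization is triggered anyway
    stage.running = True                 # attempt 1: the stage is resubmitted
    stage.pending_partitions = {0, 1}
    stage.pending_partitions.clear()     # all tasks of attempt 1 complete
    return stage, on_all_tasks_done(stage)


if __name__ == "__main__":
    stage, action = reproduce()
    print(action, stage.completed)
```

In this model the retried attempt ends in the "wait" branch with `completed` still false, because the finalize task left over from the first attempt blocks both a new finalization and stage completion.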



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38973) When push-based shuffle is enabled, a stage may not complete when retried

2022-04-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525402#comment-17525402
 ] 

Apache Spark commented on SPARK-38973:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/36293

> When push-based shuffle is enabled, a stage may not complete when retried
> -
>
> Key: SPARK-38973
> URL: https://issues.apache.org/jira/browse/SPARK-38973
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Priority: Major


