[ https://issues.apache.org/jira/browse/SPARK-38973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan resolved SPARK-38973.
-----------------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 36293
[https://github.com/apache/spark/pull/36293]

> When push-based shuffle is enabled, a stage may not complete when retried
> -------------------------------------------------------------------------
>
>                 Key: SPARK-38973
>                 URL: https://issues.apache.org/jira/browse/SPARK-38973
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>             Fix For: 3.3.0
>
>
> With push-based shuffle and adaptive merge finalization enabled, there are
> scenarios where a re-attempt of a ShuffleMapStage may never complete.
> With adaptive merge finalization, a stage may be triggered for finalization
> while it is in the following state:
> # The stage is *not* running ({*}not{*} in the DAGScheduler's _running_
> set) - it has failed, been canceled, or is waiting, and
> # The stage has no pending partitions (all of its tasks completed at least
> once).
> When finalization completes for such a stage, the stage is still not marked
> as {_}mergeFinalized{_}.
> The state of the stage will then be:
> * _stage.shuffleDependency.mergeFinalized = false_
> * _stage.shuffleDependency.getFinalizeTask = finalizeTask_
> * The merged statuses of the stage are unregistered
>
> When the stage is resubmitted, the new attempt of the stage never completes
> even though all of its tasks may finish. This is because the new attempt
> has {_}shuffleMergeEnabled = true{_} (the previous attempt was never marked
> {_}mergeFinalized{_}), and the _finalizeTask_ from the previous attempt's
> finalization is still present.
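The buggy state described above can be modeled with a small sketch. This is a hypothetical Python model, not Spark's actual Scala code: the field names (`mergeFinalized`, `finalizeTask`, `pendingPartitions`) mirror the internals named in the report, but the classes and the `finalize_merge` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ShuffleDependency:
    """Stand-in for the fields of Spark's ShuffleDependency named above."""
    shuffleMergeEnabled: bool = True
    mergeFinalized: bool = False
    finalizeTask: Optional[str] = None  # stand-in for the scheduled finalize task


@dataclass
class ShuffleMapStage:
    dep: ShuffleDependency = field(default_factory=ShuffleDependency)
    running: bool = False                       # not in the DAGScheduler's running set
    pendingPartitions: Set[int] = field(default_factory=set)


def finalize_merge(stage: ShuffleMapStage) -> None:
    """Model of adaptive finalization firing for a stage (hypothetical)."""
    stage.dep.finalizeTask = "finalize-task"    # a finalize task is recorded
    if stage.running:
        # Only a running stage gets marked finalized; a failed/canceled
        # stage is left with mergeFinalized = False -- the buggy state.
        stage.dep.mergeFinalized = True


# Stage failed or was canceled, but every task completed at least once.
stage = ShuffleMapStage(running=False)
finalize_merge(stage)
print(stage.dep.mergeFinalized, stage.dep.finalizeTask)
```

Running this leaves `mergeFinalized = False` while `finalizeTask` is set, which is exactly the state the next attempt inherits.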
>
> So, when all the tasks of the new attempt complete, all of these conditions
> hold:
> * The stage is running
> * There are no pending partitions, since all the tasks completed
> * _stage.shuffleDependency.shuffleMergeEnabled = true_
> * _stage.shuffleDependency.shuffleMergeFinalized = false_
> * _stage.shuffleDependency.getFinalizeTask_ is not empty
> This leads the DAGScheduler to try to schedule finalization rather than
> trigger completion of the stage. However, because of the last condition it
> never actually schedules the finalization, so the stage never completes.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
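The deadlock in the conditions above can be sketched as a single decision function. Again this is a hypothetical Python model of the decision described in the report, not the DAGScheduler's real code; the function name `on_all_tasks_done` and the returned action strings are illustrative assumptions.

```python
from types import SimpleNamespace


def on_all_tasks_done(stage) -> str:
    """Model of the completion decision described above (hypothetical)."""
    dep = stage.dep
    if not stage.running or stage.pendingPartitions:
        return "wait"                       # stage not ready to complete
    if dep.shuffleMergeEnabled and not dep.shuffleMergeFinalized:
        if dep.getFinalizeTask is None:
            return "schedule-finalization"  # the normal path
        # A finalize task left over from the previous attempt blocks
        # re-scheduling finalization, and the stage is never marked
        # complete either: the retried stage is stuck.
        return "no-op"
    return "mark-stage-complete"


# Retried attempt: merge enabled, not finalized, stale finalize task present.
retry = SimpleNamespace(
    running=True,
    pendingPartitions=set(),
    dep=SimpleNamespace(shuffleMergeEnabled=True,
                        shuffleMergeFinalized=False,
                        getFinalizeTask="stale-finalize-task"),
)
print(on_all_tasks_done(retry))  # prints "no-op": the stage hangs
```

Clearing the stale `getFinalizeTask` (or marking the dependency finalized before resubmission, as the fix in pull request 36293 addresses) lets the decision fall through to a productive branch instead of the `no-op`.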