mridulm commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974600104


##########
docs/configuration.md:
##########
@@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.stage.attempt.ignoreOnDecommissionFetchFailure</code></td>

Review Comment:
   `spark.stage.attempt.ignoreOnDecommissionFetchFailure` -> `spark.stage.ignoreOnDecommissionFetchFailure`



##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1867,18 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(taskScheduler, bmAddress)
+          if (ignoreStageFailure) {
+            logInfo("Ignoring fetch failure from $task of $failedStage attempt " +
+              s"${task.stageAttemptId} when count spark.stage.maxConsecutiveAttempts " +
+              "as executor ${bmAddress.executorId} is decommissioned and " +
+              s" ${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
+          }
+
           val shouldAbortStage =
-            failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
+            (!ignoreStageFailure &&
+              failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts) ||
             disallowStageRetryForTest

Review Comment:
   QQ: We are preventing the immediate failure from aborting the stage, but might we be effectively reducing the number of stage failures that can be tolerated?
   
   For example:
   attempt 0, attempt 1, attempt 2 failed due to decommission
   attempt 3 failed for other reasons -> job failed, because `failedAttemptIds` still records the three decommission attempts (assuming maxConsecutiveStageAttempts = 4)
   
   Is this the behavior we will now exhibit?
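   
   To make the scenario concrete, here is a minimal standalone sketch of the counting logic in this hunk. It does not use Spark: `FetchFailure`, `firstAbort`, and the `recordIgnoredAttempts` flag are hypothetical names introduced only for illustration. Setting `recordIgnoredAttempts = true` mirrors the diff as I read it (the attempt id is added before the ignore check); `false` is one possible alternative, not something this PR implements.

```scala
import scala.collection.mutable

object StageAbortSketch {
  // Hypothetical stand-in for a fetch failure reported for one stage attempt.
  final case class FetchFailure(stageAttemptId: Int, fromDecommissionedExecutor: Boolean)

  /** Returns the stage attempt id at which the stage would be aborted, if any. */
  def firstAbort(
      failures: Seq[FetchFailure],
      maxConsecutiveStageAttempts: Int,
      ignoreDecommissionFetchFailure: Boolean,
      recordIgnoredAttempts: Boolean): Option[Int] = {
    val failedAttemptIds = mutable.HashSet.empty[Int]
    failures.find { f =>
      val ignoreStageFailure =
        ignoreDecommissionFetchFailure && f.fromDecommissionedExecutor
      // The diff adds the attempt id unconditionally (recordIgnoredAttempts = true).
      if (!ignoreStageFailure || recordIgnoredAttempts) {
        failedAttemptIds += f.stageAttemptId
      }
      // Same shape as shouldAbortStage in the diff, minus disallowStageRetryForTest.
      !ignoreStageFailure && failedAttemptIds.size >= maxConsecutiveStageAttempts
    }.map(_.stageAttemptId)
  }

  def main(args: Array[String]): Unit = {
    val failures = Seq(
      FetchFailure(0, fromDecommissionedExecutor = true),
      FetchFailure(1, fromDecommissionedExecutor = true),
      FetchFailure(2, fromDecommissionedExecutor = true),
      FetchFailure(3, fromDecommissionedExecutor = false))

    // As in the diff: the first non-decommission failure aborts the stage.
    println(firstAbort(failures, 4, ignoreDecommissionFetchFailure = true,
      recordIgnoredAttempts = true))  // Some(3)

    // Hypothetical alternative: ignored attempts do not consume the budget.
    println(firstAbort(failures, 4, ignoreDecommissionFetchFailure = true,
      recordIgnoredAttempts = false)) // None
  }
}
```

   With the flag on and the ids still recorded, attempt 3 is the first failure unrelated to decommission, yet it aborts the stage immediately.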



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

