caican00 opened a new pull request, #37899:
URL: https://github.com/apache/spark/pull/37899

   ### What changes were proposed in this pull request?
   
   Abort the result stage directly when it fails due to a `FetchFailedException`.
   
   ### Why are the changes needed?
   
   This fixes a very serious bug:
   When a result stage fails because of a `FetchFailedException`, the condition previously used to decide whether the result stage may be retried is `numMissingPartitions < resultStage.numTasks`.
   
   If the stage is then retried while the other tasks of the current result stage attempt are not killed, the resubmitted stage picks up the wrong partitions to recompute:
   ```
   // DAGScheduler#submitMissingTasks
    
   // Figure out the indexes of partition ids to compute.
   val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() 
   ```
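   `ResultStage.findMissingPartitions()` only returns the partitions that the active job has not yet marked as finished, so partitions completed by still-running tasks from the old attempt are excluded even though they may need to be recomputed. A simplified, paraphrased sketch (not the exact code):
   ```scala
   // ResultStage#findMissingPartitions (simplified sketch, paraphrased).
   // Partitions already marked finished by the job -- including those completed
   // by not-yet-killed tasks from the failed attempt -- are excluded here.
   override def findMissingPartitions(): Seq[Int] = {
     val job = activeJob.get
     (0 until job.numPartitions).filter(id => !job.finished(id))
   }
   ```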
   As a result, the number of partitions selected for recomputation can be smaller than the number of partitions that actually need to be recomputed for the result stage.
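   
   For context, a paraphrased sketch of the rollback handling that this condition comes from (assuming the standard `DAGScheduler` fetch-failure rollback path; not the exact code):
   ```scala
   // Paraphrased sketch of the current rollback handling for a result stage.
   case resultStage: ResultStage if resultStage.activeJob.isDefined =>
     val numMissingPartitions = resultStage.findMissingPartitions().length
     if (numMissingPartitions < resultStage.numTasks) {
       // Some result tasks have already succeeded; result tasks cannot be
       // rolled back, so the stage is aborted.
       abortStage(resultStage, generateErrorMessage(resultStage), None)
     }
     // Otherwise the stage is left to retry, even though tasks from the failed
     // attempt may still be running and can finish before resubmission.
   ```
   This PR instead aborts the result stage directly, so the outcome no longer depends on how many of the old attempt's tasks happen to have finished at the moment the check runs.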
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing tests and a new test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

