[ https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862248#comment-15862248 ]
Apache Spark commented on SPARK-19560:
--------------------------------------

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16892

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19560
>                 URL: https://issues.apache.org/jira/browse/SPARK-19560
>             Project: Spark
>          Issue Type: Test
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>            Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a
> ShuffleMapTask that completed successfully, but ran on an executor that
> failed sometime after the task was launched. This case is tricky because the
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed
> successfully, but the DAGScheduler considers the output it generated to be no
> longer valid (because it was probably lost when the executor was lost). As a
> result, the DAGScheduler needs to re-submit the stage, so that the task can
> be re-run. This is tested in some of the tests but not clearly documented,
> so we should improve this to prevent future bugs (this was encountered by
> [~markhamstra] in attempting to find a better fix for SPARK-19263).
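
For readers unfamiliar with the case the issue describes, here is a minimal toy model in Python. It is not Spark's code: the class ToyDagScheduler, its methods, and the epoch bookkeeping are hypothetical simplifications chosen for illustration. It assumes each task remembers an epoch from launch time, and that a "successful" map task is ignored if its executor was marked lost at or after that epoch, so that its partition still counts as missing and has to be re-run.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDagScheduler:
    """Toy model of the scenario in the issue; hypothetical, not Spark's API."""
    num_partitions: int
    epoch: int = 0  # bumped every time an executor is lost
    failed_epoch: dict = field(default_factory=dict)  # executor id -> epoch when lost
    output_locs: dict = field(default_factory=dict)   # partition -> executor holding output

    def launch_task(self, partition, executor):
        # A task records the epoch that was current when it was launched.
        return {"partition": partition, "executor": executor, "epoch": self.epoch}

    def executor_lost(self, executor):
        # Remember when the executor was lost, forget its outputs, bump the epoch.
        self.failed_epoch[executor] = self.epoch
        self.output_locs = {p: e for p, e in self.output_locs.items() if e != executor}
        self.epoch += 1

    def task_succeeded(self, task):
        # The lower-level scheduler reports success, but the output is treated
        # as invalid if the executor was lost after the task was launched.
        exec_id = task["executor"]
        if exec_id in self.failed_epoch and task["epoch"] <= self.failed_epoch[exec_id]:
            return False  # completion ignored; no output location registered
        self.output_locs[task["partition"]] = exec_id
        return True

    def missing_partitions(self):
        # Partitions that still need a task to run before the stage is complete.
        return [p for p in range(self.num_partitions) if p not in self.output_locs]


if __name__ == "__main__":
    sched = ToyDagScheduler(num_partitions=2)
    t0 = sched.launch_task(0, "exec-1")
    t1 = sched.launch_task(1, "exec-2")
    sched.executor_lost("exec-1")
    print(sched.task_succeeded(t0), sched.task_succeeded(t1))
    print(sched.missing_partitions())
```

In this sketch the late success from "exec-1" is dropped, partition 0 remains missing, and a resubmitted task for that partition is accepted because it carries a newer epoch.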