[ 
https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19502:
-----------------------------------
    Description: 
There are a [few lines of code in the 
DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
 to re-submit shuffle map stages when some of the tasks fail.  My understanding 
is that there should be a 1:1 mapping between pending tasks (which are tasks 
that haven't completed successfully) and available output locations, so that 
code should never be reachable.  Furthermore, the approach taken by that code 
(to re-submit an entire stage as a result of task failures) is not how we 
handle task failures in a stage (the lower-level scheduler resubmits the 
individual tasks), which is what the 5-year-old TODO on that code seems to 
imply should be done.
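
As a toy illustration of the invariant being described (this is not Spark's actual code; the names `pendingPartitions`, `outputLocs`, and `ToyShuffleStage` are simplified stand-ins for the real DAGScheduler/ShuffleMapStage internals), the claimed 1:1 relationship can be sketched as:

```scala
// Toy model: a shuffle map stage is "available" exactly when every
// partition has a registered map output location.  Illustrative only.
case class ToyShuffleStage(numPartitions: Int) {
  // Partitions whose tasks have not yet completed successfully.
  val pendingPartitions = scala.collection.mutable.HashSet(0 until numPartitions: _*)
  // Map output location per partition; None = no output available yet.
  val outputLocs = Array.fill[Option[String]](numPartitions)(None)

  def recordTaskSuccess(partition: Int, host: String): Unit = {
    outputLocs(partition) = Some(host)
    pendingPartitions -= partition
  }

  // The claimed invariant: a partition is pending iff it has no output.
  def invariantHolds: Boolean =
    (0 until numPartitions).forall(p => pendingPartitions.contains(p) == outputLocs(p).isEmpty)

  def isAvailable: Boolean = outputLocs.forall(_.isDefined)
}
```

If that invariant always held, the stage would become available exactly when pendingPartitions empties, and the "re-submit the whole stage" branch would be dead code.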

The big caveat is that there's a bug being fixed in SPARK-19263 that means 
there is *not* a 1:1 relationship between pendingTasks and available 
outputLocations, so that code is serving as a (buggy) band-aid.  This should be 
fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you 
see any reason we actually do need that code)


  was:
There are a [few lines of code in the 
DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215)
 to re-submit shuffle map stages when some of the tasks fail.  My understanding 
is that there should be a 1:1 mapping between pending tasks (which are tasks 
that haven't completed successfully) and available output locations, so that 
code should never be reachable.  Furthermore, the approach taken by that code 
(to re-submit an entire stage as a result of task failures) is not how we 
handle task failures in a stage (the lower-level scheduler resubmits the 
individual tasks), which is what the 5-year-old TODO on that code seems to 
imply should be done.

The big caveat is that there's a bug being fixed in SPARK-19263 that means 
there is *not* a 1:1 relationship between pendingTasks and available 
outputLocations, so that code is serving as a (buggy) band-aid.  This should be 
fixed once we resolve SPARK-19263.

cc [~imranr] [~markhamstra] [~jinxing6...@126.com]


> Remove unnecessary code to re-submit stages in the DAGScheduler
> ---------------------------------------------------------------
>
>                 Key: SPARK-19502
>                 URL: https://issues.apache.org/jira/browse/SPARK-19502
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.1.1
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
