[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387145#comment-15387145
 ] 

Lianhui Wang commented on SPARK-2666:
-

I think what [~irashid] said applies more to the non-external shuffle. In our use 
cases we usually run the YARN external shuffle service. For 0 --> 1 --> 2, if Spark 
hits a shuffle fetch failure while running stage 2, say on executor A, it needs to 
regenerate the map output for stage 1 that was on executor A, but it does not need 
to rerun stage 0 on executor A, since the external shuffle service can still serve 
stage 0's output.
So I think we can first handle FetchFailed on the YARN external shuffle 
service (caused by, e.g., a connection timeout or out of memory). I think many 
users have hit FetchFailed on the YARN external shuffle service.
As [~tgraves] said before, right now if a stage fails because of FetchFailed, Spark 
takes option 1): it reruns all the tasks not yet succeeded in the failed stage, 
including the ones that could still be running. This causes many duplicate running 
tasks for the failed stage: once there is a FetchFailed, all the unsuccessful 
tasks of the failed stage are rerun.
For now, I think our first target is the YARN external shuffle service: when a 
stage fails because of FetchFailed, we should decrease the number of tasks rerun 
for the failed stage. As I pointed out before, the best way is what MapReduce 
does: just resubmit the map work behind the failed stage, as follows.
1. When a FetchFailed happens in a task, the task is not failed; it keeps 
fetching the other map outputs and just reports the ShuffleBlockId of the 
FetchFailed to the DAGScheduler. The other running tasks of this stage do the 
same.
2. The DAGScheduler receives the ShuffleBlockId of the FetchFailed and resubmits 
the map task for that ShuffleBlockId. Once that task finishes, it registers the 
new map output with the MapOutputTracker.
3. The task that hit the FetchFailed asks the MapOutputTracker for the missing 
map output on every heartbeat. Once step 2 finishes, the task gets the map 
output and fetches the previously failed results.
But there is a deadlock if the map task of step 2 cannot run because there is 
no slot for it. In that situation the scheduler should kill some running tasks 
to free a slot. (A toy sketch of this flow follows.)
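
Sketching it in Scala (every name below is an assumption for illustration, not 
Spark's real API):

{code:scala}
import scala.collection.mutable

// Toy model of steps 1-3 above, not Spark's real API: reducers report bad
// shuffle blocks instead of failing, the scheduler reruns only the map task
// behind each reported block, and reducers look the output up again later.
object FetchFailedProtocolSketch {
  case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)

  private val pending = mutable.Queue[ShuffleBlockId]()   // reported failures
  private val mapOutputs = mutable.Map[Int, String]()     // mapId -> executor

  // Step 1: a reducer reports the bad block and keeps fetching the rest.
  def reportFetchFailed(block: ShuffleBlockId): Unit = pending.enqueue(block)

  // Step 2: the scheduler reruns just the reported map tasks and re-registers
  // their outputs, instead of resubmitting the whole failed stage.
  def resubmitPending(rerunMapTask: Int => String): Unit =
    pending.dequeueAll(_ => true).foreach { b =>
      mapOutputs(b.mapId) = rerunMapTask(b.mapId)
    }

  // Step 3: on each heartbeat a reducer checks whether the output is back.
  def lookup(block: ShuffleBlockId): Option[String] = mapOutputs.get(block.mapId)
}
{code}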
In addition, I find that https://issues.apache.org/jira/browse/SPARK-14649 
implements option 2): it only reruns the failed tasks and waits for the ones 
still running in the failed stage. The disadvantage of SPARK-14649 is that the 
other running tasks of the failed stage may still need a long rerun later, 
after they have already spent time fetching the other map outputs.


> Always try to cancel running tasks when a stage is marked as zombie
> ---
>
> Key: SPARK-2666
> URL: https://issues.apache.org/jira/browse/SPARK-2666
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Lianhui Wang
>
> There are some situations in which the scheduler can mark a task set as a 
> "zombie" before the task set has completed all of its tasks.  For example:
> (a) When a task fails b/c of a {{FetchFailed}}
> (b) When a stage completes because two different attempts create all the 
> ShuffleMapOutput, though no attempt has completed all its tasks (at least, 
> this *should* result in the task set being marked as zombie, see SPARK-10370)
> (there may be others, I'm not sure if this list is exhaustive.)
> Marking a taskset as zombie prevents any *additional* tasks from getting 
> scheduled, however it does not cancel all currently running tasks.  We should 
> cancel all running tasks to avoid wasting resources (and also to make the behavior 
> a little more clear to the end user).  Rather than canceling tasks in each 
> case piecemeal, we should refactor the scheduler so that these two actions 
> are always taken together -- canceling tasks should go hand-in-hand with 
> marking the taskset as zombie.
> Some implementation notes:
> * We should change {{taskSetManager.isZombie}} to be private and put it 
> behind a method like {{markZombie}} or something.
> * marking a stage as zombie before all the tasks have completed does *not* 
> necessarily mean the stage attempt has failed.  In case (a), the stage 
> attempt has failed, but in case (b) we are not canceling b/c of a failure, 
> rather just b/c no more tasks are needed.
> * {{taskScheduler.cancelTasks}} always marks the task set as zombie.  
> However, it also has some side-effects like logging that the stage has failed 
> and creating a {{TaskSetFailed}} event, which we don't want eg. in case (b) 
> when nothing has failed.  So it may need some additional refactoring to go 
> along w/ {{markZombie}}.
> * {{SchedulerBackend}}s are free to not implement {{killTask}}, so we need 
> to be sure to catch the resulting {{UnsupportedOperationException}}s
> * Testing this *might* benefit from SPARK-10372
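
For the last note, a minimal sketch of catching {{UnsupportedOperationException}} 
around {{killTask}} (the surrounding names are assumptions, not the actual 
scheduler code):

{code:scala}
// Sketch only: kill a zombie task set's running tasks, tolerating backends
// that do not implement killTask. `runningTasks` and the killTask function
// are assumed stand-ins for the real scheduler state.
object ZombieKillSketch {
  def killRunningTasks(
      runningTasks: Map[Long, String],                 // taskId -> executorId
      killTask: (Long, String, Boolean) => Unit): Unit =
    runningTasks.foreach { case (taskId, execId) =>
      try {
        killTask(taskId, execId, false)                // interruptThread = false
      } catch {
        case _: UnsupportedOperationException =>
          // Backend cannot kill tasks; let them run and ignore their results.
      }
    }
}
{code}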




[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386561#comment-15386561
 ] 

Thomas Graves commented on SPARK-2666:
--

Thanks for the explanation.

I guess we would have to look through the failure cases, but if you are using 
the external shuffle service, it feels like marking everything on that node as 
bad, even if it's from another executor, would be better, because this case 
seems more like a node failure or something that would be much more likely to 
affect other map outputs.

I guess if it's serving shuffle from the executor, it could just be something 
bad on that executor (out of memory, timeout due to overload, etc.). 






[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386524#comment-15386524
 ] 

Imran Rashid commented on SPARK-2666:
-

[~tgraves] [~lianhuiwang] When there is a fetch failure, Spark considers all 
shuffle output on that executor to be gone.  (The code is rather confusing -- 
first it just removes the one block whose fetch failed: 
https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134
  but just after that, it removes everything on the executor: 
https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184)
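
In a simplified form (names assumed, just restating the two steps; the linked 
lines are the real code):

{code:scala}
// Sketch of the behavior described above, not the actual DAGScheduler code.
object FetchFailureInvalidation {
  // (shuffleId, mapId) -> executorId currently hosting that map output
  type OutputLocs = Map[(Int, Int), String]

  def handleFetchFailure(locs: OutputLocs, shuffleId: Int, mapId: Int,
                         execId: String): OutputLocs = {
    // First: unregister just the block named in the FetchFailed...
    val afterBlock = locs - ((shuffleId, mapId))
    // ...then treat *all* shuffle output on that executor as lost.
    afterBlock.filterNot { case (_, host) => host == execId }
  }
}
{code}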


When a stage is retried, it reruns all the tasks for the missing shuffle 
outputs, as of the time the stage is retried.  Usually, this is just all of the 
map output that was on the executor that had the fetch failure.  But it's not 
necessarily exactly the same, as even more shuffle outputs could be lost before 
the stage retry kicks in.

* Suppose you had three stages in a row, 0 --> 1 --> 2, and you hit a shuffle 
fetch failure while running stage 2, say on executor A.  So you need to 
regenerate the map output for stage 1 that was on executor A.  But most likely 
Spark will discover that to regenerate that missing output, it needs some map 
output from stage 0, which was also on executor A.  So first it will go re-run 
the missing parts of stage 0, and then when it gets to stage 1, the DAG 
scheduler will look at what map outputs are missing at that point.  So there is 
some extra time in there to discover more missing shuffle outputs. (A toy 
illustration of this cascade follows.)
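
To make the cascade concrete, a toy sketch (all names are assumptions, not 
scheduler code):

{code:scala}
// Toy model of the bottom-up discovery above: each stage records where each
// map output lives; the partitions a stage must rerun are only worked out when
// the scheduler reaches that stage, parents first.
object RetryCascadeSketch {
  case class Stage(id: Int, outputs: Map[Int, String], parent: Option[Stage])

  // Which partitions of which stages rerun after executor `lostExec` is lost,
  // listed parents-first, the order the scheduler discovers them.
  def rerunsAfterLoss(stage: Stage, lostExec: String): List[(Int, Set[Int])] = {
    val parentReruns = stage.parent.toList.flatMap(rerunsAfterLoss(_, lostExec))
    val missing = stage.outputs.collect { case (p, e) if e == lostExec => p }.toSet
    if (missing.nonEmpty) parentReruns :+ (stage.id -> missing) else parentReruns
  }
}
{code}

For 0 --> 1 --> 2 with some outputs of stages 0 and 1 on executor A, this 
reports stage 0's missing partitions first and only then stage 1's, which is 
where the extra discovery time comes from.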

* Spark only marks the shuffle output as missing for the *executor* that 
shuffle data couldn't be read from, not for the entire node.  So if it's a 
hardware failure, you're likely to hit more failures even after the first fetch 
failure comes in, since you probably can't read from any of the executors on 
that host.

Despite this, I don't think there is a very good reason to leave tasks running 
after there is a fetch failure.  If there is a hardware failure, then the rest 
of the retry process is also likely to discover this and remove those executors 
as well.  (Kay and I had discussed this earlier in the thread and we seemed to 
agree, though I dunno if we had thought through all the details at that time.)  
If anything, I wonder if when there is a fetch failure, we should mark all data 
as missing on the entire node, not just the executor, but I don't think that is 
necessary.


[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386235#comment-15386235
 ] 

Thomas Graves commented on SPARK-2666:
--

I think eventually adding prestart (a MapReduce slowstart-type setting) makes 
sense.  This is actually why I didn't change the map output statuses to go along 
with task launch: I wanted to be able to do this, or to get incremental map 
output status results.

But as far as keeping the remaining tasks running goes, I think it depends on 
the behavior, and I haven't had time to look at it in more detail.

If the stage fails, which tasks does it rerun?

1) Does it rerun all the ones not yet succeeded in the failed stage (including 
the ones that could still be running)?  
2) Does it only rerun the failed ones and wait for the ones still running in 
the failed stage?  If they succeed, it uses those results.

From what I saw with this job, I thought it was acting like number 1 above. The 
only use of leaving the others running is to see if they get FetchFailures, and 
that seems like a lot of overhead to find out if a task takes a long time.

When a fetch failure happens, does the scheduler re-run all maps that had run on 
that node, or just the ones specifically mentioned by the fetch failure?  Again, 
I thought it was just the specific map output that the fetch failure failed to 
get, which is why it needs to know whether the other reducers get fetch failures.

I can kind of understand letting them run to see if they hit fetch failures as 
well, but on a large job, or with tasks that take a long time, if we aren't 
counting them as successes then it's more a waste of resources: it extends the 
job time and confuses the user, since the UI doesn't show those still running.

In the case I was seeing, my tasks took roughly an hour.  One stage failed, so 
it restarted that stage, but since it didn't kill the tasks from the original 
attempt, it had very few executors open to run new ones, so the job took a lot 
longer than it should have.  I don't remember the exact cause of the failures 
anymore.

Anyway, I think the results are going to vary a lot based on the type of job 
and the length of each stage (map vs. reduce). 

Personally, I think it would be better to fail all maps that ran on the host it 
failed to fetch from and to kill the rest of the running reducers in that stage 
(roughly the policy sketched below), but I would have to investigate the code 
more to fully understand.
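
As a hedged sketch of that policy (every name here is assumed, not Spark code):

{code:scala}
// Sketch only: on a fetch failure, drop every map output served from the bad
// *host* (not just one executor) and kill the failed stage's running reducers
// so their slots go to the retry. All names are stand-ins.
object HostFailurePolicySketch {
  def onFetchFailure(
      mapOutputs: Map[Int, String],      // mapId -> host serving that output
      runningReducers: Set[Long],        // taskIds still running in the stage
      badHost: String,
      kill: Long => Unit): Map[Int, String] = {
    runningReducers.foreach(kill)        // free executors for the retry stage
    mapOutputs.filterNot { case (_, host) => host == badHost }
  }
}
{code}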




[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-07-20 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385510#comment-15385510
 ] 

Lianhui Wang commented on SPARK-2666:
-

[~tgraves] Sorry for the late reply. https://github.com/apache/spark/pull/1572 
kills all running tasks before resubmitting for a FetchFailed. But 
[~kayousterhout] said we should keep the remaining tasks, because the running 
tasks may hit fetch failures from different map outputs than the original fetch 
failure. 
I think the best way is what MapReduce does: just resubmit the map work behind 
the failed stage. If a reduce task hits a FetchFailed, it just reports the 
FetchFailed to the DAGScheduler and keeps fetching the other results. Then the 
reduce task asks for the output status of the failed map on every heartbeat, 
like https://github.com/apache/spark/pull/3430 does. (A toy polling loop is 
sketched below.)
[~tgraves] What do you think about this? Thanks. 
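
For illustration only, a toy version of that per-heartbeat polling (assumed 
names, not the PR's actual code):

{code:scala}
// Sketch: after reporting a failed map, a reducer polls for the re-registered
// output on each heartbeat, then resumes fetching from the new location.
object HeartbeatPollSketch {
  def awaitMapOutput(
      getOutputStatus: Int => Option[String],  // mapId -> new location, if any
      mapId: Int,
      heartbeatMs: Long = 10000L): String = {
    var loc = getOutputStatus(mapId)
    while (loc.isEmpty) {
      Thread.sleep(heartbeatMs)                // stand-in for the heartbeat tick
      loc = getOutputStatus(mapId)
    }
    loc.get
  }
}
{code}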




[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2016-03-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176149#comment-15176149
 ] 

Thomas Graves commented on SPARK-2666:
--

[~lianhuiwang] Were you going to work on this?  I'm running into this, and I 
think it's a bad idea to keep running the old tasks.  It all depends on what 
those tasks are running and for how long.  In my case those tasks run a very 
long time doing an expensive shuffle.  We should kill those tasks immediately 
to allow tasks from the newer retry stage to run.

Did you run into issues with your PR, or did it just need a rebase?




[jira] [Commented] (SPARK-2666) Always try to cancel running tasks when a stage is marked as zombie

2015-09-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902459#comment-14902459
 ] 

Lianhui Wang commented on SPARK-2666:
-

[~imranr] Thanks, I have taken a look at https://github.com/squito/spark/pull/4, 
and I think its logic is right. It looks OK except for the unit tests.
