[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182095#comment-16182095
 ] 

Li Yuanjian commented on SPARK-22074:
-

Yes, that's right.

> Task killed by other attempt task should not be resubmitted
> ---
>
> Key: SPARK-22074
> URL: https://issues.apache.org/jira/browse/SPARK-22074
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Li Yuanjian
>
> When a task is killed by another task attempt, it is still resubmitted if 
> its executor is lost. With some probability this unnecessary resubmit causes 
> the stage to hang forever (see the scenario description below). Although the 
> patch https://issues.apache.org/jira/browse/SPARK-13931 resolves the hanging 
> problem (thanks [~GavinGavinNo1] :) ), the unnecessary resubmit should still 
> be eliminated.
> Detailed scenario description:
> 1. A ShuffleMapStage has many tasks, and some of them have finished 
> successfully.
> 2. An executor is lost, which triggers the resubmission of a new TaskSet that 
> includes all missing partitions.
> 3. Before the resubmitted TaskSet completes, another executor is lost. That 
> executor hosted only the task that was killed by another attempt, yet the 
> loss still triggers a Resubmitted event, so the current stage's 
> pendingPartitions is not empty.
> 4. The resubmitted TaskSet ends with shuffleMapStage.isAvailable == true, but 
> pendingPartitions is still not empty, so the scheduler never steps into 
> submitWaitingChildStages.
> The key logs of this scenario are below:
> {noformat}
> 393332:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting 120 missing tasks from ShuffleMapStage 1046 
> (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116)
> 39:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO 
> YarnClusterScheduler: Adding task set 1046.0 with 120 tasks
> 408766:17/09/11 13:46:25 [dispatcher-event-loop-5] INFO TaskSetManager: 
> Starting task 66.0 in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, 
> executor 15, partition 66, PROCESS_LOCAL, 6237 bytes)
> [1] Executor 15 lost, task 66.0 and 90.0 on it
> 410532:17/09/11 13:46:32 [dispatcher-event-loop-47] INFO 
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 15.
> 410900:17/09/11 13:46:33 [dispatcher-event-loop-34] INFO TaskSetManager: 
> Starting task 66.1 in stage 1046.0 (TID 111400, hidden-baidu-host.baidu.com, 
> executor 70, partition 66, PROCESS_LOCAL, 6237 bytes)
> [2] Task 66.0 killed by 66.1
> 411315:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Killing 
> attempt 0 for task 66.0 in stage 1046.0 (TID 110761) on 
> hidden-baidu-host.baidu.com as the attempt 1 succeeded on 
> hidden-baidu-host.baidu.com
> 411316:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Finished 
> task 66.1 in stage 1046.0 (TID 111400) in 3545 ms on 
> hidden-baidu-host.baidu.com (executor 70) (115/120)
> [3] Executor 7 lost, task 0.0 72.0 7.0 on it
> 411390:17/09/11 13:46:37 [dispatcher-event-loop-24] INFO 
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> 416014:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) finished in 94.577 s
> [4] ShuffleMapStage 1046.0 finished; the missing partitions trigger a 
> resubmit as 1046.1
> 416019:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Resubmitting ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) because some of 
> its tasks had failed: 0, 72, 79
> 416020:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting ShuffleMapStage 1046 (MapPartitionsRDD[5321] at rdd at 
> AFDEntry.scala:116), which has no missing parents
> 416030:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting 3 missing tasks from ShuffleMapStage 1046 (MapPartitionsRDD[5321] 
> at rdd at AFDEntry.scala:116)
> 416032:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO 
> YarnClusterScheduler: Adding task set 1046.1 with 3 tasks
> 416034:17/09/11 13:46:59 [dispatcher-event-loop-21] INFO TaskSetManager: 
> Starting task 0.0 in stage 1046.1 (TID 112788, hidden-baidu-host.baidu.com, 
> executor 37, partition 0, PROCESS_LOCAL, 6237 bytes)
> 416037:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: 
> Starting task 1.0 in stage 1046.1 (TID 112789, 
> yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com, executor 69, partition 
> 72, PROCESS_LOCAL, 6237 bytes)
> 416039:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: 
> Starting task 2.0 in stage 1046.1 (TID 112790, hidden-baidu-host.baidu.com, 
> executor 26, partition 79, PROCESS_LOCAL, 6237 bytes)
> [5] ShuffleMapStage 1046.1 is still running when the task attempt that was 
> killed by the other attempt triggers the Resubmitted event
> 416646:17/09/11 13:47:01 [dispatcher-event-loop-2

[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182062#comment-16182062
 ] 

Saisai Shao commented on SPARK-22074:
-

So if I understand correctly, this happens when speculation is enabled: once 
one task attempt (66.1) finishes, it tries to kill all other attempts (66.0), 
but before that attempt (66.0) is fully killed, the executor running it is 
lost, so the scheduler resubmits the attempt because of the executor loss and 
ignores the other, successful attempt. Am I right?
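
The race described above can be illustrated with a small toy model. This is 
only a sketch under stated assumptions, not Spark's scheduler code: the class 
ToyTaskSet, its methods, and the skip_killed flag are hypothetical names 
invented for this illustration. The task IDs, partition and executor numbers 
are taken from the logs in the issue.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    tid: int
    partition: int
    executor: str


@dataclass
class ToyTaskSet:
    """Hypothetical, heavily simplified stand-in for a task set's bookkeeping."""
    attempts: dict = field(default_factory=dict)              # tid -> Attempt
    successful: set = field(default_factory=set)              # partitions with a finished attempt
    killed_by_other_attempt: set = field(default_factory=set)  # tids killed after another attempt won

    def launch(self, tid, partition, executor):
        self.attempts[tid] = Attempt(tid, partition, executor)

    def succeed(self, tid):
        winner = self.attempts[tid]
        self.successful.add(winner.partition)
        # Every other attempt for the same partition is killed.
        for other in self.attempts.values():
            if other.partition == winner.partition and other.tid != tid:
                self.killed_by_other_attempt.add(other.tid)

    def executor_lost(self, executor, skip_killed):
        """Return the partitions that get resubmitted when `executor` is lost."""
        resubmitted = []
        for attempt in self.attempts.values():
            if attempt.executor != executor or attempt.partition not in self.successful:
                continue
            # Assumed guard: an attempt that was killed because another attempt
            # succeeded did not produce the output, so it is not resubmitted.
            if skip_killed and attempt.tid in self.killed_by_other_attempt:
                continue
            self.successful.discard(attempt.partition)
            resubmitted.append(attempt.partition)
        return resubmitted


def run(skip_killed):
    ts = ToyTaskSet()
    ts.launch(110761, 66, "15")   # attempt 66.0 on executor 15
    ts.launch(111400, 66, "70")   # attempt 66.1 on executor 70
    ts.succeed(111400)            # 66.1 finishes and 66.0 is killed
    return ts.executor_lost("15", skip_killed)


print(run(skip_killed=False))  # [66]: partition 66 is resubmitted unnecessarily
print(run(skip_killed=True))   # []: the killed attempt is ignored
```

Without the guard, losing executor 15 resubmits partition 66 even though its 
output was produced on executor 70.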




[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181943#comment-16181943
 ] 

Li Yuanjian commented on SPARK-22074:
-

Hi [~jerryshao] Saisai, 66.0 was resubmitted because its executor was lost 
while 1046.1 was running. I also reproduced this in the UT added in my patch 
and added a detailed scenario description in its comments; the test fails 
without the changes in this PR and passes with them. Could you help me check 
whether the UT recreates the scenario correctly? Thanks a lot. :)


[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181924#comment-16181924
 ] 

Saisai Shao commented on SPARK-22074:
-

Hey [~XuanYuan], I'm a little confused about why there would be a resubmit 
event after 66.0 is killed, since this kill is expected and Spark should not 
launch another attempt.


[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180652#comment-16180652
 ] 

Li Yuanjian commented on SPARK-22074:
-

Hi [~jerryshao], thanks for your comment. 
In my scenario, 66.0 is truly killed by 66.1. The root cause of 1046.0 failing 
to finish is that the Resubmitted event for task 66.0 (previously killed by 
66.1) arrived while 1046.1 was running.
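
To make the hang concrete, here is a minimal sketch of the stage bookkeeping 
involved. It assumes a simplified model in which child stages are submitted 
only when the pending set is empty and all outputs are available; 
ToyShuffleStage and its members are hypothetical names that loosely mirror 
pendingPartitions and isAvailable from the issue description, and a 
three-partition stage stands in for stage 1046.

```python
class ToyShuffleStage:
    """Hypothetical, simplified model of a shuffle map stage's bookkeeping."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.outputs = set()                                  # partitions with map output
        self.pending_partitions = set(range(num_partitions))  # partitions still expected
        self.children_submitted = False

    @property
    def is_available(self):
        return len(self.outputs) == self.num_partitions

    def on_success(self, partition):
        self.pending_partitions.discard(partition)
        self.outputs.add(partition)
        # Assumed rule: child stages are submitted only when nothing is pending.
        if not self.pending_partitions and self.is_available:
            self.children_submitted = True

    def on_resubmitted(self, partition):
        self.pending_partitions.add(partition)


def replay(spurious_resubmit):
    stage = ToyShuffleStage(3)
    stage.on_success(1)              # the speculative attempt wins for partition 1
    if spurious_resubmit:
        stage.on_resubmitted(1)      # Resubmitted event for the killed attempt
    stage.on_success(0)
    stage.on_success(2)
    return (stage.is_available,
            sorted(stage.pending_partitions),
            stage.children_submitted)


print(replay(spurious_resubmit=True))   # (True, [1], False): stage hangs
print(replay(spurious_resubmit=False))  # (True, [], True): children submitted
```

In the first run every output is available, but the stale pending entry 
keeps the child stages from ever being submitted.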


[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16180506#comment-16180506
 ] 

Saisai Shao commented on SPARK-22074:
-

Hi [~XuanYuan], can you please help me understand your scenario? Does it 
happen only when a task attempt (66.0) is lost (and is therefore added to the 
pending list), and at that time another attempt (66.1) finishes and tries to 
kill 66.0, but because 66.0 is pending resubmission it is not truly killed, 
so attempt 66.0 lingers in stage 1046.0, which makes 1046 fail to finish? Do I 
understand this right?

Please explain more if my assumption is wrong.


[jira] [Commented] (SPARK-22074) Task killed by other attempt task should not be resubmitted

2017-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172772#comment-16172772
 ] 

Apache Spark commented on SPARK-22074:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/19287
