[ https://issues.apache.org/jira/browse/SPARK-22074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao resolved SPARK-22074.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19287
[https://github.com/apache/spark/pull/19287]

> Task killed by other attempt task should not be resubmitted
> -----------------------------------------------------------
>
>                 Key: SPARK-22074
>                 URL: https://issues.apache.org/jira/browse/SPARK-22074
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Li Yuanjian
>              Labels: speculation
>             Fix For: 2.3.0
>
>
> When a task is killed because another attempt of the same task succeeded, it 
> is still resubmitted if its executor is later lost. With some probability this 
> unnecessary resubmit causes the stage to hang forever (see the scenario 
> description below). Although the patch 
> https://issues.apache.org/jira/browse/SPARK-13931 can resolve the hang 
> (thx [~GavinGavinNo1] :) ), the unnecessary resubmit should still be dropped.
> Detailed scenario description:
> 1. A ShuffleMapStage has many tasks, and some of them finish successfully.
> 2. An executor is lost. This triggers the resubmission of a new TaskSet that 
> includes all missing partitions.
> 3. Before the resubmitted TaskSet completes, another executor, which holds only 
> the task killed by another attempt, is lost. This triggers the Resubmitted 
> event, so the current stage's pendingPartitions is not empty.
> 4. The resubmitted TaskSet ends with shuffleMapStage.isAvailable == true, but 
> pendingPartitions is not empty, so the scheduler never steps into 
> submitWaitingChildStages.
> The key logs of this scenario are below:
> {noformat}
> 393332:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting 120 missing tasks from ShuffleMapStage 1046 
> (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116)
> 393333:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO 
> YarnClusterScheduler: Adding task set 1046.0 with 120 tasks
> 408766:17/09/11 13:46:25 [dispatcher-event-loop-5] INFO TaskSetManager: 
> Starting task 66.0 in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, 
> executor 15, partition 66, PROCESS_LOCAL, 6237 bytes)
> [1] Executor 15 is lost; tasks 66.0 and 90.0 were on it
> 410532:17/09/11 13:46:32 [dispatcher-event-loop-47] INFO 
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 15.
> 410900:17/09/11 13:46:33 [dispatcher-event-loop-34] INFO TaskSetManager: 
> Starting task 66.1 in stage 1046.0 (TID 111400, hidden-baidu-host.baidu.com, 
> executor 70, partition 66, PROCESS_LOCAL, 6237 bytes)
> [2] Task 66.0 killed by 66.1
> 411315:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Killing 
> attempt 0 for task 66.0 in stage 1046.0 (TID 110761) on 
> hidden-baidu-host.baidu.com as the attempt 1 succeeded on 
> hidden-baidu-host.baidu.com
> 411316:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Finished 
> task 66.1 in stage 1046.0 (TID 111400) in 3545 ms on 
> hidden-baidu-host.baidu.com (executor 70) (115/120)
> [3] Executor 7 lost, task 0.0 72.0 7.0 on it
> 411390:17/09/11 13:46:37 [dispatcher-event-loop-24] INFO 
> YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 7.
> 416014:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) finished in 94.577 s
> [4] ShuffleMapStage 1046.0 finished; the missing partitions trigger a 
> resubmission as 1046.1
> 416019:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Resubmitting ShuffleMapStage 1046 (rdd at AFDEntry.scala:116) because some of 
> its tasks had failed: 0, 72, 79
> 416020:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting ShuffleMapStage 1046 (MapPartitionsRDD[5321] at rdd at 
> AFDEntry.scala:116), which has no missing parents
> 416030:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Submitting 3 missing tasks from ShuffleMapStage 1046 (MapPartitionsRDD[5321] 
> at rdd at AFDEntry.scala:116)
> 416032:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO 
> YarnClusterScheduler: Adding task set 1046.1 with 3 tasks
> 416034:17/09/11 13:46:59 [dispatcher-event-loop-21] INFO TaskSetManager: 
> Starting task 0.0 in stage 1046.1 (TID 112788, hidden-baidu-host.baidu.com, 
> executor 37, partition 0, PROCESS_LOCAL, 6237 bytes)
> 416037:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: 
> Starting task 1.0 in stage 1046.1 (TID 112789, 
> yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com, executor 69, partition 
> 72, PROCESS_LOCAL, 6237 bytes)
> 416039:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: 
> Starting task 2.0 in stage 1046.1 (TID 112790, hidden-baidu-host.baidu.com, 
> executor 26, partition 79, PROCESS_LOCAL, 6237 bytes)
> [5] ShuffleMapStage 1046.1 is still running; the task attempt killed by 
> another attempt triggers the Resubmitted event
> 416646:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost 
> task 66.0 in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, executor 
> 15): ExecutorLostFailure (executor 15 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1502719603300_158941_01_104857616 on host: 
> hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container 
> released on a *lost* node
> 416647:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Resubmitted ShuffleMapTask(1046, 66), so marking it as still running
> 416648:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost 
> task 90.0 in stage 1046.0 (TID 110788, hidden-baidu-host.baidu.com, executor 
> 15): ExecutorLostFailure (executor 15 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1502719603300_158941_01_104857616 on host: 
> hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container 
> released on a *lost* node
> 416649:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: 
> Resubmitted ShuffleMapTask(1046, 90), so marking it as still running
> 417197:17/09/11 13:47:02 [task-result-getter-0] INFO TaskSetManager: Finished 
> task 0.0 in stage 1046.1 (TID 112788) in 3027 ms on 
> hidden-baidu-host.baidu.com (executor 37) (1/3)
> 417206:17/09/11 13:47:02 [task-result-getter-1] INFO TaskSetManager: Finished 
> task 1.0 in stage 1046.1 (TID 112789) in 3106 ms on 
> yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com (executor 69) (2/3)
> 417383:17/09/11 13:47:03 [task-result-getter-0] INFO TaskSetManager: Finished 
> task 2.0 in stage 1046.1 (TID 112790) in 3634 ms on 
> hidden-baidu-host.baidu.com (executor 26) (3/3)
> 417384:17/09/11 13:47:03 [task-result-getter-0] INFO YarnClusterScheduler: 
> Removed TaskSet 1046.1, whose tasks have all completed, from pool
> [6] TaskSet 1046.1 succeeds, but stage 1046 stays in the running set forever
> 417817:17/09/11 13:47:06 [task-result-getter-2] INFO YarnClusterScheduler: 
> Removed TaskSet 1046.0, whose tasks have all completed, from pool
> 417872:17/09/11 13:47:06 [dag-scheduler-event-loop] INFO DAGScheduler: 
> running: Set(ShuffleMapStage 1090, ResultStage 1069, ShuffleMapStage 1070, 
> ShuffleMapStage 1113, ShuffleMapStage 1092, ShuffleMapStage 1063, 
> ShuffleMapStage 1086, ShuffleMapStage 1065, ShuffleMapStage 1109, 
> ShuffleMapStage 1088, ShuffleMapStage 1111, ShuffleMapStage 1082, ResultStage 
> 1104, ShuffleMapStage 1105, ShuffleMapStage 1084, ShuffleMapStage 1076, 
> ShuffleMapStage 1107, ShuffleMapStage 1099, ShuffleMapStage 1078, 
> ShuffleMapStage 1101, ShuffleMapStage 1080, ShuffleMapStage 1072, 
> ShuffleMapStage 1094, ResultStage 1103, ShuffleMapStage 1074, ResultStage 
> 1096, ShuffleMapStage 1067, ShuffleMapStage 1046, ShuffleMapStage 1097)
> ......
> 598792:17/09/11 13:58:55 [dag-scheduler-event-loop] INFO DAGScheduler: 
> running: Set(ShuffleMapStage 1577, ShuffleMapStage 1548, ShuffleMapStage 
> 1571, ShuffleMapStage 1550, ShuffleMapStage 1543, ShuffleMapStage 1573, 
> ShuffleMapStage 1065, ShuffleMapStage 1565, ResultStage 1545, ShuffleMapStage 
> 1567, ShuffleMapStage 1546, ShuffleMapStage 1539, ShuffleMapStage 1569, 
> ShuffleMapStage 1561, ShuffleMapStage 1541, ShuffleMapStage 1533, 
> ShuffleMapStage 1563, ResultStage 1556, ShuffleMapStage 1535, ShuffleMapStage 
> 1557, ShuffleMapStage 1579, ShuffleMapStage 1537, ShuffleMapStage 1559, 
> ShuffleMapStage 1581, ShuffleMapStage 1552, ShuffleMapStage 1531, 
> ShuffleMapStage 1575, ShuffleMapStage 1554, ShuffleMapStage 1046)
> {noformat}
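
The bookkeeping behind the hang in the scenario above can be replayed with a toy model. This is a minimal sketch only: it is neither Spark's scheduler code nor the change made in pull request 19287. ToyStage, ToyScheduler, replay, and the skip_killed_by_other_attempt flag are hypothetical names introduced for illustration, and the model assumes that child stages are submitted only when the stage's output is available and its pending set is empty, as step 4 of the scenario describes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    """Toy stand-in for a ShuffleMapStage (hypothetical, not Spark's class)."""
    num_partitions: int
    pending: set = field(default_factory=set)    # plays the role of pendingPartitions
    available: set = field(default_factory=set)  # partitions that have map output

    def is_available(self) -> bool:
        return len(self.available) == self.num_partitions


class ToyScheduler:
    """Tracks one stage; the flag switches the assumed fix on or off."""

    def __init__(self, stage: ToyStage, skip_killed_by_other_attempt: bool):
        self.stage = stage
        self.skip_killed = skip_killed_by_other_attempt
        self.children_submitted = False

    def on_success(self, partition: int) -> None:
        self.stage.pending.discard(partition)
        self.stage.available.add(partition)
        # Child stages are submitted only when nothing is pending any more.
        if not self.stage.pending and self.stage.is_available():
            self.children_submitted = True

    def on_executor_lost(self, attempts) -> None:
        """attempts: iterable of (partition, killed_by_other_attempt)."""
        for partition, killed_by_other in attempts:
            if killed_by_other and self.skip_killed:
                continue  # assumed fix: do not resubmit such a task
            self.stage.pending.add(partition)  # models the Resubmitted event


def replay(skip_killed_by_other_attempt: bool) -> ToyScheduler:
    """Replay the log excerpt with partitions 0, 66, 72 and 79 only."""
    stage = ToyStage(num_partitions=4, pending={0, 66, 72, 79})
    sched = ToyScheduler(stage, skip_killed_by_other_attempt)
    sched.on_success(66)                  # [2] attempt 66.1 succeeds, 66.0 is killed
    sched.on_executor_lost([(66, True)])  # [5] loss of the executor that ran 66.0
    for partition in (0, 72, 79):         # resubmitted TaskSet 1046.1 finishes
        sched.on_success(partition)
    return sched


if __name__ == "__main__":
    buggy = replay(skip_killed_by_other_attempt=False)
    fixed = replay(skip_killed_by_other_attempt=True)
    print(sorted(buggy.stage.pending), buggy.children_submitted)  # [66] False
    print(sorted(fixed.stage.pending), fixed.children_submitted)  # [] True
```

Without the flag, partition 66 returns to the pending set after its output is already available, so the model never submits child stages. With the flag, the killed attempt is ignored and the stage completes.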



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
