[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-23 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880279#comment-15880279
 ] 

xukun edited comment on SPARK-18113 at 2/23/17 12:35 PM:
-

[~jinxing6042]
??When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while.??
??Yes, I think this is the root cause of this bug, which can be fixed by https://github.com/palantir/spark/pull/94.??

Sorry, I don't understand. My steps are:
1. task 678.0 -> AskPermissionToCommitOutput
2. task 678.0 -> InformCommitDone
3. task 678.1 -> AskPermissionToCommitOutput
4. task 678.1 -> InformCommitDone
I think this is not fixed by https://github.com/palantir/spark/pull/94 (see the sketch below). Can you show me the steps?
Thanks!
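
For reference, the ask/inform protocol the four steps above assume, as a self-contained Scala sketch (the method names mirror the messages in the steps; this is not the PR's actual API):
{code}
// Sketch of the executor-side commit protocol implied by the steps above.
trait Coordinator {
  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean
  def commitDone(stage: Int, partition: Int, attempt: Int): Unit
}

def commitTask(c: Coordinator, stage: Int, partition: Int, attempt: Int)
              (doCommit: () => Unit): Boolean = {
  if (c.canCommit(stage, partition, attempt)) { // steps 1 and 3: AskPermissionToCommitOutput
    doCommit()                                  // write the task output
    c.commitDone(stage, partition, attempt)     // steps 2 and 4: InformCommitDone
    true
  } else {
    false                                       // surfaces as TaskCommitDenied on the driver
  }
}
{code}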


was (Author: xukun):
[~jinxing6042]
??When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while.??
??Yes, I think this is the root cause of this bug, which can be fixed by https://github.com/palantir/spark/pull/94.??

Sorry, I don't understand. My steps are:
1. task 678.0 -> AskPermissionToCommitOutput
2. task 678.0 -> InformCommitDone
3. task 678.1 -> AskPermissionToCommitOutput
4. task 678.1 -> InformCommitDone

Can you show me the steps?
Thanks!

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> Executor sends *AskPermissionToCommitOutput* to driver failed, and retry 
> another sending. Driver receives 2 AskPermissionToCommitOutput messages and 
> handles them. But executor ignores the first response(true) and receives the 
> second response(false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is locked forever. Driver enters into infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.rpc.askTimeout
> at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> ...
> {noformat}

[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-23 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880279#comment-15880279
 ] 

xukun edited comment on SPARK-18113 at 2/23/17 12:34 PM:
-

[~jinxing6042]
??When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while.??
??Yes, I think this is the root cause of this bug, which can be fixed by https://github.com/palantir/spark/pull/94.??

Sorry, I don't understand. My steps are:
1. task 678.0 -> AskPermissionToCommitOutput
2. task 678.0 -> InformCommitDone
3. task 678.1 -> AskPermissionToCommitOutput
4. task 678.1 -> InformCommitDone

Can you show me the steps?
Thanks!


was (Author: xukun):
[~jinxing6042]
??When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while.??
??Yes, I think this is the root cause of this bug, which can be fixed by https://github.com/palantir/spark/pull/94.??

Sorry, I don't understand. Can you show me how the cases match?
Thanks!


[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-21 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875599#comment-15875599
 ] 

jin xing edited comment on SPARK-18113 at 2/21/17 9:00 AM:
---

[~xukun]

I got the scenario you described. 

??When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while.??
Yes, I think this is the root cause of this bug, which can be fixed by 
https://github.com/palantir/spark/pull/94.

??If the executor is preempted after outputCommitCoordinator.commitDone, the driver still enters the task deadloop.??
I can hardly agree with the above: once *commitDone* is called, no further *AskPermissionToCommit* will be sent. Thus I believe
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
can clear the commit lock cleanly, and a subsequent *AskPermissionToCommit* from another task attempt will be authorized.
Thoughts?
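
For context, here is roughly how the stock 2.x coordinator frees the commit lock when that taskEnded(..., Resubmitted, ...) call reaches it (a simplified Scala sketch, not the exact Spark source):
{code}
import scala.collection.mutable

object CoordinatorSketch {
  val NO_AUTHORIZED_COMMITTER: Int = -1
  // partition -> attemptNumber currently holding the commit lock
  val authorizedCommitters: mutable.Map[Int, Int] =
    mutable.Map.empty[Int, Int].withDefaultValue(NO_AUTHORIZED_COMMITTER)

  def taskCompleted(partition: Int, attemptNumber: Int, succeeded: Boolean): Unit =
    synchronized {
      if (!succeeded && authorizedCommitters(partition) == attemptNumber) {
        // The resubmitted attempt held the lock: free it so a later attempt's
        // AskPermissionToCommitOutput can be authorized.
        authorizedCommitters(partition) = NO_AUTHORIZED_COMMITTER
      }
    }
}
{code}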


was (Author: jinxing6...@126.com):
[~xukun]

I got the scenario you described. 

??After the driver retries task 678.1, the executor executes outputCommitCoordinator.canCommit, because there is a gap between when the AppMaster gets completed container statuses and when the executor is killed.??
Yes, I think this is the root cause of this bug, which can be fixed by 
https://github.com/palantir/spark/pull/94.

??If the executor is preempted after outputCommitCoordinator.commitDone, the driver still enters the task deadloop.??
I can hardly agree with the above: once *commitDone* is called, no further *AskPermissionToCommit* will be sent. Thus I believe
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
can clear the commit lock cleanly, and a subsequent *AskPermissionToCommit* from another task attempt will be authorized.
Thoughts?


[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-15 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867124#comment-15867124
 ] 

xukun edited comment on SPARK-18113 at 2/16/17 1:58 AM:


[~aash]

According to my scenario and the [https://github.com/palantir/spark/pull/94] code:

task 678.0:
{code}
outputCommitCoordinator.canCommit matches
  CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) =>
    CommitState(attemptNumber, System.nanoTime(), MidCommit)

outputCommitCoordinator.commitDone matches
  CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter =>
    CommitState(attemptNumber, startTime, Committed)
{code}

task 678.1:
{code}
outputCommitCoordinator.canCommit matches
  CommitState(existingCommitter, _, Committed)
{code}

If the executor is preempted after outputCommitCoordinator.commitDone, the driver still enters the task deadloop.
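
Putting those match arms together as a runnable model (a sketch that assumes the CommitState(committer, startTime, status) shape described above, not the PR's actual code), the deadloop falls out directly:
{code}
sealed trait Status
case object Uncommitted extends Status
case object MidCommit   extends Status
case object Committed   extends Status
case class CommitState(committer: Int, startTime: Long, status: Status)
val NO_AUTHORIZED_COMMITTER = -1

def canCommit(attempt: Int, s: CommitState): (Boolean, CommitState) = s match {
  case CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) =>
    (true, CommitState(attempt, System.nanoTime(), MidCommit))
  case _ =>
    (false, s) // MidCommit held by another attempt, or already Committed
}

def commitDone(attempt: Int, s: CommitState): CommitState = s match {
  case CommitState(c, t, MidCommit) if c == attempt => CommitState(c, t, Committed)
  case _                                            => s
}

// task 678.0 is authorized and commits:
val s0 = CommitState(NO_AUTHORIZED_COMMITTER, 0L, Uncommitted)
val (ok0, s1) = canCommit(0, s0)  // ok0 == true, state moves to MidCommit
val s2 = commitDone(0, s1)        // state moves to Committed
// task 678.1 retries after the executor is preempted post-commitDone:
val (ok1, _) = canCommit(1, s2)   // ok1 == false, forever -> deadloop
{code}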


was (Author: xukun):
[~aash]

According to my scenario and the [https://github.com/palantir/spark/pull/94] code:

task 678.0:
{code}
outputCommitCoordinator.canCommit matches
  CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) =>
    CommitState(attemptNumber, System.nanoTime(), MidCommit)

outputCommitCoordinator.commitDone matches
  CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter =>
    CommitState(attemptNumber, startTime, Committed)
{code}

task 678.1:
{code}
outputCommitCoordinator.canCommit matches
  CommitState(existingCommitter, _, Committed)
{code}

Then the driver enters the task deadloop.


[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865644#comment-15865644
 ] 

xukun edited comment on SPARK-18113 at 2/14/17 12:01 PM:
-

[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}

After the driver retries task 678.1, the executor still executes outputCommitCoordinator.canCommit, because there is a gap between when the AppMaster gets completed container statuses and when the executor is actually killed (sketched below).
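
To illustrate that gap, a hand-written Scala sketch of the AM heartbeat path (not Spark's actual YarnAllocator code; onExecutorGone is a placeholder for the driver-side handling):
{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.ContainerStatus
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

def heartbeat(amClient: AMRMClient[ContainerRequest],
              onExecutorGone: ContainerStatus => Unit): Unit = {
  // One AM heartbeat: YARN reports containers already marked
  // completed/preempted on the RM side.
  val response = amClient.allocate(0.1f)
  for (status <- response.getCompletedContainersStatuses.asScala) {
    onExecutorGone(status) // driver side: remove executor, resubmit its tasks
    // ...but the executor JVM may keep running for a short while and can
    // still send AskPermissionToCommitOutput before it receives SIGTERM.
  }
}
{code}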


was (Author: xukun):
[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}

After the driver retries task 678.1, the executor executes outputCommitCoordinator.canCommit, because when the AppMaster gets completed container statuses


[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865644#comment-15865644
 ] 

xukun edited comment on SPARK-18113 at 2/14/17 11:58 AM:
-

[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}

After the driver retries task 678.1, the executor executes outputCommitCoordinator.canCommit, because when the AppMaster gets completed container statuses


was (Author: xukun):
[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}




[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865644#comment-15865644
 ] 

xukun edited comment on SPARK-18113 at 2/14/17 11:50 AM:
-

[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}




was (Author: xukun):
[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}




[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865644#comment-15865644
 ] 

xukun edited comment on SPARK-18113 at 2/14/17 11:51 AM:
-

[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}




was (Author: xukun):
[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] cannot resolve this scenario.
{code}
driver:   starting task 678.0 ---> Lost executor ---> retry 678.1 --->
          Authorizing attemptNumber=0, partition=678 ---> Denying attemptNumber=1, partition=678 ---> deadloop

executor: Running task 678.0 ---> outputCommitCoordinator.canCommit --->
          finished task 678.0 ---> RECEIVED SIGNAL 15:
{code}




[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865359#comment-15865359
 ] 

xukun edited comment on SPARK-18113 at 2/14/17 8:34 AM:


[~jinxing6...@126.com]
Yes. Reproduce steps:

1. The executor is running task x.0.
2. The AppMaster gets completed container statuses and tells the driver to remove the executor (the executor is not killed yet; it is still running).
3. The driver processes the CompletionEvent, and TaskSetManager calls:
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
4. Task x.0 requests a task commit; the coordinator approves the commit and sends a response message.
5. The driver receives AskPermissionToCommitOutput and authorizes attemptNumber=0 for partition x.
6. The driver starts task x.1 because of ExecutorLost and enters the task deadloop.

When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while. After TaskSetManager calls
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
task x.0 will still send AskPermissionToCommitOutput, so the driver enters the task deadloop (sketched below).
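
The deadloop in step 6, schematically: attempt 0 holds the commit lock but its executor is already gone, so every later attempt is denied. A simplified Scala sketch of the stock 2.0.x canCommit handling (not the exact Spark source):
{code}
import scala.collection.mutable

object CanCommitSketch {
  val NO_AUTHORIZED_COMMITTER: Int = -1
  val authorizedCommitters: mutable.Map[Int, Int] =
    mutable.Map.empty[Int, Int].withDefaultValue(NO_AUTHORIZED_COMMITTER)

  def handleAskPermissionToCommit(partition: Int, attemptNumber: Int): Boolean =
    synchronized {
      authorizedCommitters(partition) match {
        case NO_AUTHORIZED_COMMITTER =>
          authorizedCommitters(partition) = attemptNumber // step 5: attempt 0 takes the lock
          true
        case _ =>
          false // attempts 1, 2, ... are denied -> TaskCommitDenied, retry, deny, ...
      }
    }
}
{code}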



was (Author: xukun):
[~jinxing6...@126.com]
Yes. Reproduce steps:

1. The executor is running task x.0.
2. The AppMaster gets completed container statuses and tells the driver to remove the executor (the executor is not killed yet; it is still running).
3. The driver processes the CompletionEvent, and TaskSetManager calls:
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
4. Task x.0 requests a task commit; the coordinator approves the commit and sends a response message.
5. The driver receives AskPermissionToCommitOutput and authorizes attemptNumber=0 for partition x.
6. The driver starts task x.1 and enters the task deadloop.

When the AppMaster gets completed container statuses, the executor is not preempted immediately; it will be killed after a while. After TaskSetManager calls
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
task x.0 will still send AskPermissionToCommitOutput, so the driver enters the task deadloop.



[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-07 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855585#comment-15855585
 ] 

xukun edited comment on SPARK-18113 at 2/7/17 8:59 AM:
---

[~jinxing6...@126.com] [~aash]
I've met the same problem. Here's the sequence of steps:
* The executor requests a task commit.
* The coordinator approves the commit and sends a response message.
* The executor receives the response message, finishes successfully, and sends StatusUpdate to the driver.
* The executor is preempted by YARN before the StatusUpdate is received.
* The task is retried because of ExecutorLost, and no other attempt will succeed.

I think this PR cannot resolve this scenario (see the sketch below). Can you give me some advice?
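
The race above, as a Scala sketch (illustrative trait and method names; only canCommit, StatusUpdate, and ExecutorLost correspond to the Spark concepts discussed in this issue):
{code}
trait Driver {
  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean
  def statusUpdate(taskId: Long, successful: Boolean): Unit
}

def runTask(driver: Driver, taskId: Long,
            stage: Int, partition: Int, attempt: Int)(write: () => Unit): Unit = {
  if (driver.canCommit(stage, partition, attempt)) { // coordinator approves
    write()                                          // the output is committed
    driver.statusUpdate(taskId, successful = true)   // sent to the driver...
    // ...but YARN preempts the executor (SIGNAL 15) before the driver
    // processes it, so the driver sees ExecutorLost and resubmits the task
    // while the coordinator still records this attempt as the committer.
  }
}
{code}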


was (Author: xukun):
[~jinxing6...@126.com] [~aash]
I've met the same problem. Here's the sequence of steps:
* The executor requests a task commit.
* The coordinator approves the commit and sends a response message.
* The executor finishes successfully and sends StatusUpdate to the driver.
* The executor is preempted by YARN before the StatusUpdate is received.
* The task is retried, and no other attempt will succeed.

I think this PR cannot resolve this scenario. Can you give me some advice?


[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-07 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855585#comment-15855585
 ] 

xukun edited comment on SPARK-18113 at 2/7/17 8:54 AM:
---

[~jinxing6...@126.com] [~aash]
I've met the same problem. Here's the sequence of steps:
* The executor requests a task commit.
* The coordinator approves the commit and sends a response message.
* The executor finishes successfully and sends StatusUpdate to the driver.
* The executor is preempted by YARN before the StatusUpdate is received.
* The task is retried, and no other attempt will succeed.

I think this PR cannot resolve this scenario. Can you give me some advice?


was (Author: xukun):
[~jinxing6...@126.com] [~aash]
I've met the same problem. Here's the sequence of steps:
* The executor requests a task commit.
* The coordinator approves the commit and sends a response message.
* The executor finishes successfully and sends StatusUpdate to the driver.
* The executor is preempted by YARN before the StatusUpdate is received.
* The task is retried, and no other attempt will succeed.

Can you give me some advice?
