[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread zhihai xu (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509802#comment-14509802 ]

zhihai xu commented on YARN-3536:
-

Is this issue similar to YARN-2834?

 ZK exception occur when updating AppAttempt status, then NPE thrown when RM 
 do recover
 --

 Key: YARN-3536
 URL: https://issues.apache.org/jira/browse/YARN-3536
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

 Here is a scenario in which the Application status is FAILED/FINISHED but the
 AppAttempt status is null. This causes an NPE during recovery when
 yarn.resourcemanager.work-preserving-recovery.enabled is set to true. The RM
 should handle recovery gracefully.
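
The inconsistency described above can be illustrated with a minimal Java sketch. It is not the YARN implementation and not a proposed patch: RecoverySketch, StoredAttempt, FinalState and findInconsistentAttempts are hypothetical names that only model "application has a final state, but an attempt's final state is null", so that a recovery path could detect the case before handing the attempt to the scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not the real YARN classes.
final class RecoverySketch {
    enum FinalState { FAILED, FINISHED, KILLED }

    /** Minimal stored record: an attempt id plus a possibly-null final state. */
    static final class StoredAttempt {
        final String id;
        final FinalState finalState; // null if the final update never reached the store

        StoredAttempt(String id, FinalState finalState) {
            this.id = id;
            this.finalState = finalState;
        }
    }

    /**
     * Returns the ids of attempts whose final state is missing even though
     * the owning application already has a final state.
     */
    static List<String> findInconsistentAttempts(FinalState appFinalState,
                                                 List<StoredAttempt> attempts) {
        List<String> inconsistent = new ArrayList<>();
        if (appFinalState == null) {
            // Application still running: a null attempt state is expected here.
            return inconsistent;
        }
        for (StoredAttempt attempt : attempts) {
            if (attempt.finalState == null) {
                inconsistent.add(attempt.id);
            }
        }
        return inconsistent;
    }
}
```

What to do with such an attempt (for example, treating it as finished together with its application) is a design decision this sketch deliberately leaves open.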



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510320#comment-14510320 ]

gu-chi commented on YARN-3536:
--

Thanks. Since the exception stack trace is almost the same, I had already
looked into that ticket. Its patch is already merged into the environment I am
using, so this is not the same cause.



[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508852#comment-14508852 ]

gu-chi commented on YARN-3536:
--

2015-04-21 03:52:31,395 | INFO  | AsyncDispatcher event handler | appattempt_1429597538411_0001_02 State change from RUNNING to FINAL_SAVING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 03:52:31,397 | INFO  | AsyncDispatcher event handler | Updating application application_1429597538411_0001 with final state: FINISHING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.rememberTargetTransitionsAndStoreState(RMAppImpl.java:988)
2015-04-21 03:52:31,397 | WARN  | main-SendThread(VM1228:24002) | Session 0xd4cdaa0557f0005 for server VM1228/9.91.12.28:24002, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1126)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1105)
2015-04-21 03:52:31,499 | INFO  | AsyncDispatcher event handler | Exception while executing a ZK operation. | org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1098)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1429597538411_0001/appattempt_1429597538411_0001_02
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:993)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1085)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:993)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:683)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:792)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:866)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:861)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
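
The trace above passes through a method named runWithRetries before the ConnectionLoss surfaces. As a rough illustration of that general pattern, and not the actual ZKRMStateStore code, a bounded retry loop might look like the sketch below. RetrySketch, its ConnectionLossException stand-in, and the maxRetries and retryIntervalMs parameters are all hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration of a bounded retry loop; not the Hadoop implementation.
final class RetrySketch {
    /** Stand-in for a transient connection-loss error from the store. */
    static final class ConnectionLossException extends IOException {
        ConnectionLossException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying on ConnectionLossException up to maxRetries
     * times. Once the retries are exhausted, the last exception is rethrown,
     * and the caller must then assume the update may never have been stored.
     */
    static <T> T runWithRetries(Callable<T> action, int maxRetries, long retryIntervalMs)
            throws Exception {
        int failures = 0;
        while (true) {
            try {
                return action.call();
            } catch (ConnectionLossException e) {
                failures++;
                if (failures > maxRetries) {
                    throw e;
                }
                Thread.sleep(retryIntervalMs); // wait before the next try
            }
        }
    }
}
```

If every retry fails, the attempt's final state is never written, which would be consistent with the "final state: null" seen for the same attempt during recovery.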


[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508855#comment-14508855 ]

gu-chi commented on YARN-3536:
--

2015-04-21 04:22:33,923 | INFO  | main-EventThread | Recovering app: application_1429597538411_0001 with 2 attempts and final state = FINISHED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2015-04-21 04:22:33,923 | INFO  | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_01 with final state: FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_02 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Create AMRMToken for ApplicationAttempt: appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createAndGetAMRMToken(AMRMTokenSecretManager.java:195)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | Creating password for appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createPassword(AMRMTokenSecretManager.java:307)
2015-04-21 04:22:33,924 | INFO  | main-EventThread | appattempt_1429597538411_0001_01 State change from NEW to FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 04:22:33,925 | INFO  | main-EventThread | Registering app attempt : appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerAppAttempt(ApplicationMasterService.java:656)
2015-04-21 04:22:33,925 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:533)
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
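
The trace above does not say which reference was null inside addApplicationAttempt. Assuming, purely for illustration, that the scheduler looks up the owning application in a map and the finished application was never registered there during recovery, the failure mode and a defensive guard can be sketched as follows. SchedulerSketch and its fields are hypothetical and do not mirror the CapacityScheduler code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a scheduler's bookkeeping; not the CapacityScheduler code.
final class SchedulerSketch {
    private final Map<String, String> queueByApp = new HashMap<>();
    private final Map<String, String> currentAttemptByApp = new HashMap<>();

    /** Registers an application with the queue it runs in. */
    void addApplication(String appId, String queue) {
        queueByApp.put(appId, queue);
    }

    /**
     * Attaches an attempt to a known application. Returns false, instead of
     * dereferencing a null lookup result, when the application is unknown.
     */
    boolean addApplicationAttempt(String appId, String attemptId) {
        String queue = queueByApp.get(appId);
        if (queue == null) {
            // Unknown application (assumed case: it finished and was never
            // re-added on recovery). Skip rather than fail the whole recovery.
            return false;
        }
        currentAttemptByApp.put(appId, attemptId);
        return true;
    }
}
```

Returning a status instead of throwing lets the caller decide whether to log and continue, which is one way recovery could proceed past a single inconsistent attempt.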



[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover

2015-04-23 Thread gu-chi (JIRA)

[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508900#comment-14508900 ]

gu-chi commented on YARN-3536:
--

Please assign this issue to me; I would like to fix it.
