[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259281#comment-14259281
 ] 

Karthik Kambatla commented on YARN-2992:
----------------------------------------

bq. one question: do we need to create a new zkClient object by calling 
createConnection, or is it OK to re-use the old one?
I thought about this while working on the patch. We probably don't need the 
call to createConnection, since the watcher would likely fire before the next 
retry or the one after it. However, given the frequency of session expiries 
and lost connections, I felt it should be okay to call createConnection 
explicitly. I don't think that will add significant overhead or lead to 
inaccuracies.
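
To make the trade-off concrete, here is a minimal, self-contained sketch of the retry-with-reconnect idea. It is not the code from the patch: Client, ClientFactory, the local SessionExpiredException, and the ZkRetrySketch class are hypothetical stand-ins so that the example runs without a ZooKeeper dependency. The only assumption carried over from the discussion is that a fresh client is created explicitly after a session expiry instead of waiting for the watcher to replace it.

```java
import java.util.ArrayList;
import java.util.List;

class ZkRetrySketch {

    /** Hypothetical stand-in for KeeperException.SessionExpiredException. */
    static class SessionExpiredException extends Exception {
        SessionExpiredException() { super("Session expired"); }
    }

    /** Hypothetical stand-in for a ZooKeeper client handle. */
    interface Client {
        void multi(String op) throws SessionExpiredException;
    }

    /** Hypothetical factory playing the role of createConnection. */
    interface ClientFactory {
        Client createConnection();
    }

    private final ClientFactory factory;
    private final int maxRetries;
    private Client client;
    int connectionsCreated = 0;

    ZkRetrySketch(ClientFactory factory, int maxRetries) {
        this.factory = factory;
        this.maxRetries = maxRetries;
        this.client = connect();
    }

    private Client connect() {
        connectionsCreated++;
        return factory.createConnection();
    }

    /** Runs op, replacing the client after each session expiry, for at most maxRetries retries. */
    void multiWithRetries(String op) throws SessionExpiredException {
        int retry = 0;
        while (true) {
            try {
                client.multi(op);
                return;
            } catch (SessionExpiredException e) {
                if (++retry > maxRetries) {
                    throw e; // out of retries: the caller decides what to do next
                }
                client = connect(); // explicit reconnect instead of waiting for a watcher
            }
        }
    }

    /** Demo: the first client's session is expired, later clients work. Returns connections created. */
    static int demo() throws SessionExpiredException {
        List<String> applied = new ArrayList<>();
        int[] created = {0};
        ClientFactory factory = () -> {
            boolean expired = (created[0]++ == 0); // only the first client is expired
            return op -> {
                if (expired) {
                    throw new SessionExpiredException();
                }
                applied.add(op);
            };
        };
        ZkRetrySketch store = new ZkRetrySketch(factory, 3);
        store.multiWithRetries("update-attempt-state");
        return store.connectionsCreated;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("connections created: " + demo());
    }
}
```

In this sketch the extra cost of reconnecting is one additional client per expiry, which is the overhead the comment above argues is acceptable.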


> ZKRMStateStore crashes due to session expiry
> --------------------------------------------
>
>                 Key: YARN-2992
>                 URL: https://issues.apache.org/jira/browse/YARN-2992
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired 
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
> at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958) 
> at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687) 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
