[ 
https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876806#comment-13876806
 ] 

Karthik Kambatla commented on YARN-1602:
----------------------------------------

We run into these when using the ZKRMStateStore. Below is a sample log. I just 
realized that ExitUtil.terminate doesn't log the cause, so I created YARN-1616 
to fix the logging issue. 
{code}
2014-01-18 05:18:47,955 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=jenkins, renewer=oozie mr token, realUser=oozie, issueDate=1390051127955, maxDate=1390655927955, sequenceNumber=154, masterKeyId=178
2014-01-18 05:18:47,955 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing RMDelegation token with sequence number: 154
2014-01-18 05:18:47,973 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED
2014-01-18 05:18:47,975 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}

The error doesn't seem to be a transient one: it looks like the zk-store 
retries when it encounters any KeeperException other than NoAuthException.
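
To make the retry behaviour concrete, here is a minimal sketch of a retry loop of that shape. It is illustrative only and is not the ZKRMStateStore code: KeeperLikeException, NoAuthLikeException and runWithRetries are hypothetical stand-ins so that the example compiles without ZooKeeper on the classpath, and the retry count is an assumed parameter.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch; these are NOT the real ZooKeeper or YARN classes. */
final class ZkRetrySketch {

    /** Stand-in for a generic KeeperException (assumption for illustration). */
    static class KeeperLikeException extends Exception {
        KeeperLikeException(String msg) { super(msg); }
    }

    /** Stand-in for the NoAuth case, which is not retried. */
    static class NoAuthLikeException extends KeeperLikeException {
        NoAuthLikeException(String msg) { super(msg); }
    }

    /**
     * Runs op, retrying up to maxRetries additional times on a
     * KeeperLikeException. A NoAuthLikeException is rethrown immediately.
     */
    static <T> T runWithRetries(Callable<T> op, int maxRetries) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return op.call();
            } catch (NoAuthLikeException e) {
                throw e; // never retried
            } catch (KeeperLikeException e) {
                if (++attempt > maxRetries) {
                    throw e; // retries exhausted: a non-transient error ends up here
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third call.
        String result = runWithRetries(() -> {
            if (calls[0]++ < 2) {
                throw new KeeperLikeException("simulated failure");
            }
            return "stored";
        }, 3);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With a loop like this, an error that still surfaces after the retries are exhausted is unlikely to be transient, which is the point made above.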

bq. With HA states now, we should ideally not kill the RM but just 
transitionToStandby().
True. Unfortunately, the second RM that takes over would try repeating the same 
operation and run into the same issue. Would it make sense to kill the 
application and clear the store of the offending operations, i.e. the 
store/update of app-related information (including tokens)?
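
For illustration only, the alternative could be sketched roughly as below. Everything in it is hypothetical: StoreFailureSketch, Action and handleFailure are made-up names, the map is a toy stand-in for the state store, and mapping the fenced case to a standby transition is an assumption for the sake of the example. Only the two failure-type names come from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are real YARN classes. */
final class StoreFailureSketch {

    /** The two failure kinds named in the issue description. */
    enum FailureType { STATE_STORE_FENCED, STATE_STORE_OP_FAILED }

    /** Possible reactions (illustrative names only). */
    enum Action { TRANSITION_TO_STANDBY, KILL_APPLICATION }

    // Toy in-memory stand-in for the state store: appId -> stored app data.
    private final Map<String, String> store = new HashMap<>();

    void storeApp(String appId, String data) {
        store.put(appId, data);
    }

    boolean hasApp(String appId) {
        return store.containsKey(appId);
    }

    /**
     * Fenced (assumed mapping): step down instead of exiting.
     * Op failed: drop the offending app's state so that an RM taking over
     * does not repeat the same failing operation, then kill the app.
     */
    Action handleFailure(FailureType type, String appId) {
        if (type == FailureType.STATE_STORE_FENCED) {
            return Action.TRANSITION_TO_STANDBY;
        }
        store.remove(appId); // clear app-related information, tokens included
        return Action.KILL_APPLICATION;
    }

    public static void main(String[] args) {
        StoreFailureSketch rm = new StoreFailureSketch();
        rm.storeApp("app_1", "app state and tokens");
        Action action = rm.handleFailure(FailureType.STATE_STORE_OP_FAILED, "app_1");
        System.out.println(action + ", app still stored: " + rm.hasApp("app_1"));
    }
}
```

The open question is whether removing the app's state is always safe; the sketch simply assumes it is.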

Let me re-run with YARN-1616 fixed and get more information on this. 

> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception, 
> either an RMFatalEvent.STATE_STORE_FENCED or an 
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in 
> the RM failing. Instead, we should probably kill the application 
> corresponding to the store operation. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
