Dustin Cote updated YARN-3934:
    Attachment: YARN-3934-1.patch

Here's a first attempt at the fix.  We cannot know with certainty what ZK has 
set for jute.maxbuffer on the server side, so we have to assume that it matches 
the client-side value (in this case, the RM's).  I've set up the code to read 
the property as a system property, which is how we normally specify it.  There 
may be a desire to standardize it into the YARN config later on, but I think 
that's outside the scope of this fix.  Without the patch, the ZK connection 
breaks and is retried *1000* times by default, so the RM doesn't go down for a 
while and all application submissions are blocked in the meantime.  I think 
it's probably worth revisiting that default value as well, but I'd like 
feedback from reviewers on whether we should open a separate JIRA for that.
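
For reviewers who want the idea without opening the patch, here is a minimal 
sketch of the kind of check described above.  It is not the patch itself: the 
class and method names are hypothetical, and it assumes the ZK server uses the 
same jute.maxbuffer as the client JVM.  The fallback of 0xfffff bytes is 
ZooKeeper's default when the property is unset.

```java
// Hypothetical helper, for illustration only; not taken from YARN-3934-1.patch.
final class JuteBufferCheck {
    // ZooKeeper's default for jute.maxbuffer when the property is not set
    // (0xfffff bytes, just under 1 MB).
    static final int DEFAULT_JUTE_MAX_BUFFER = 0xfffff;

    private JuteBufferCheck() {
    }

    // Reads jute.maxbuffer as a JVM system property, e.g. -Djute.maxbuffer=...
    // Assumption: the ZK server is configured with the same value.
    static int clientMaxBuffer() {
        return Integer.getInteger("jute.maxbuffer", DEFAULT_JUTE_MAX_BUFFER);
    }

    // Returns true if the serialized app state is no larger than the limit.
    // This is approximate: the real request also carries the znode path and
    // other overhead, so a payload right at the limit may still be refused.
    static boolean fitsInZnode(byte[] serializedState) {
        return serializedState.length <= clientMaxBuffer();
    }
}
```

A caller could run fitsInZnode() on the serialized state before attempting the 
ZK write and fail that one application instead of entering the retry loop.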

> Application with large ApplicationSubmissionContext can cause RM to exit when 
> ZK store is used
> ----------------------------------------------------------------------------------------------
>                 Key: YARN-3934
>                 URL: https://issues.apache.org/jira/browse/YARN-3934
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Dustin Cote
>         Attachments: YARN-3934-1.patch
> Use the following steps to test.
> 1. Set up ZK as the RM HA store.
> 2. Submit a job that refers to lots of distributed cache files with long HDFS 
> paths, which will cause the app state size to exceed ZK's max object size 
> limit.
> 3. RM can't write to ZK and exits with the following exception.
> {noformat}
> 2015-07-10 22:21:13,002 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:944)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:941)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1083)
> {noformat}
> In this case, RM could have rejected the app during submitApplication RPC if 
> the size of ApplicationSubmissionContext is too large.

This message was sent by Atlassian JIRA
