[ 
https://issues.apache.org/jira/browse/YARN-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155466#comment-16155466
 ] 

Rohith Sharma K S commented on YARN-7163:
-----------------------------------------

The exception trace below shows that the heap space error occurred while the 
RM was switching from standby to active.
{noformat}
2017-08-30 22:17:54,058 INFO  recovery.ZKRMStateStore 
(ZKRMStateStore.java:run(359)) - Fencing node 
/rmstore-secure/ZKRMStateRoot/RM_ZK_FENCING_LOCK doesn't exist to delete
2017-08-30 22:17:54,063 INFO  resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(596)) - Recovery started
2017-08-30 22:17:54,065 INFO  recovery.ZKRMStateStore 
(ZKRMStateStore.java:processWatchEvent(918)) - Watcher event type: None with 
state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2017-08-30 22:17:54,065 INFO  recovery.ZKRMStateStore 
(ZKRMStateStore.java:processWatchEvent(926)) - ZKRMStateStore Session connected
2017-08-30 22:17:54,065 INFO  recovery.RMStateStore 
(RMStateStore.java:checkVersion(639)) - Loaded RM state version info 1.2
2017-08-30 22:31:11,907 ERROR zookeeper.ClientCnxn 
(ClientCnxn.java:processEvent(625)) - Caught unexpected throwable
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuffer.append(StringBuffer.java:270)
        at java.io.StringWriter.write(StringWriter.java:112)
        at java.io.PrintWriter.write(PrintWriter.java:456)
        at java.io.PrintWriter.write(PrintWriter.java:473)
        at java.io.PrintWriter.print(PrintWriter.java:603)
        at java.io.PrintWriter.println(PrintWriter.java:756)
        at java.lang.Throwable$WrappedPrintWriter.println(Throwable.java:764)
        at java.lang.Throwable.printStackTrace(Throwable.java:658)
        at java.lang.Throwable.printStackTrace(Throwable.java:721)
        at 
org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
        at 
org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
        at 
org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
        at 
org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at 
org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:187)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1227)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getDataWithRetries(ZKRMStateStore.java:1058)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:618)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:603)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:472)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:601)
        at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
{noformat}

> RM crashes with OOM in secured cluster when HA is enabled
> ---------------------------------------------------------
>
>                 Key: YARN-7163
>                 URL: https://issues.apache.org/jira/browse/YARN-7163
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>
> It is observed that the RM crashes with a heap space OOM in a secure cluster 
> (HTTP authentication is Kerberos) when RM HA is enabled. 
> Scenario: 
> 1. Start the RM in HA secure mode. Let's say RM1 is in active mode.
> 2. Run many applications so that they use more than 50% of the configured 
> heap space. For example, if the heap space is 2GB, run applications that 
> occupy 1.5GB of heap space. 
> 3. Switch the RM to standby and bring it back to active. While recovering 
> applications from the state store, the RM crashes with OOM. 
> *Note*: This issue happens only when the RM is started as ACTIVE directly 
> (not switched from standby to active during start of the JVM).
> The heap dump shows that RMAuthenticationFilter holds 60% of the heap space 
> and the other 40% is held by RMAppState, which is loaded while recovering 
> from the state store. Together these exceed the heap space and the RM 
> crashes with OOM. 
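
The heap arithmetic in the scenario above can be sketched in a few lines of Java. This is only an illustration, not code from the ResourceManager. The class name HeapHeadroom and its methods are hypothetical. The 1.5GB recovery footprint is an assumption: it treats recovery as loading roughly the same amount of state again while the old state is still retained.

```java
public class HeapHeadroom {
    static final long GB = 1024L * 1024 * 1024;

    /** True if extraBytes can be allocated on top of usedBytes within maxBytes. */
    public static boolean fits(long usedBytes, long extraBytes, long maxBytes) {
        return usedBytes + extraBytes <= maxBytes;
    }

    /** Fraction of the heap limit that is currently in use. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        // Numbers from the scenario: 2GB heap, 1.5GB held after step 2.
        long max = 2 * GB;
        long retained = 3 * GB / 2;
        // Assumption: recovery needs about as much heap again as is retained.
        long recovery = 3 * GB / 2;
        System.out.println("retained fraction: " + usedFraction(retained, max));
        System.out.println("recovery fits: " + fits(retained, recovery, max));

        // The same check against the heap of the JVM running this sketch.
        Runtime rt = Runtime.getRuntime();
        long liveUsed = rt.totalMemory() - rt.freeMemory();
        System.out.println("this JVM uses " + liveUsed + " of " + rt.maxMemory() + " bytes");
    }
}
```

Under that assumption, once more than 50% of the heap stays retained across the standby-to-active switch, a second copy of comparable size cannot fit. That is consistent with the threshold given in step 2.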



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
