[
https://issues.apache.org/jira/browse/YARN-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155466#comment-16155466
]
Rohith Sharma K S commented on YARN-7163:
-----------------------------------------
The exception trace below shows that the heap space error occurred while
switching from standby to active.
{noformat}
2017-08-30 22:17:54,058 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:run(359)) - Fencing node /rmstore-secure/ZKRMStateRoot/RM_ZK_FENCING_LOCK doesn't exist to delete
2017-08-30 22:17:54,063 INFO resourcemanager.ResourceManager (ResourceManager.java:serviceStart(596)) - Recovery started
2017-08-30 22:17:54,065 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(918)) - Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2017-08-30 22:17:54,065 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(926)) - ZKRMStateStore Session connected
2017-08-30 22:17:54,065 INFO recovery.RMStateStore (RMStateStore.java:checkVersion(639)) - Loaded RM state version info 1.2
2017-08-30 22:31:11,907 ERROR zookeeper.ClientCnxn (ClientCnxn.java:processEvent(625)) - Caught unexpected throwable
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuffer.append(StringBuffer.java:270)
    at java.io.StringWriter.write(StringWriter.java:112)
    at java.io.PrintWriter.write(PrintWriter.java:456)
    at java.io.PrintWriter.write(PrintWriter.java:473)
    at java.io.PrintWriter.print(PrintWriter.java:603)
    at java.io.PrintWriter.println(PrintWriter.java:756)
    at java.lang.Throwable$WrappedPrintWriter.println(Throwable.java:764)
    at java.lang.Throwable.printStackTrace(Throwable.java:658)
    at java.lang.Throwable.printStackTrace(Throwable.java:721)
    at org.apache.log4j.DefaultThrowableRenderer.render(DefaultThrowableRenderer.java:60)
    at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
    at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
    at org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:187)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1227)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getDataWithRetries(ZKRMStateStore.java:1058)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadApplicationAttemptState(ZKRMStateStore.java:618)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadRMAppState(ZKRMStateStore.java:603)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.loadState(ZKRMStateStore.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:601)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1008)
{noformat}
> RM crashes with OOM in secured cluster when HA is enabled
> ---------------------------------------------------------
>
> Key: YARN-7163
> URL: https://issues.apache.org/jira/browse/YARN-7163
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
>
> It is observed that the RM crashes with a heap space OOM in a secure cluster
> (HTTP authentication is Kerberos) when RM HA is enabled.
> The scenario is:
> 1. Start the RM in HA secure mode. Let's say RM1 is in active mode.
> 2. Run many applications so that they use more than 50% of the configured
> heap space. For example, if the heap space is 2GB, run applications that
> occupy 1.5GB of heap space.
> 3. Switch the RM to standby and bring it back to active. While recovering
> applications from the state store, the RM crashes with OOM.
> *Note*: This issue happens only when the RM is started as ACTIVE directly
> (not switched from standby to active during start of the JVM).
> A heap dump shows that RMAuthenticationFilter holds 60% of the heap space,
> and the other 40% is held by RMAppState, which is populated while recovering
> from the state store. Together these exceed the heap space and the RM crashes
> with OOM.
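
The arithmetic behind the scenario can be sketched in a few lines of Java. This is an illustration only, not code from the ResourceManager: the class name HeapHeadroom and both helper methods are hypothetical. It assumes that the state from the earlier active period is still reachable when recovery starts, and that recovery needs roughly as much heap as the applications occupied before the switch. The main method also prints the current JVM's heap usage through the standard MemoryMXBean, which is one way to confirm that step 2 has pushed usage past 50%.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapHeadroom {

    /** Fraction of the heap in use, given used and maximum bytes. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("max heap size must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    /**
     * True if recovery fits in the heap while retainedBytes are still
     * reachable. Both inputs are estimates supplied by the caller.
     */
    public static boolean recoveryFits(long maxBytes, long retainedBytes,
                                       long recoveryBytes) {
        return retainedBytes + recoveryBytes <= maxBytes;
    }

    public static void main(String[] args) {
        // Current heap usage of this JVM; getMax() returns -1 if undefined.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        if (heap.getMax() > 0) {
            System.out.printf("heap used: %.0f%%%n",
                100 * usedFraction(heap.getUsed(), heap.getMax()));
        }

        // Numbers from the scenario: 2GB heap, 1.5GB occupied by applications.
        long heapBytes = 2L << 30;      // 2GB
        long retained = 3L << 29;       // 1.5GB, assumed still reachable
        long recovery = 3L << 29;       // 1.5GB, assumed size of recovered state
        System.out.println("recovery fits: "
            + recoveryFits(heapBytes, retained, recovery));
    }
}
```

Under these assumptions the two 1.5GB portions add up to 3GB against a 2GB heap, so recovery cannot fit, which matches the reported OOM once usage before the switch exceeds 50%.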