[
https://issues.apache.org/jira/browse/YARN-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310156#comment-16310156
]
Xuan Gong commented on YARN-7697:
---------------------------------
The issue happens after the file truncate step. It looks like the truncate API
returns false instead of throwing an exception, so we still read the corrupted
aggregated log.
While reading the logs, we allocate a byte array:
{code}
byte[] array = new byte[offset]; // this line throws OOM
fsDataIStream.seek(
    fileLength - offset - Integer.SIZE / Byte.SIZE - UUID_LENGTH);
{code}
So the offset is incorrect, probably an invalid and very large value, and
allocating an array of that size can cause the OOM in the NM.
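One possible guard, shown only as a minimal sketch and not as the actual patch:
check the offset against the file length before allocating. The class name
MetaLengthCheck and the method checkedMetaLength are hypothetical, and
uuidLength is passed in as a parameter standing in for UUID_LENGTH.

```java
import java.io.IOException;

// Hypothetical helper for illustration; not part of the Hadoop code base.
class MetaLengthCheck {
  // Size in bytes of the int length field, as in Integer.SIZE / Byte.SIZE.
  static final int LENGTH_FIELD_BYTES = Integer.SIZE / Byte.SIZE;

  /**
   * Returns offset unchanged if a meta block of that size can fit in a file
   * of fileLength bytes that also holds the length field and the UUID.
   * Throws IOException otherwise, so the caller can skip the corrupted file
   * instead of allocating a huge array.
   */
  static int checkedMetaLength(long fileLength, int offset, int uuidLength)
      throws IOException {
    long maxLength = fileLength - LENGTH_FIELD_BYTES - uuidLength;
    if (offset < 0 || offset > maxLength) {
      throw new IOException("Invalid meta length " + offset
          + " for aggregated log file of length " + fileLength);
    }
    return offset;
  }
}
```

With such a check in place, the allocation would become
new byte[checkedMetaLength(fileLength, offset, UUID_LENGTH)], and a corrupted
file would fail with an IOException rather than an OutOfMemoryError.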
> NM goes down with OOM due to leak in log-aggregation
> ----------------------------------------------------
>
> Key: YARN-7697
> URL: https://issues.apache.org/jira/browse/YARN-7697
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Santhosh B Gowda
> Assignee: Xuan Gong
>
> 2017-12-29 01:43:50,601 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[LogAggregationService #0,5,main] threw an Error. Shutting down now...
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.loadIndexedLogsMeta(LogAggregationIndexedFileController.java:823)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.loadIndexedLogsMeta(LogAggregationIndexedFileController.java:840)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriterInRolling(LogAggregationIndexedFileController.java:293)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.access$600(LogAggregationIndexedFileController.java:98)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:216)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
> at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:197)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:205)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:312)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:284)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2017-12-29 01:43:50,601 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application ap