[
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813661#comment-16813661
]
Prabhu Joseph commented on YARN-6929:
-------------------------------------
[~pbacsko] Thanks for reviewing. I have attached the 008 patch, which addresses
the comments above. Please review it when you get time.
Details below:
1. During an upgrade, a job will have some log files in the old app log dir and
some in the new app log dir. The reader logic has to return node log files from
both places.
If {{yarn.nodemanager.remote-app-log-dir-include-older}} is false, the logic
returns an Iterator over node files from the new app log dir only. Otherwise, it
returns a combined iterator that traverses both old and new log files.
{{nodeFilesPrev}} and {{nodeFilesCur}} are the iterators over the old and new
app log dirs, respectively.
I have added comments and made a few code changes to improve readability.
2. Have used {{diagnosticsMsg}} in all places.
3. {{nodeFilesCur}} can be null only if there is an {{IOException}} (the new app
log dir does not exist, or an error occurs while reading it). In that case, the
captured {{diagnosticsMsg}} is thrown; otherwise {{nodeFilesCur}} is returned.
4. {{diagnosticsMsg}} is appended to at most twice, and only with
{{IOException#getMessage()}}, which is short and carries no stack trace.
5. Have addressed this one.
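
For reviewers who want the behaviour in points 1 and 3 at a glance, here is a minimal sketch. It is not the code in the 008 patch: the class name {{NodeFileIteratorSketch}}, the helper names {{combine}} and {{getNodeFiles}}, the use of plain strings instead of file statuses, the old-before-new traversal order, and the handling of a null {{nodeFilesPrev}} as "no old files" are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical illustration only; names and signatures are not from the patch.
class NodeFileIteratorSketch {

  // Combined iterator: drains the old-dir iterator, then the new-dir iterator.
  // (The traversal order is an assumption; the comment only says "both".)
  static <T> Iterator<T> combine(final Iterator<T> nodeFilesPrev,
                                 final Iterator<T> nodeFilesCur) {
    return new Iterator<T>() {
      @Override
      public boolean hasNext() {
        return nodeFilesPrev.hasNext() || nodeFilesCur.hasNext();
      }

      @Override
      public T next() {
        if (nodeFilesPrev.hasNext()) {
          return nodeFilesPrev.next();
        }
        // Throws NoSuchElementException once both iterators are exhausted.
        return nodeFilesCur.next();
      }
    };
  }

  // includeOlder stands in for
  // yarn.nodemanager.remote-app-log-dir-include-older.
  static <T> Iterator<T> getNodeFiles(Iterator<T> nodeFilesPrev,
                                      Iterator<T> nodeFilesCur,
                                      boolean includeOlder,
                                      String diagnosticsMsg) throws IOException {
    if (nodeFilesCur == null) {
      // Point 3: a null new-dir iterator means reading the new dir failed,
      // so surface the captured diagnostics instead of returning null.
      throw new IOException(diagnosticsMsg);
    }
    if (!includeOlder || nodeFilesPrev == null) {
      // Assumption: a missing old dir is treated as "no old files".
      return nodeFilesCur;
    }
    return combine(nodeFilesPrev, nodeFilesCur);
  }

  public static void main(String[] args) throws IOException {
    Iterator<String> files = getNodeFiles(
        Arrays.asList("old/node1").iterator(),
        Arrays.asList("new/node1", "new/node2").iterator(),
        true, "no diagnostics");
    while (files.hasNext()) {
      System.out.println(files.next());
    }
  }
}
```

The combined iterator is lazy: nothing is copied into a list, so a caller that stops early never touches the remaining entries of either directory.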
> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -------------------------------------------------------------
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation
> Affects Versions: 2.7.3
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-6929-007.patch, YARN-6929-008.patch,
> YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, YARN-6929.3.patch,
> YARN-6929.4.patch, YARN-6929.5.patch, YARN-6929.6.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is
> not scalable. The maximum subdirectory limit is 1048576 by default (HDFS-6102).
> With a yarn.log-aggregation.retain-seconds retention of 7 days, there is a
> higher chance that LogAggregationService fails to create a new directory with
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is
> <yarn.nodemanager.remote-app-log-dir>/<user>/logs/<job_name>. This can be
> improved by adding the date as a subdirectory, like
> <yarn.nodemanager.remote-app-log-dir>/<user>/logs/<date>/<job_name>
> {code}
> WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Application failed to init aggregation
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 items=1048576
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 items=1048576
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
> {code}
> Thanks to Robert Mancuso for finding this issue.