[ https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113417#comment-16113417 ]
Jason Lowe commented on YARN-6929:
----------------------------------
bq. I think Max Bucket Size can be derived from
yarn.log-aggregation.retain-seconds (in days)
The whole point of bucketing here is to avoid a maximum limit per directory.
Trying to derive a bucket size from a time value like retention seconds means
we need to know the apps-per-second rate of the cluster, which we do not know.
In addition, it risks blowing the directory limit if that rate ends up higher
than what was calculated for a sustained period. It makes more sense to derive
the bucket size from the directory limit, since that's what's driving this
change.
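For example (hypothetical rates): a bucket sized assuming a sustained 1 app
per second would receive twice the entries it was calculated for if the
cluster actually sustains 2 apps per second, which is exactly how the default
1048576-item limit (HDFS-6102) gets blown.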
bq. And why we need two sub directories (app_id/ bucket_size) and
(app_id%bucket_size).
Sorry, we just need (app_id / bucket_size); we don't need the modulo.
So the bucket path for an app would be:
aggregation_log_root / user / cluster_timestamp / (app_id / bucket_size)
with at most bucket_size app directories in each bucket.
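To make the layout concrete, here is a minimal sketch, not the actual patch:
BUCKET_SIZE, the class, and the method name are mine for illustration, while
ApplicationId and Path come from the Hadoop APIs:
{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ApplicationId;

public class BucketPathSketch {
  // Hypothetical bucket size; a real implementation would read this from
  // configuration and derive it from the directory item limit.
  static final int BUCKET_SIZE = 10000;

  // aggregation_log_root / user / cluster_timestamp / (app_id / bucket_size)
  static Path getBucketDir(Path aggregationLogRoot, String user,
      ApplicationId appId) {
    Path userDir = new Path(aggregationLogRoot, user);
    Path clusterDir =
        new Path(userDir, Long.toString(appId.getClusterTimestamp()));
    // Integer division groups BUCKET_SIZE consecutive app ids per bucket.
    return new Path(clusterDir, Integer.toString(appId.getId() / BUCKET_SIZE));
  }
}
{code}
Since app ids are assigned sequentially per cluster timestamp, the integer
division fills one bucket completely before starting the next.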
The log deletion service can clean up empty bucket directories when it removes
the last app from a directory. The bucket delete case is a little tricky,
since we need to handle the scenario where we want to delete a bucket just as
a long-running app tries to aggregate to that same bucket; but as long as the
aggregation process creates the bucket if necessary and the deletion service
takes care to only delete empty bucket directories (i.e.: no recursive delete)
we should be fine.
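A minimal sketch of that cleanup rule (my illustration, not the deletion
service's actual code); FileSystem.delete with recursive=false is the key
piece, since it refuses to remove a non-empty directory:
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketCleanupSketch {
  // Remove a bucket directory only if it is empty.
  static void tryDeleteEmptyBucket(FileSystem fs, Path bucketDir) {
    try {
      // recursive=false means the delete fails on a non-empty directory,
      // so an app that just aggregated into this bucket is never harmed;
      // the empty-bucket delete simply loses the race and we move on.
      fs.delete(bucketDir, false);
    } catch (IOException e) {
      // Bucket was non-empty (or a transient error occurred); a later
      // deletion pass can retry once the remaining apps age out.
    }
  }
}
{code}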
This should handle any app submission rate up to the point where we are
aggregating bucket_size squared apps within the retention period. Beyond
that point we'd have more than bucket_size bucket directories.
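To put rough numbers on it (hypothetical bucket size): with bucket_size =
10,000, each bucket holds at most 10,000 app directories, and we can aggregate
up to 10,000^2 = 100,000,000 apps in the retention period before the number of
bucket directories itself exceeds 10,000.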
> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -------------------------------------------------------------
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
> Issue Type: Bug
> Components: log-aggregation
> Affects Versions: 2.7.3
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is
> not scalable. The default maximum subdirectory limit is 1048576 (HDFS-6102).
> With a yarn.log-aggregation.retain-seconds retention of 7 days, there is a
> good chance that LogAggregationService fails to create a new directory with
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is
> <yarn.nodemanager.remote-app-log-dir>/<user>/logs/<job_name>. This can be
> improved by adding the date as a subdirectory, e.g.
> <yarn.nodemanager.remote-app-log-dir>/<user>/logs/<date>/<job_name>
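> For example (hypothetical application id), logs for
> application_1500000000000_0001 submitted on 2017-08-03 would then land under
> <yarn.nodemanager.remote-app-log-dir>/<user>/logs/2017-08-03/application_1500000000000_0001.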
> {code}
> WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Application failed to init aggregation
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 items=1048576
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 items=1048576
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
> {code}
> Thanks to Robert Mancuso for finding this issue.