[
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258150#comment-15258150
]
Daniel Templeton commented on YARN-4958:
----------------------------------------
Thanks for the review, [~sjlee0]! I'm working on unit tests and cleaning up
the code right now. I hope to have an updated patch by tomorrow.
bq. does this work correctly if we're dealing with a non-jar entry in the
staging libjars directory?
Depends on the definition of "correctly". :) I defined * similarly to your
definition of * from HADOOP-12747: all JARs go in the classpath. This patch
also links all non-JARs from the working directory, but they are not added to
the classpath. I think that behavior is more consistent with HADOOP-12747 than
adding everything to the classpath.
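To illustrate the split between "linked" and "on the classpath", here is a
rough sketch. It assumes the split falls out of the JVM's own classpath
wildcard rule, under which a "dir/*" entry only picks up files ending in .jar
or .JAR. The class and method names are hypothetical and are not taken from the
patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of YARN-4958.001.patch.
class WildcardClasspath {
  // Same test the JVM applies when expanding a "dir/*" classpath entry:
  // only names ending in ".jar" or ".JAR" match.
  static boolean isJar(String fileName) {
    return fileName.endsWith(".jar") || fileName.endsWith(".JAR");
  }

  // Given the names linked into the working directory, return the subset
  // that a "$PWD/*" classpath entry would actually pick up.
  static List<String> classpathEntries(List<String> linkedNames) {
    List<String> jars = new ArrayList<>();
    for (String name : linkedNames) {
      if (isJar(name)) {
        jars.add(name);
      }
    }
    return jars;
  }
}
```

With this rule, a non-JAR file in the staging directory is still linked into
the working directory but never shows up on the classpath.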
bq. Would there be any cross-platform issues?
Good question. I was careful to keep the changes platform-agnostic, but who
knows. Probably worth testing.
bq. This is just noting that the size of a wildcard entry (as in
mapreduce.job.cache.files.filesizes) would be reported as 0.
Didn't notice that one. What would be the best behavior? Report the aggregate
file size?
> The file localization process should allow for wildcards to reduce the
> application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local
> resources even though they're all uploaded to the same directory in HDFS.
> When using tools like Crunch without an uber JAR or when trying to take
> advantage of the shared cache, the number of libraries can be quite large.
> We've seen many cases where we had to turn down the max number of
> applications to prevent ZK from running out of heap because of the size of
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the
> NM allow wildcards in the resource localization paths. Specifically, we
> propose to allow a path to have a final component (name) set to "*", which is
> interpreted by the NM as "download the full directory and link to every file
> in it from the job's working directory." This behavior is the same as the
> current behavior when using -libjars, but avoids explicitly listing every
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as
> "\*.jar" or "file\*", as having multiple entries for a single directory
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache. That
> work will be left to a future JIRA. Specifically, this JIRA only applies
> when a full directory is uploaded. Currently the shared cache does not
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of
> the -libjars switch and in paths added through the {{Job}} and
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of
> all file verification and localization. In the final step, the NM will query
> the localized directory to get a list of the files in "dir" such that each
> can be linked from the job's working directory. Since $PWD/\* is always
> included on the classpath, all JAR files in "dir" will be in the classpath.
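
The approach in the description above (treat "dir/*" as "dir" for verification
and localization, then link every localized file from the working directory)
could be sketched roughly as follows. This is a hypothetical illustration in
plain Java using java.nio; it is not the code from YARN-4958.001.patch, and the
class and method names are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from the patch.
class WildcardLinker {
  // True only when the final path component is exactly "*".
  // Patterns such as "*.jar" or "file*" are deliberately not treated as wildcards.
  static boolean isWildcard(String path) {
    return path.equals("*") || path.endsWith("/*");
  }

  // Treat "dir/*" as "dir"; any other path is returned unchanged.
  static String stripWildcard(String path) {
    if (!isWildcard(path)) {
      return path;
    }
    if (path.equals("*")) {
      return ".";
    }
    String dir = path.substring(0, path.length() - 2);
    return dir.isEmpty() ? "/" : dir;
  }

  // Final step: list the localized directory and create one symlink per
  // entry in the working directory. Returns the links that were created.
  static List<Path> linkAll(Path localizedDir, Path workDir) throws IOException {
    List<Path> links = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(localizedDir)) {
      for (Path entry : entries) {
        Path link = workDir.resolve(entry.getFileName());
        Files.createSymbolicLink(link, entry.toAbsolutePath());
        links.add(link);
      }
    }
    return links;
  }
}
```

Only the single "dir/*" entry would need to be recorded per directory; the
per-file list is rebuilt on the node from the localized directory contents.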
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)