[
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257160#comment-15257160
]
Sangjin Lee commented on YARN-4958:
-----------------------------------
Thanks for your proposal [~templedf]! This is an important improvement. I took
a look at the patch, and have some early feedback. I haven't had a chance to
run the patch against a cluster yet, however.
A quick note on the shared cache: I don't think this patch will break the
shared cache. This patch goes to work mostly after the shared cache has done
its part. The only caveat is that jobs whose resources are drawn heavily from
the shared cache would not benefit from this patch, since few of their
classpath files would come from the staging directory. But that's for later...
Also, I think HADOOP-12747 is largely orthogonal. It merely gives users a
shorthand to address a set of jars.
More questions and comments:
- We'll need unit tests for this.
- I suppose a test will quickly confirm this, but does this work correctly if
we're dealing with a non-jar entry in the staging libjars directory? I just
wanted to confirm that it gets added to the container-side classpath
explicitly. The Java classpath wildcard "\*" does not pick up non-jar
resources in the directory.
- Would there be any cross-platform issues? Have you had a chance to test it on
Windows specifically? At first glance, there was nothing obvious that might be
a platform-specific issue, but it would be good to double check.
- Just a note: the size of a wildcard entry (as in
{{mapreduce.job.cache.files.filesizes}}) would be reported as 0. This is an
existing behavior/issue with a directory entry.
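As a rough illustration of the non-jar concern above, the sketch below builds the kind of container-side classpath that would be needed. The class {{ClasspathSketch}} and its {{build}} method are hypothetical and are not taken from the patch. The only fact relied on is that the JVM expands a "\*" classpath entry to the files in that directory ending in .jar or .JAR, so anything else has to be listed explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the patch or of Hadoop.
final class ClasspathSketch {
    private ClasspathSketch() {}

    /**
     * Builds a classpath string for one localized directory: a single "dir/*"
     * entry covers the jars, and every other file is added by name.
     */
    static String build(String dir, List<String> fileNames, String separator) {
        List<String> entries = new ArrayList<>();
        // The JVM expands this entry to files ending in .jar or .JAR only.
        entries.add(dir + "/*");
        for (String name : fileNames) {
            boolean coveredByWildcard = name.endsWith(".jar") || name.endsWith(".JAR");
            if (!coveredByWildcard) {
                // Non-jar resources must appear explicitly or they are skipped.
                entries.add(dir + "/" + name);
            }
        }
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.jar", "extra.zip", "b.JAR");
        System.out.println(build("libjars", files, ":"));
    }
}
```

A unit test along these lines, with one non-jar file in the staging libjars directory, would confirm whether the patch produces the explicit entry.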
(ClientDistributedCacheManager.java)
- l.242-243: Would it be simpler to reset {{current}} to the parent directory
and simply invoke {{ancestorsHaveExecutePermissions()}} on it instead? Then,
{{getFileStatus}} doesn't need to change, and the stat cache would also have
only real paths (i.e. no "*" paths). Thoughts?
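The idea in the comment above can be sketched as follows. This is a minimal stand-in that uses plain strings instead of Hadoop's {{Path}}; the class {{WildcardPathSketch}} and the method {{resolveForPermissionCheck}} are hypothetical names, and the actual permission check is not reproduced here.

```java
// Hypothetical stand-in for illustration; plain strings replace Hadoop's Path.
final class WildcardPathSketch {
    static final String WILDCARD = "*";

    private WildcardPathSketch() {}

    /**
     * Returns the parent directory when the final path component is exactly
     * "*", and the path unchanged otherwise. The result is what a permission
     * check or stat cache would see, so no "*" path ever reaches them.
     */
    static String resolveForPermissionCheck(String path) {
        int slash = path.lastIndexOf('/');
        String name = path.substring(slash + 1);
        if (!WILDCARD.equals(name)) {
            return path; // regular file or directory, or a pattern such as "*.jar"
        }
        if (slash < 0) {
            return "."; // bare "*" refers to the current directory
        }
        if (slash == 0) {
            return "/"; // "/*" refers to the root directory
        }
        return path.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(resolveForPermissionCheck("/staging/job_1/libjars/*"));
    }
}
```

With this shape, the caller would swap in the resolved path once and pass it to the existing checks, which then need no wildcard handling of their own.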
(MRApps.java)
- l.323-329: Could there be multiple attempts for the same directory? Is this
for the case where both "dir" and "dir/*" are included in cache.files? I'm not
sure if you're addressing a new concern or an existing one.
- l.338-341: Why would there be a wildcard for the paths (which come from
{{mapreduce.job.classpath.files}})?
(ContainerExecutor.java)
- This would apply to *any* entries that have the wildcard, and would affect
things like {{PWD/*}} too?
> The file localization process should allow for wildcards to reduce the
> application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local
> resources even though they're all uploaded to the same directory in HDFS.
> When using tools like Crunch without an uber JAR or when trying to take
> advantage of the shared cache, the number of libraries can be quite large.
> We've seen many cases where we had to turn down the max number of
> applications to prevent ZK from running out of heap because of the size of
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the
> NM allow wildcards in the resource localization paths. Specifically, we
> propose to allow a path to have a final component (name) set to "*", which is
> interpreted by the NM as "download the full directory and link to every file
> in it from the job's working directory." This behavior is the same as the
> current behavior when using -libjars, but avoids explicitly listing every
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as
> "\*.jar" or "file\*", as having multiple entries for a single directory
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache. That
> work will be left to a future JIRA. Specifically, this JIRA only applies
> when a full directory is uploaded. Currently the shared cache does not
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of
> the -libjars switch and in paths added through the {{Job}} and
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of
> all file verification and localization. In the final step, the NM will query
> the localized directory to get a list of the files in "dir" such that each
> can be linked from the job's working directory. Since $PWD/\* is always
> included on the classpath, all JAR files in "dir" will be in the classpath.
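
The final step in the quoted description (list the localized directory, then link each file from the working directory) could look roughly like the sketch below. It uses {{java.nio.file}} purely for illustration; the class {{WildcardLinkSketch}} and the method {{linkAll}} are hypothetical, and this is not how the NM or the patch creates the links.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the NM's actual linking mechanism.
final class WildcardLinkSketch {
    private WildcardLinkSketch() {}

    /**
     * Creates one symbolic link in workDir for every entry in localizedDir
     * and returns the links that were created.
     */
    static List<Path> linkAll(Path localizedDir, Path workDir) throws IOException {
        List<Path> links = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(localizedDir)) {
            for (Path entry : entries) {
                // The link keeps the file's own name, so "$PWD/*" picks up the jars.
                Path link = workDir.resolve(entry.getFileName());
                Files.createSymbolicLink(link, entry.toAbsolutePath());
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WildcardLinkSketch <localizedDir> <workDir>
        for (Path link : linkAll(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(link);
        }
    }
}
```

Because the resource is recorded as a single "dir/\*" entry and only expanded at this point, the state store holds one entry per directory rather than one per file.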