[
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325157#comment-15325157
]
Sangjin Lee commented on YARN-4958:
-----------------------------------
Sorry, [~templedf], it took me a while to get back to this.
I finally got around to trying out the patch with a pseudo-distributed setup. I
can confirm that the main use cases seem to work correctly, including the case
of a non-jar resource in the staging directory. That said, I do have some
high-level comments as well as a couple of minor nits.
(1)
Regarding the decision to determine public-ness solely from the parent
directory in the wildcard case, I'm wondering whether that would have any
implications. It's probably not going to be common, but it is possible
that the directory is public but there may be files that are not readable by
others. Again, it's hard to imagine why one would do this, but if they did,
would it cause a security issue on localization or a localization failure?
Should we chalk that up to an unsupported setting? To be fair, I can see this
being an issue if a directory was specified (not a wildcard), too. In that
sense, we could say this is a manifestation of an existing issue... Thoughts?
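To make the scenario in (1) concrete, here is a minimal sketch of the kind of check that a directory-only public-ness test would skip. It is not code from the patch: it uses java.nio against a local POSIX filesystem as a stand-in for HDFS, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop or of the YARN-4958 patch.
final class PublicnessSketch {
    /**
     * Returns the immediate children of dir that are NOT readable by
     * "others". A non-empty result means that treating the whole directory
     * as public, based on the directory's own permissions, would cover
     * files that are not actually world-readable.
     */
    static List<Path> notWorldReadable(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                Set<PosixFilePermission> perms = Files.getPosixFilePermissions(child);
                if (!perms.contains(PosixFilePermission.OTHERS_READ)) {
                    result.add(child);
                }
            }
        }
        return result;
    }
}
```

Whether such a per-file check belongs on the client side, or whether the mixed-permission case should simply be documented as unsupported, is the open question above.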
(2)
With {{ContainerExecutor.java}}, what happens if the wildcarded directory has
further nested directories? It appears we're symlinking only at the immediate
child level. I suspect it would work correctly, but wanted to double check.
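For reference, the behavior I have in mind for (2) can be sketched as follows. This is an assumption-laden illustration rather than the logic in {{ContainerExecutor.java}}: it uses java.nio on a local filesystem, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of the YARN-4958 patch.
final class ChildLinkSketch {
    /**
     * Creates one symlink in workDir for each immediate child of
     * localizedDir (files and subdirectories alike) and returns the links.
     * Nothing below the first level is linked individually.
     */
    static List<Path> linkChildren(Path localizedDir, Path workDir) throws IOException {
        List<Path> links = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(localizedDir)) {
            for (Path child : children) {
                Path link = workDir.resolve(child.getFileName());
                Files.createSymbolicLink(link, child.toAbsolutePath());
                links.add(link);
            }
        }
        return links;
    }
}
```

In this sketch a nested directory becomes a single link, so its contents remain reachable through that link even though they are never linked one by one.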
(3)
Were you able to test it in the local job mode?
(4) {{ClientDistributedCacheManager.java}}
- l.303: replace {{System.out.println()}} with a logger statement
(5) {{DistributedCache.java}}
- l.295: typo: "it's" -> "its"
Finally, since most of the changes are really in MAPREDUCE, perhaps this JIRA
should be moved to the MAPREDUCE project. What do you think? If we really want
to follow the rules to the letter, we would need to create separate JIRAs for
all projects involved (HADOOP, YARN, and MAPREDUCE). I'd like to hear what you
think. On a related note, you may want to drop the changes to {{Path.java}} if
you can help it.
> The file localization process should allow for wildcards to reduce the
> application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4958
> URL: https://issues.apache.org/jira/browse/YARN-4958
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-4958.001.patch, YARN-4958.002.patch,
> YARN-4958.003.patch
>
>
> When using the -libjars option to add classes to the classpath, every library
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local
> resources even though they're all uploaded to the same directory in HDFS.
> When using tools like Crunch without an uber JAR or when trying to take
> advantage of the shared cache, the number of libraries can be quite large.
> We've seen many cases where we had to turn down the max number of
> applications to prevent ZK from running out of heap because of the size of
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the
> NM allow wildcards in the resource localization paths. Specifically, we
> propose to allow a path to have a final component (name) set to "*", which is
> interpreted by the NM as "download the full directory and link to every file
> in it from the job's working directory." This behavior is the same as the
> current behavior when using -libjars, but avoids explicitly listing every
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as
> "\*.jar" or "file\*", as having multiple entries for a single directory
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache. That
> work will be left to a future JIRA. Specifically, this JIRA only applies
> when a full directory is uploaded. Currently the shared cache does not
> handle directory uploads.
> This JIRA proposes to allow for wildcards both in the internal processing of
> the -libjars switch and in paths added through the {{Job}} and
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of
> all file verification and localization. In the final step, the NM will query
> the localized directory to get a list of the files in "dir" such that each
> can be linked from the job's working directory. Since $PWD/\* is always
> included on the classpath, all JAR files in "dir" will be in the classpath.
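
As an illustration of the path handling described above, a minimal sketch might look like the following. It works on plain strings rather than Hadoop's {{Path}} class, and the class and method names are hypothetical; it is not taken from any of the attached patches.

```java
// Hypothetical helper, not part of Hadoop or of the YARN-4958 patch.
final class WildcardPathSketch {
    static final String WILDCARD = "*";

    /** True only when the final path component is exactly "*". */
    static boolean isWildcard(String path) {
        return path.equals(WILDCARD) || path.endsWith("/" + WILDCARD);
    }

    /**
     * Returns the path to verify and localize: "dir/*" is treated as "dir".
     * Paths without a trailing "*" component, including partial patterns
     * such as "dir/*.jar", are returned unchanged.
     */
    static String localizationTarget(String path) {
        if (!isWildcard(path)) {
            return path;
        }
        int slash = path.lastIndexOf('/');
        if (slash < 0) {
            return ".";   // bare "*" refers to the current directory
        }
        if (slash == 0) {
            return "/";   // "/*" refers to the root directory
        }
        return path.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(localizationTarget("staging/libjars/*"));
        System.out.println(localizationTarget("staging/libjars/*.jar"));
    }
}
```

Only the exact component "*" is special in this sketch, which mirrors the decision above not to support more general patterns.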
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)