[ 
https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258346#comment-15258346
 ] 

Sangjin Lee commented on YARN-4958:
-----------------------------------

{quote}
Depends on the definition of correctly.  I defined * similar to your definition 
of * from HADOOP-12747: all JARs go in the classpath. This patch also links all 
non-JARs from the working directory, but they are not added to the classpath. I 
think that behavior is more consistent with HADOOP-12747 than adding everything 
to the classpath.
{quote}

Actually there is more to it. HADOOP-12747 is orthogonal to this.

Suppose the user specified
{noformat}
-libjars lib/foo.xml
{noformat}

The intent is to upload that file to the staging directory and make it part of 
the task classpath. So, (1) that file should be uploaded to the staging/libjars 
directory (and localized of course), and (2) more importantly it should be made 
part of the task classpath. As we noted, {{libjars/*}} will not pick up this 
file, because a classpath wildcard matches only JAR files. Therefore the task 
classpath should look something like

{noformat}
libjars/*:libjars/foo.xml:...
{noformat}

In other words, non-JAR entries in the libjars directory must be explicitly 
enumerated. I believe this is what {{addToClassPathIfNonJar()}} is all 
about. The patch should preserve this behavior. I suspect the current patch 
does, but could you confirm it with a test?
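
To make the expected result concrete, here is a rough sketch of the classpath 
construction described above. It is only an illustration, not the code in the 
patch or in {{addToClassPathIfNonJar()}}: the class name {{LibjarsClasspath}}, 
the method names, and the hard-coded ":" separator are all assumptions made 
for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hadoop.
public class LibjarsClasspath {

    // The JVM expands a "dir/*" classpath entry only to files whose
    // names end in .jar or .JAR, so anything else needs its own entry.
    static boolean coveredByWildcard(String fileName) {
        return fileName.endsWith(".jar") || fileName.endsWith(".JAR");
    }

    // Builds "dir/*" followed by one explicit entry per non-JAR file.
    // The ":" separator is an assumption (Unix-style classpath).
    static String buildClasspath(String dir, List<String> fileNames) {
        List<String> entries = new ArrayList<>();
        entries.add(dir + "/*");
        for (String name : fileNames) {
            if (!coveredByWildcard(name)) {
                entries.add(dir + "/" + name);
            }
        }
        return String.join(":", entries);
    }

    public static void main(String[] args) {
        // prints libjars/*:libjars/foo.xml
        System.out.println(
            buildClasspath("libjars", Arrays.asList("a.jar", "b.jar", "foo.xml")));
    }
}
```

A test for the patch could assert something along these lines: with a mix of 
JAR and non-JAR files under libjars, every non-JAR file still shows up as an 
explicit classpath entry.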

bq. Didn't notice that one. What would be the best behavior? Report the 
aggregate file size?

The ideal behavior would be to report the aggregate size of the files under 
the directory, but that would add complexity. Also, given that this is an 
existing issue, I'm fine with filing a separate JIRA to discuss and address it.
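
For reference, "aggregate size under the directory" would mean something like 
the following. This is a local-filesystem stand-in using {{java.nio}}, purely 
to illustrate the idea; it is not the HDFS code path, and the class and method 
names are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of Hadoop.
public class AggregateSize {

    // Sums the sizes of all regular files under dir, recursively.
    static long aggregateSize(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // Lambdas cannot throw checked exceptions.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("libjars");
        Files.write(dir.resolve("a.jar"), new byte[3]);
        Files.write(dir.resolve("foo.xml"), new byte[5]);
        System.out.println(aggregateSize(dir)); // prints 8
    }
}
```

Part of the complexity is that a recursive walk like this costs one metadata 
lookup per file, which is one more reason to handle it in a separate JIRA.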

> The file localization process should allow for wildcards to reduce the 
> application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4958
>                 URL: https://issues.apache.org/jira/browse/YARN-4958
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library 
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local 
> resources even though they're all uploaded to the same directory in HDFS.  
> When using tools like Crunch without an uber JAR or when trying to take 
> advantage of the shared cache, the number of libraries can be quite large.  
> We've seen many cases where we had to turn down the max number of 
> applications to prevent ZK from running out of heap because of the size of 
> the state store entries.
>
> Rather than listing all files independently, this JIRA proposes to have the 
> NM allow wildcards in the resource localization paths.  Specifically, we 
> propose to allow a path to have a final component (name) set to "*", which is 
> interpreted by the NM as "download the full directory and link to every file 
> in it from the job's working directory."  This behavior is the same as the 
> current behavior when using -libjars, but avoids explicitly listing every 
> file.
>
> This JIRA does not attempt to provide more general purpose wildcards, such as 
> "\*.jar" or "file\*", as having multiple entries for a single directory 
> presents numerous logistical issues.
>
> This JIRA also does not attempt to integrate with the shared cache.  That 
> work will be left to a future JIRA.  Specifically, this JIRA only applies 
> when a full directory is uploaded.  Currently the shared cache does not 
> handle directory uploads.
>
> This JIRA proposes to allow for wildcards both in the internal processing of 
> the -libjars switch and in paths added through the {{Job}} and 
> {{DistributedCache}} classes.
>
> The proposed approach is to treat a path, "dir/\*", as "dir" for purposes of 
> all file verification and localization.  In the final step, the NM will query 
> the localized directory to get a list of the files in "dir" such that each 
> can be linked from the job's working directory.  Since $PWD/\* is always 
> included on the classpath, all JAR files in "dir" will be in the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
