[
https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316777#comment-16316777
]
Jason Lowe commented on YARN-2185:
----------------------------------
Thanks for the patch!
This patch adds parallel copying of directories and the ability to ignore
timestamps, both of which IMHO are outside the scope of this JIRA. I think
those should be handled in separate JIRAs with appropriate unit tests.
Do we really want a while loop to pull off extensions? That's not how it
behaved before, and it will now do different, potentially unexpected, things
for a ".jar.gz", ".tgz.tgz", ".gz.jar", etc. Granted these are unlikely
filenames to encounter, but if a user somehow had such names working before,
this will change that behavior.
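To illustrate the concern, a one-pass strip (the helper name and suffix list below are hypothetical, not the actual patch code) preserves the old behavior for odd names, whereas a while loop would keep stripping:

```java
// Hypothetical helper: strip at most ONE known archive suffix, as the old
// code effectively did. A while loop would turn "app.tgz.tgz" into "app"
// instead of "app.tgz", silently changing behavior for such names.
public class SuffixDemo {
  private static final String[] SUFFIXES =
      { ".tar.gz", ".jar", ".zip", ".tgz", ".tar", ".gz" };

  public static String stripOneSuffix(String name) {
    for (String suffix : SUFFIXES) {
      if (name.endsWith(suffix)) {
        return name.substring(0, name.length() - suffix.length());
      }
    }
    return name;
  }
}
```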
Does it make sense to use an external process to unpack .jar and .zip files?
The original code did this via the JDK directly rather than requiring a
subprocess, and given there's a ZipInputStream and JarInputStream, it seems we
could do the same here with the input stream rather than requiring the overhead
of a subprocess and copying of data between streams.
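A minimal sketch of what in-process unpacking from the stream could look like with the JDK's {{ZipInputStream}} (class and method names here are illustrative, not the patch's):

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class StreamUnzip {
  // Unpack a zip/jar stream directly in-process, avoiding the overhead of
  // spawning a subprocess and copying data between streams.
  public static void unzip(InputStream in, Path destDir) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zis.getNextEntry()) != null) {
        Path out = destDir.resolve(entry.getName()).normalize();
        if (!out.startsWith(destDir)) {
          // Guard against entries that try to escape the target directory.
          throw new IOException("Entry outside target dir: " + entry.getName());
        }
        if (entry.isDirectory()) {
          Files.createDirectories(out);
        } else {
          Files.createDirectories(out.getParent());
          Files.copy(zis, out, StandardCopyOption.REPLACE_EXISTING);
        }
        zis.closeEntry();
      }
    }
  }
}
```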
This now logs an INFO message for all stderr and stdout from the subprocess
(i.e.: at least one line per entry in the archive), which is not something it
did before. This should be debug, if logged at all, and it requires buffering
all of the output. Some of our users have some crazy use cases with very
large/deep tarballs, and capturing all of tar's output would not be a trivial
amount of memory for the NM or a localizer heap, especially if there are
multiple of these running at the same time. If the log message is kept, the
verbose flag should be passed to the command only if the output will actually
be logged.
I think it would be better to explicitly create the directory from the Java
code rather than squirrel it into the tar command. We can provide a better
error message with more context if done directly, and as it is now the mkdir
can fail but the command blindly proceeds to run tar.
Wouldn't "cd _destpath_ && tar -xf" be better than "cd _destpath_; tar -xf"?
Otherwise if somehow the change directory fails it will end up unpacking to an
unexpected place.
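Putting the two points together, a hedged sketch (the method name is hypothetical) of creating the directory from Java and then relying on {{&&}} so tar never runs if the cd fails:

```java
import java.io.File;
import java.io.IOException;

public class TarCommand {
  // Create the destination from Java so we can raise a clear error with
  // context, then use "&&" so tar does not run if the cd itself fails.
  public static String[] buildUnpackCommand(File dest) throws IOException {
    if (!dest.isDirectory() && !dest.mkdirs()) {
      throw new IOException("Could not create destination directory " + dest);
    }
    return new String[] { "bash", "-c",
        "cd '" + dest.getAbsolutePath() + "' && tar -xf -" };
  }
}
```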
Is the "rm -rf" necessary?
Has this been tested on Windows? The zipOrJar path appears to be eligible for
Windows but uses shell syntax that {{cmd}} is not going to understand, like
semicolons between commands.
Does the single-quote escaping work properly? Escaping double quotes with a
backslash works within a double-quoted string, but the same is not true for
single-quoted strings. The shell does not interpret any characters in a
single-quoted string, even backslash characters. Therefore the only way I know
of to put a single-quote character in a single-quoted string is to terminate
the current string, insert an escaped single quote (since we're not in a
single-quote parsing context at this point), then start the single-quote string
up again. In other words, each {{'}} character being escaped within a
single-quote context becomes the four-character sequence {{'\''}}. For
example, single-quoting {{abc'def'ghi}} becomes {{'abc'\''def'\''ghi'}}.
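The escaping rule above is mechanical, so it can be captured in a one-line helper (name hypothetical):

```java
public class ShellQuote {
  // POSIX-safe single quoting: close the quoted string, emit an escaped
  // quote, then reopen it -- each ' in the argument becomes '\''.
  public static String singleQuote(String s) {
    return "'" + s.replace("'", "'\\''") + "'";
  }
}
```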
I think we should shut down the executor service if we created it; otherwise
we risk running out of threads before the garbage collector gets around to
noticing it can free everything up. Either that, or require the caller to
provide the executor. Note that if the caller provides a single-thread
executor (or some other executor that can only execute serially) then this can
deadlock if the process emits a lot of output on stderr. The thread trying to
consume stdout data is blocked until the child process completes, but the child
process is blocked trying to emit stderr output that the parent process is not
consuming. I wonder if it would be better to add a method to Shell that takes
an input stream so we could reuse all the subprocess handling code there.
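One way to avoid the deadlock regardless of the executor the caller supplies is to always drain stderr on a dedicated thread; a sketch (names hypothetical, not the patch code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainDemo {
  // Drain stderr on its own thread so a child that emits a lot of stderr
  // cannot deadlock a parent that is blocked reading stdout.
  public static byte[] runAndReadStdout(String... cmd)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).start();
    Thread stderrDrainer = new Thread(() -> {
      try (InputStream err = p.getErrorStream()) {
        byte[] buf = new byte[4096];
        while (err.read(buf) != -1) { /* discard, or log at DEBUG */ }
      } catch (IOException ignored) { }
    });
    stderrDrainer.start();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = p.getInputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
    stderrDrainer.join();
    p.waitFor();
    return out.toByteArray();
  }
}
```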
This debug log was upgraded to an info, yet it still has the debug check in
front of it. I'm not sure this should be upgraded to an info log as part of
this change.
{code}
if (LOG.isDebugEnabled()) {
  LOG.info(String.format("Starting to download %s %s %s",
      sCopy,
      resource.getType(),
      resource.getPattern()));
}
{code}
I don't see how the following code treats PATTERN files that don't end in
'.jar' like ARCHIVE files since the resource type check will still find PATTERN
instead of ARCHIVE when it falls through from the first {{if}} to the second:
{code}
if (resource.getType() == LocalResourceType.PATTERN) {
  if (destinationFile.endsWith(".jar")) {
    // Unpack and keep a copy of the whole jar for mapreduce
    String p = resource.getPattern();
    RunJar.unJarAndSave(inputStream, new File(destination.toUri()),
        source.getName(),
        p == null ? RunJar.MATCH_ANY : Pattern.compile(p));
    return;
  } else {
    LOG.warn("Treating [" + source + "] " +
        "as an archive even though it " +
        "was specified as PATTERN");
  }
}
if (resource.getType() == LocalResourceType.ARCHIVE) {
  if (FileUtil.decompress(
      inputStream, source.getName(), destination, executor)) {
    return;
  } else {
    LOG.warn("Cannot unpack [" + source + "] trying to copy");
  }
}
{code}
Is there a reason to implement the copy loop directly in the new unJar method
rather than calling copyBytes as the other one does?
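For comparison, the copy can be delegated to a shared utility rather than a hand-rolled loop; in Hadoop that would be {{IOUtils.copyBytes}}, and the plain-JDK equivalent (shown here so the sketch is self-contained) is {{InputStream.transferTo}}:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyDemo {
  // Delegate to a shared copy utility (IOUtils.copyBytes in Hadoop;
  // transferTo in the JDK since Java 9) instead of duplicating the loop.
  public static long copy(InputStream in, OutputStream out)
      throws IOException {
    return in.transferTo(out);
  }
}
```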
> Use pipes when localizing archives
> ----------------------------------
>
> Key: YARN-2185
> URL: https://issues.apache.org/jira/browse/YARN-2185
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Jason Lowe
> Assignee: Miklos Szegedi
> Attachments: YARN-2185.000.patch, YARN-2185.001.patch,
> YARN-2185.002.patch
>
>
> Currently the nodemanager downloads an archive to a local file, unpacks it,
> and then removes it. It would be more efficient to stream the data as it's
> being unpacked to avoid both the extra disk space requirements and the
> additional disk activity from storing the archive.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)