[
https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324545#comment-16324545
]
Jason Lowe commented on YARN-2185:
----------------------------------
Thanks for updating the patch!
bq. I was wondering, if it should be better for OS specific archives containing
symlinks for example.
symlinks is a good example of why it is important to use {{tar}} on Linux.
However if we're going to use OS tools to support ZIP files properly then we
should not be using the {{jar}} command. {{jar}} does not properly support
symlinks in a ZIP archive but {{unzip}} does. To be honest I don't see the
appeal of using the jar command for anything since I do not see how it is going
to do anything different than the JDK builtin support since it is part of the
JDK. As with the parallel copying and timestamp ignore support, I think it may
be best to have this patch focus on using pipes rather than also changing the
semantics of what tools are used to unpack the archives. We can discuss what
should be done with jars and zips in another JIRA.
Speaking of symbolic links, now that all supported releases are on JDK7 or
later, we can update unTarUsingJava to fix the lacking symlink support. I
filed HADOOP-15170 to track that.
The failure on the path containing a single quote seems overly paranoid.
Single quotes can be escaped in the POSIX shell code path per my previous
comment, e.g.:
{code}
filename.replace("'","'\\''")
{code}
Rather than do the subshell with an extra {{cd}}, could it be simplified to
just {{ tar -C }}?
It doesn't look like the following comment was addressed as runCommandOnStream
is still creating an executor sometimes yet does not clean it up in those
cases. The Javadoc should also be updated to document the deadlock potential.
{quote}
I think we should shutdown the executor service if we created it, otherwise we
risk running out of threads before the garbage collector gets around to
noticing it can free everything up. Either that or require the caller to
provide the executor.
{quote}
Speaking of deadlock, if debug logging is not enabled then this can deadlock in
the manner I described above. That's one of the reasons why I think leveraging
Shell isn't a terrible idea. Wielding ProcessBuilder and Process properly is
trickier than most people believe. It also side-steps a lot of the executor
management stuff. I'm OK if you still think it's better to roll our own here.
"parallelVerifyAndCopy" is no longer parallel so the name and javadoc is
misleading. It also does not throw InterruptedException nor ExecutionException.
It's weird to allow the user to specify a downloader thread count yet we setup
and teardown the executor for every unpack call and the unpack call is only
going to use two threads. Essentially it shouldn't be a config since any value
but 2 is either wrong or wasteful.
Speaking of executors, would it makes more sense for FSDownload to be able to
leverage an existing executor rather than creating its own? All of the
existing usages of FSDownload are using an executor of one form or another, so
it'd be nice not to have to keep building and tearing these down for every call.
I'm still confuse about the recursive destination directory delete that was
added. It was not there before, and I don't think it's appropriate here. What
are we worried is going to be there, and if somehow something is there, are we
sure it's OK to just blindly remove it?
destionationTmp s/b destinationTmp
FSDownloadException is no longer needed.
> Use pipes when localizing archives
> ----------------------------------
>
> Key: YARN-2185
> URL: https://issues.apache.org/jira/browse/YARN-2185
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Jason Lowe
> Assignee: Miklos Szegedi
> Attachments: YARN-2185.000.patch, YARN-2185.001.patch,
> YARN-2185.002.patch, YARN-2185.003.patch, YARN-2185.004.patch
>
>
> Currently the nodemanager downloads an archive to a local file, unpacks it,
> and then removes it. It would be more efficient to stream the data as it's
> being unpacked to avoid both the extra disk space requirements and the
> additional disk activity from storing the archive.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]