[ https://issues.apache.org/jira/browse/YARN-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496398#comment-15496398 ]
Eric Badger commented on YARN-5641:
-----------------------------------
[~jlowe] and I worked on this for some time yesterday, and killing the spawned
untar shell process is proving to be very difficult. The localizer spawns the
untar thread, which runs the untar command through a shell exec. Once the
container is killed, the localizer is instructed to die the next time it
heartbeats to the NM. Inside the 'die' codepath, the localizer interrupts all
of its spawned threads using the cancel() method. However, the untar thread is
stuck in blocking file I/O, waiting to parse the result of the shell execution,
and is uninterruptible. The untar thread won't get the InterruptedException
until the read is finished, so we cannot kill it or the untar shell exec before
it completes. We could have the localizer process wait for the untar thread to
end via awaitTermination() (currently it only calls shutdownNow()), but that
won't return until untar finishes on its own, since shutdownNow() is unable to
interrupt the untar thread.
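The behavior described above can be reproduced outside of the localizer. The
following is a minimal sketch, not the localizer code: it assumes a Unix-like
host with a sleep binary on the PATH, and the class and method names
(BlockedReadDemo, interruptStopsReader) are hypothetical. A pool thread blocks
reading the stdout pipe of a child process, the pool is stopped with
shutdownNow(), and awaitTermination() is given a short timeout.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BlockedReadDemo {
  /**
   * Returns true if the pool terminated within waitMillis after
   * shutdownNow(), i.e. if the interrupt actually stopped the reader.
   */
  static boolean interruptStopsReader(long waitMillis) throws Exception {
    // "sleep 5" stands in for a long-running untar; it writes nothing to
    // stdout, so the reader below blocks until the child exits.
    Process proc = new ProcessBuilder("sleep", "5").start();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.submit(() -> {
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(proc.getInputStream()))) {
        while (r.readLine() != null) {
          // drain the child's output, as a shell result parser would
        }
      }
      return null;
    });

    Thread.sleep(200);   // give the worker time to block in the read
    pool.shutdownNow();  // interrupts the worker thread
    boolean terminated =
        pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);

    // Clean up so the demo itself does not leave a child process behind.
    proc.destroyForcibly();
    return terminated;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("terminated after interrupt: "
        + interruptStopsReader(500));
  }
}
```

If the read really is uninterruptible, as the stack trace in this comment
indicates, the method should return false: the worker stays in the native read
until the child exits, regardless of the interrupt.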
I tested this by replacing the untar shell command with a sleep command so that
there would be no worry about the untar actually finishing. The container was
killed and the localizer was instructed to die on the subsequent NM heartbeat.
It then attempted to shut down all of its threads, but the untar thread sat in
readBytes instead of getting the InterruptedException. Below is the stack trace
of the untar thread just after the localizer calls shutdown(). It never gets
the InterruptedException and sits in this stack trace until awaitTermination
hits its timeout and the localizer kills the JVM. Since we never catch the
InterruptedException, we are unable to destroy the untar shell process, and it
continues to run after the localizer and untar thread are killed (it is
reparented to init).
{noformat}
"ContainerLocalizer Downloader" #19 prio=5 os_prio=0 tid=0x00007f4315169800 nid=0x1530 runnable [0x00007f42f5217000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x000000076f4fca28> (a java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x000000076f506cf8> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <0x000000076f506cf8> (a java.io.InputStreamReader)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.fs.FileUtil.unTarUsingTar(FileUtil.java:682)
at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:651)
at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:283)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
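The stack trace shows the thread parked in a read on the child's output pipe,
which suggests it can only return once that pipe reaches end-of-file or is
closed. One possible direction, sketched here under assumptions and not taken
from this comment or from any patch, is to keep a handle on the child Process
and destroy it from the shutdown path instead of relying on the interrupt. The
sketch uses only JDK classes; the names DestroyChildDemo and
destroyUnblocksReader are hypothetical, and it assumes a Unix-like host with a
sleep binary on the PATH.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class DestroyChildDemo {
  /**
   * Returns true if the pool terminated within waitMillis once the child
   * process was destroyed from the shutting-down thread.
   */
  static boolean destroyUnblocksReader(long waitMillis) throws Exception {
    // "sleep 30" stands in for a long-running untar that prints nothing.
    Process proc = new ProcessBuilder("sleep", "30").start();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.submit(() -> {
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(proc.getInputStream()))) {
        while (r.readLine() != null) {
          // drain the child's output until the pipe is closed
        }
      }
      return null;
    });

    Thread.sleep(200);   // give the worker time to block in the read
    pool.shutdownNow();  // the interrupt alone is not expected to help
    proc.destroy();      // ending the child closes its end of the pipe

    // The reader should now see end-of-file (or a closed stream) and exit.
    return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    System.out.println("terminated after destroy: "
        + destroyUnblocksReader(5000));
  }
}
```

This only works when the shutting-down code can reach the Process object, and
a child that forks further processes holding the same pipe open could still
keep the reader blocked, so it is an illustration of the idea rather than a
fix.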
> Localizer leaves behind tarballs after container is complete
> ------------------------------------------------------------
>
> Key: YARN-5641
> URL: https://issues.apache.org/jira/browse/YARN-5641
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Eric Badger
> Assignee: Eric Badger
>
> The localizer sometimes fails to clean up extracted tarballs, leaving large
> footprints that persist on the nodes indefinitely.