[ 
https://issues.apache.org/jira/browse/YARN-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-6078:
---------------------------------
    Attachment: YARN-6078.002.patch

In the new patch I have added a unit test. I also discovered an issue with the 
previous patch: if destroying the child shell causes the LocalizerRunner run 
method to reach its finally block before the subsequent super.interrupt() call 
occurs, the interrupt may prevent the rest of the cleanup from being 
performed. The new patch propagates the interrupt only when a shell has not 
been successfully destroyed.
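
The rule described above (propagate the interrupt only when no shell was 
destroyed) can be sketched roughly as follows. This is a minimal illustration 
and not the code in the attached patch: FakeShell, Runner, and demo are 
hypothetical names, the child shell is simulated with a latch instead of a 
real process, and the cleanup is modelled as a short interruptible sleep.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; FakeShell and Runner are hypothetical names,
// not classes from the Hadoop code base or from the attached patch.
class InterruptSketch {

  /** Stand-in for the child shell: destroy() releases whoever waits on it. */
  static class FakeShell {
    private final CountDownLatch exited = new CountDownLatch(1);

    void waitForExit() throws InterruptedException {
      exited.await();
    }

    /** Returns true when the shell was destroyed. */
    boolean destroy() {
      exited.countDown();
      return true;
    }
  }

  /** Stand-in for the runner thread that blocks on the shell. */
  static class Runner extends Thread {
    private final FakeShell shell; // null models "no shell to destroy"
    final AtomicBoolean cleanupFinished = new AtomicBoolean(false);

    Runner(FakeShell shell) {
      this.shell = shell;
    }

    @Override
    public void run() {
      try {
        if (shell != null) {
          shell.waitForExit();   // blocks until the shell is destroyed
        } else {
          Thread.sleep(60_000);  // blocks until interrupted
        }
      } catch (InterruptedException e) {
        // Woken by a propagated interrupt; fall through to cleanup.
      } finally {
        try {
          Thread.sleep(50);      // stands in for interruptible cleanup work
          cleanupFinished.set(true);
        } catch (InterruptedException e) {
          // A late interrupt would abort the cleanup here.
        }
      }
    }

    @Override
    public void interrupt() {
      // Destroying the shell already unblocks run(), so the interrupt is
      // propagated only when there was nothing to destroy.
      boolean destroyed = shell != null && shell.destroy();
      if (!destroyed) {
        super.interrupt();
      }
    }
  }

  /** Starts a runner, interrupts it, and reports whether cleanup completed. */
  static boolean demo(boolean withShell) {
    Runner runner = new Runner(withShell ? new FakeShell() : null);
    runner.start();
    runner.interrupt();
    try {
      runner.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !runner.isAlive() && runner.cleanupFinished.get();
  }

  public static void main(String[] args) {
    System.out.println("shell destroyed, cleanup finished: " + demo(true));
    System.out.println("no shell, cleanup finished: " + demo(false));
  }
}
```

In this sketch the runner is unblocked either by the destroyed shell or by the 
propagated interrupt, never by both, so the cleanup in the finally block is 
not cut short by an interrupt that arrives late.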

> Containers stuck in Localizing state
> ------------------------------------
>
>                 Key: YARN-6078
>                 URL: https://issues.apache.org/jira/browse/YARN-6078
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Billie Rinaldi
>         Attachments: YARN-6078.001.patch, YARN-6078.002.patch
>
>
> I encountered an interesting issue in one of our YARN clusters, where 
> containers are stuck in the localizing phase.
> Our AM requests a container and starts a process using the NMClient.
> According to the NM, the container is in LOCALIZING state:
> {code}
> 2017-01-09 22:06:18,362 [INFO] [AsyncDispatcher event handler] 
> container.ContainerImpl.handle(ContainerImpl.java:1135) - Container 
> container_e03_1481261762048_0541_02_000060 transitioned from NEW to LOCALIZING
> 2017-01-09 22:06:18,363 [INFO] [AsyncDispatcher event handler] 
> localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:711)
>  - Created localizer for container_e03_1481261762048_0541_02_000060
> 2017-01-09 22:06:18,364 [INFO] [LocalizerRunner for 
> container_e03_1481261762048_0541_02_000060] 
> localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1191)
>  - Writing credentials to the nmPrivate file 
> /../..//.nmPrivate/container_e03_1481261762048_0541_02_000060.tokens. 
> Credentials list:
> {code}
> According to the RM, the container is in RUNNING state:
> {code}
> 2017-01-09 22:06:17,110 [INFO] [IPC Server handler 19 on 8030] 
> rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - 
> container_e03_1481261762048_0541_02_000060 Container Transitioned from 
> ALLOCATED to ACQUIRED
> 2017-01-09 22:06:19,084 [INFO] [ResourceManager Event Processor] 
> rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:410) - 
> container_e03_1481261762048_0541_02_000060 Container Transitioned from 
> ACQUIRED to RUNNING
> {code}
> When I click through the YARN RM UI to view the logs for the container, I 
> get the following error:
> {code}
> No logs were found. state is LOCALIZING
> {code}
> The NodeManager's stack trace seems to indicate that the NM's 
> LocalizerRunner is stuck waiting to read from the sub-process's output stream.
> {code}
> "LocalizerRunner for container_e03_1481261762048_0541_02_000060" #27007081 
> prio=5 os_prio=0 tid=0x00007fa518849800 nid=0x15f7 runnable 
> [0x00007fa5076c3000]
>    java.lang.Thread.State: RUNNABLE
>       at java.io.FileInputStream.readBytes(Native Method)
>       at java.io.FileInputStream.read(FileInputStream.java:255)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>       - locked <0x00000000c6dc9c50> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
>       at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>       at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>       at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>       - locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
>       at java.io.InputStreamReader.read(InputStreamReader.java:184)
>       at java.io.BufferedReader.fill(BufferedReader.java:161)
>       at java.io.BufferedReader.read1(BufferedReader.java:212)
>       at java.io.BufferedReader.read(BufferedReader.java:286)
>       - locked <0x00000000c6dc9c78> (a java.io.InputStreamReader)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
>       at org.apache.hadoop.util.Shell.run(Shell.java:479)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
> {code}
> I ran {code}ps aux{code} and confirmed that the container-executor process 
> with INITIALIZE_CONTAINER that the localizer starts was not running. It seems 
> that the output stream pipe of the process is still open, even though the 
> localizer process is no longer present.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)