[ https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Vasudev updated YARN-2821:
--------------------------------
    Attachment: apache-yarn-2821.1.patch

Thanks for the review, Jian! I considered changing the comparison, but that 
feels like treating the symptom. I'd like to get it working correctly without 
changing the comparison, if possible. 

Thanks for pointing out the increment in onStartContainerError. I've addressed 
that and made some further fixes in the latest patch.

> Distributed shell app master becomes unresponsive sometimes
> -----------------------------------------------------------
>
>                 Key: YARN-2821
>                 URL: https://issues.apache.org/jira/browse/YARN-2821
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>    Affects Versions: 2.5.1
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: apache-yarn-2821.0.patch, apache-yarn-2821.1.patch
>
>
> We've noticed that once in a while the distributed shell app master becomes 
> unresponsive and is eventually killed by the RM. A snippet of the logs:
> {noformat}
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: 
> appattempt_1415123350094_0017_000001 received 0 previous attempts' running 
> containers on AM registration.
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez2:45454
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000002, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=3
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000003, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000004, 
> containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000005, 
> containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez3:45454
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, completedCnt=1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: 
> appattempt_1415123350094_0017_000001 got container status for 
> containerID=container_1415123350094_0017_01_000002, state=COMPLETE, 
> exitStatus=0, diagnostics=
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Container 
> completed successfully., containerId=container_1415123350094_0017_01_000002
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=2
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000006, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000007, 
> containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000007
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000006
> {noformat}
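
One thing the excerpt above shows is a mismatch in the counts: five "Requested container ask" lines, but allocatedCnt values of 1, 3 and 2, which matches the six container IDs (000002 through 000007). The sketch below tallies both numbers from a log. It is only an illustration: the helper name tally and the SAMPLE string (an abridged copy of the lines above) are made up for this example and are not part of the distributed shell code, and the message does not say whether this mismatch is the cause of the hang.

```python
import re

# Abridged from the excerpt above: five asks, then three RM responses.
# The e-mail quoting ("> ") and line wrapping are kept on purpose.
ASK = (
    "> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested\n"
    "> container ask: Capability[<memory:10, vCores:1>]Priority[0]\n"
)
RESPONSE = (
    "> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from\n"
    "> RM for container ask, allocatedCnt={n}\n"
)
SAMPLE = ASK * 5 + "".join(RESPONSE.format(n=n) for n in (1, 3, 2))


def tally(log):
    """Return (asks issued, containers allocated) found in an AM log."""
    # Drop the "> " quote prefix and undo the wrapping, so a message that
    # was split across two lines can be matched as one piece of text.
    text = " ".join(line.lstrip("> ").strip() for line in log.splitlines())
    asks = len(re.findall(r"Requested container ask:", text))
    allocated = sum(int(n) for n in re.findall(r"allocatedCnt=(\d+)", text))
    return asks, allocated


if __name__ == "__main__":
    asks, allocated = tally(SAMPLE)
    print(f"asks={asks} allocated={allocated}")
```

Run against the full AM log rather than the abridged sample, the same two counts give a quick check of whether the RM handed back more containers than were asked for.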



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
