[ 
https://issues.apache.org/jira/browse/YARN-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543206#comment-14543206
 ] 

Jian He commented on YARN-2821:
-------------------------------

The current patch makes sense because there's no way to figure out what 
finished in previous attempts other than persisting it. But I wonder whether 
that is overkill for an example app. One idea: inside onContainersCompleted, 
we filter out containers finished by previous attempts and count only the 
current attempt's finished containers. Then, in the finish() method, we compare 
numFinishedContainersOfCurrentAttempt == numContainersAskedByCurrentAttempt. 
Will this work? I know that with this approach the total number of finished 
containers may be larger than what the user specified, but given that this is 
just an example app, maybe that is tolerable?
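
A rough sketch of the filtering idea above, not taken from the patch. 
CompletedContainer and AttemptScopedTracker are made-up stand-ins rather than 
YARN classes, and the attempt id is modeled as a plain int; in the real AM the 
attempt would presumably be read from each ContainerStatus via 
getContainerId().getApplicationAttemptId().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a completed-container report; not a YARN class.
final class CompletedContainer {
    final int attemptId;     // attempt that launched the container
    final int containerNum;  // container sequence number within the app

    CompletedContainer(int attemptId, int containerNum) {
        this.attemptId = attemptId;
        this.containerNum = containerNum;
    }
}

// Hypothetical tracker that counts only the current attempt's containers.
final class AttemptScopedTracker {
    private final int currentAttemptId;
    private final int numContainersAskedByCurrentAttempt;
    private final AtomicInteger numFinishedContainersOfCurrentAttempt =
        new AtomicInteger();

    AttemptScopedTracker(int currentAttemptId, int numContainersAsked) {
        this.currentAttemptId = currentAttemptId;
        this.numContainersAskedByCurrentAttempt = numContainersAsked;
    }

    // Mirrors the shape of onContainersCompleted: skip containers that
    // belong to a previous attempt, count the rest.
    void onContainersCompleted(List<CompletedContainer> completed) {
        for (CompletedContainer c : completed) {
            if (c.attemptId != currentAttemptId) {
                continue; // launched by a previous attempt: ignore
            }
            numFinishedContainersOfCurrentAttempt.incrementAndGet();
        }
    }

    // The comparison proposed for finish().
    boolean isDone() {
        return numFinishedContainersOfCurrentAttempt.get()
            == numContainersAskedByCurrentAttempt;
    }

    public static void main(String[] args) {
        // Attempt 2 asked for 2 containers.
        AttemptScopedTracker tracker = new AttemptScopedTracker(2, 2);
        // One leftover report from attempt 1, one from attempt 2.
        tracker.onContainersCompleted(Arrays.asList(
            new CompletedContainer(1, 2), new CompletedContainer(2, 3)));
        System.out.println(tracker.isDone());
        tracker.onContainersCompleted(
            Arrays.asList(new CompletedContainer(2, 4)));
        System.out.println(tracker.isDone());
    }
}
```

The sketch ignores how many containers the new attempt should ask for; it only 
shows that completions reported for an earlier attempt do not count toward the 
current attempt's total.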

On the other hand, the current patch may not work on a secure cluster because 
it communicates with HDFS. The {{renameScriptFile}} method is an example of how 
to talk to HDFS.

> Distributed shell app master becomes unresponsive sometimes
> -----------------------------------------------------------
>
>                 Key: YARN-2821
>                 URL: https://issues.apache.org/jira/browse/YARN-2821
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>    Affects Versions: 2.5.1
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-2821.002.patch, YARN-2821.003.patch, 
> apache-yarn-2821.0.patch, apache-yarn-2821.1.patch
>
>
> We've noticed that once in a while the distributed shell app master becomes 
> unresponsive and is eventually killed by the RM. A snippet of the logs:
> {noformat}
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: 
> appattempt_1415123350094_0017_000001 received 0 previous attempts' running 
> containers on AM registration.
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:37 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[<memory:10, vCores:1>]Priority[0]
> 14/11/04 18:21:38 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez2:45454
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000002, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:38 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000002
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.AMRMClientImpl: Received new token for : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=3
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000003, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000004, 
> containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000005, 
> containerNode=onprem-tez4:45454, containerNodeURI=onprem-tez4:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> START_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez3:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000005
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez4:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000003
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez2:45454
> 14/11/04 18:21:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: 
> QUERY_CONTAINER for Container container_1415123350094_0017_01_000004
> 14/11/04 18:21:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> onprem-tez3:45454
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, completedCnt=1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: 
> appattempt_1415123350094_0017_000001 got container status for 
> containerID=container_1415123350094_0017_01_000002, state=COMPLETE, 
> exitStatus=0, diagnostics=
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Container 
> completed successfully., containerId=container_1415123350094_0017_01_000002
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Got response from 
> RM for container ask, allocatedCnt=2
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000006, 
> containerNode=onprem-tez2:45454, containerNodeURI=onprem-tez2:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Launching shell 
> command on a new container., 
> containerId=container_1415123350094_0017_01_000007, 
> containerNode=onprem-tez3:45454, containerNodeURI=onprem-tez3:50060, 
> containerResourceMemory1024, containerResourceVirtualCores1
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000007
> 14/11/04 18:21:40 INFO distributedshell.ApplicationMaster: Setting up 
> container launch container for 
> containerid=container_1415123350094_0017_01_000006
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
