[ 
https://issues.apache.org/jira/browse/YARN-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602892#comment-13602892
 ] 

Eli Reisman commented on YARN-477:
----------------------------------

Duh. Too many issues fixed all at once, and they are all running together in my 
mind. OK, going over this again: this is happening during my integration tests 
with MiniYARNCluster, not on the real cluster.

So perhaps the real YARN implementation handles propagating the error to the 
client, RM, etc. when the command line the client uses to launch the container 
for the AM fails. I think it's the MiniYARNCluster that is not handling this 
situation correctly.

Again, the issue is:

Client starts fine. It creates the AMContainerSpec stuff and tries to request 
the AM container. This request includes the shell command to launch our AM in 
the container. The container shows up as granted and provisioned by the RM, but 
from there the client hangs waiting for job success/fail, saying it has "1 
container used" the whole time (the failed AM container). What seems to be 
happening is that this shell command fails to launch the AM in its container, 
so the container just sits there forever. Let's check this in MiniYARNCluster 
and see.
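
A minimal sketch of a client-side guard against this kind of hang, assuming 
the client polls for the application state in a loop. Every name here 
(AmLaunchWatchdog, AppState, Outcome, awaitLaunch) is hypothetical and not 
part of the YARN or Giraph APIs; the Supplier stands in for whatever call the 
client really uses to fetch the application report. It only illustrates 
bounding the wait in ACCEPTED so the client can give up instead of sitting 
there until CTRL-C.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical watchdog; none of these types come from the YARN API. */
final class AmLaunchWatchdog {

    /** Simplified stand-in for the application states a client might see. */
    enum AppState { ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /** What the watchdog concluded about the AM launch. */
    enum Outcome { LAUNCHED, TERMINATED, STUCK_IN_ACCEPTED, INTERRUPTED }

    /**
     * Polls the (assumed) state source until the app leaves ACCEPTED
     * or the timeout expires, whichever comes first.
     */
    static Outcome awaitLaunch(Supplier<AppState> stateSource,
                               Duration timeout,
                               Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            AppState state = stateSource.get();
            switch (state) {
                case RUNNING:
                    return Outcome.LAUNCHED;
                case FINISHED:
                case FAILED:
                case KILLED:
                    return Outcome.TERMINATED;
                default:
                    break; // still ACCEPTED, keep waiting
            }
            // Overflow-safe deadline check.
            if (System.nanoTime() - deadline >= 0) {
                return Outcome.STUCK_IN_ACCEPTED;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return Outcome.INTERRUPTED;
            }
        }
    }
}
```

On STUCK_IN_ACCEPTED the client could log the launch command it sent and then 
kill the application. This does not fix the missing error propagation; it only 
keeps the client from waiting forever.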

I will try to "break" the Giraph MiniYARNCluster test again, recreate some 
decent log traces leading up to the event, and post them here. Thanks!
                
> When default container executor fails right away, at the CLI launching our 
> App Master, Client doesn't always get the signal to kill the job
> -------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-477
>                 URL: https://issues.apache.org/jira/browse/YARN-477
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eli Reisman
>            Assignee: Zhijie Shen
>
> I have been porting Giraph to YARN (GIRAPH-13 is the issue) and when I launch 
> my App Master, if the container command line runs it successfully, any 
> failure in the App Master or my launched Giraph Tasks promptly reports to 
> Client and ends my job run. However, if the command line sent to the app 
> master container fails to launch it at all, the error exit code is not 
> propagating. My client hangs with the job at containersUsed == 1 and state == 
> ACCEPTED for as long as you want to sit and wait before CTRL-C'ing your way 
> out.
> Disclaimer: this could be my fault. But I wanted to throw it out there in 
> case it's not. I am also (when this happens) not getting error logs since the 
> app master never launched, so I really have no visibility into why it failed 
> to launch. I am sure it's not launching, but the client IS sending the app 
> request, getting a container for my AM, and I see the command line run on the 
> container in my logs. That's all.
> Thanks! If this is a dup or "won't fix" for some reason, let me know and 
> sorry for wasting your time!

