[ 
https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087665#comment-14087665
 ] 

Jason Lowe commented on YARN-1337:
----------------------------------

I'm unable to reproduce these test failures locally.  Checking a few of the 
test failures shows they are likely all failing because the machine can't look 
up its own name, e.g.: java.net.UnknownHostException: asf901.ygridcore.net: 
asf901.ygridcore.net.  I'll work with ops to get the machine fixed and rekick 
Jenkins.
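
For anyone wanting to check a build host for this condition, here is a minimal 
sketch of a standalone probe.  It is not part of the patch or the test suite, 
and the class and method names (LocalHostCheck, describeLocalHost) are 
hypothetical.  It only assumes the failing tests reach 
java.net.InetAddress.getLocalHost() somewhere, which would fit the exception 
above but has not been confirmed here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic, not part of Hadoop: reports whether the JVM can
// resolve the local machine's own hostname.
public class LocalHostCheck {

    // Returns "resolved <name> -> <address>" on success, or
    // "unresolvable: <exception message>" if the lookup fails.
    static String describeLocalHost() {
        try {
            InetAddress addr = InetAddress.getLocalHost();
            return "resolved " + addr.getHostName() + " -> " + addr.getHostAddress();
        } catch (UnknownHostException e) {
            return "unresolvable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeLocalHost());
    }
}
```

On a correctly configured host this should print a "resolved" line; on a host 
whose name is missing from DNS and /etc/hosts it should print an 
"unresolvable" line instead.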

> Recover containers upon nodemanager restart
> -------------------------------------------
>
>                 Key: YARN-1337
>                 URL: https://issues.apache.org/jira/browse/YARN-1337
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1337-v1.patch
>
>
> To support work-preserving NM restart we need to recover the state of the 
> containers when the nodemanager went down.  This includes informing the RM of 
> containers that have exited in the interim and a strategy for dealing with 
> the exit codes from those containers along with how to reacquire the active 
> containers and determine their exit codes when they terminate.  The state of 
> finished containers also needs to be recovered.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
