Jason Lowe commented on YARN-1337:

I'm unable to reproduce these test failures locally.  Checking a few of the 
test failures shows they are likely all failing because the machine can't look 
up its own name, e.g.: java.net.UnknownHostException: asf901.ygridcore.net: 
asf901.ygridcore.net.  I'll work with ops to get the machine fixed and rekick 
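
For anyone who wants to confirm this kind of failure on a build host, a minimal 
sketch of a self-lookup check follows.  The class and method names here are 
hypothetical and are not part of the patch or the test suite; it only assumes 
the standard java.net.InetAddress API.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of YARN-1337: checks name resolution on a host.
class SelfLookupCheck {

    // Returns true if the given host name (or literal IP) resolves to an address.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        try {
            // getLocalHost() looks up the machine's own host name and throws
            // UnknownHostException when that name cannot be resolved.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("self-lookup ok: " + local.getHostName());
        } catch (UnknownHostException e) {
            System.out.println("self-lookup failed: " + e.getMessage());
        }
    }
}
```

Running main on the affected machine should report a failed self-lookup if the 
diagnosis above is right, and a successful one once the host configuration is 
fixed.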

> Recover containers upon nodemanager restart
> -------------------------------------------
>                 Key: YARN-1337
>                 URL: https://issues.apache.org/jira/browse/YARN-1337
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1337-v1.patch
> To support work-preserving NM restart we need to recover the state of the 
> containers when the nodemanager went down.  This includes informing the RM of 
> containers that have exited in the interim and a strategy for dealing with 
> the exit codes from those containers along with how to reacquire the active 
> containers and determine their exit codes when they terminate.  The state of 
> finished containers also needs to be recovered.

This message was sent by Atlassian JIRA