Peng Zhang commented on YARN-3535:

Following [~jlowe]'s comments, I understand there are two separate things here:
# During NM reconnection, the RM and NM should sync at the container level. In this 
issue's scenario, container 000004 should not be killed and rescheduled, so the AM 
can acquire it and launch it on the NM after the NM has registered.
# RMContainerImpl still needs a fix: restore the request on the transition from 
ALLOCATED to KILLED. A real NM loss can still cause a transition from ALLOCATED 
to KILLED, though with very small probability (the AM may heartbeat and acquire 
the container after the NM heartbeat has timed out).
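
To make the second item concrete, here is a minimal, self-contained sketch of the idea, not the actual RMContainerImpl code. All names in it (RestoreRequestSketch, PendingRequests, SketchContainer) are hypothetical stand-ins, and the request is modelled as a plain string. The only point is that a kill in the ALLOCATED state puts the original request back with the scheduler, while a kill after the AM has acquired the container does not.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RestoreRequestSketch {

    /** Simplified container states; the real state machine has more. */
    enum State { NEW, ALLOCATED, ACQUIRED, KILLED }

    /** Hypothetical stand-in for the scheduler's queue of pending requests. */
    static class PendingRequests {
        private final Deque<String> pending = new ArrayDeque<>();

        void add(String request) { pending.addLast(request); }

        String poll() { return pending.pollFirst(); }

        int size() { return pending.size(); }
    }

    /** Hypothetical stand-in for a container tracked by the RM. */
    static class SketchContainer {
        private final String originalRequest;
        private final PendingRequests scheduler;
        private State state = State.NEW;

        SketchContainer(String originalRequest, PendingRequests scheduler) {
            this.originalRequest = originalRequest;
            this.scheduler = scheduler;
        }

        /** Scheduler has assigned the container; the AM has not pulled it yet. */
        void allocate() {
            if (state != State.NEW) {
                throw new IllegalStateException("allocate from " + state);
            }
            state = State.ALLOCATED;
        }

        /** AM has pulled the container in a heartbeat. */
        void acquire() {
            if (state != State.ALLOCATED) {
                throw new IllegalStateException("acquire from " + state);
            }
            state = State.ACQUIRED;
        }

        /** Kill the container, for example because its NM was lost. */
        void kill() {
            if (state == State.KILLED) {
                return; // already killed, nothing to restore
            }
            if (state == State.ALLOCATED) {
                // The AM never saw this container, so the request it was
                // allocated for must go back to the scheduler.
                scheduler.add(originalRequest);
            }
            state = State.KILLED;
        }

        State getState() { return state; }
    }

    public static void main(String[] args) {
        PendingRequests scheduler = new PendingRequests();
        scheduler.add("request-for-container-000004");

        SketchContainer c = new SketchContainer(scheduler.poll(), scheduler);
        c.allocate();
        c.kill(); // NM lost before the AM acquired the container

        System.out.println("state=" + c.getState()
            + " pending=" + scheduler.size());
    }
}
```

Without the restore step in kill(), the pending queue would stay empty after the kill and the AM would wait for a container that is never rescheduled, which matches the hang described in this issue.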

I think the first item is an improvement that avoids wasting time and the 
scheduling work already done. Or have I misunderstood something?

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>         Attachments: syslog.tgz, yarn-app.log
> During a rolling update of the NM, the AM failed to start a container on the NM, 
> and the job then hung.
> AM logs are attached.

This message was sent by Atlassian JIRA
