[ 
https://issues.apache.org/jira/browse/YARN-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947326#comment-15947326
 ] 

Jason Lowe commented on YARN-6168:
----------------------------------

This sounds like the RM isn't waiting long enough for all the live NMs to 
report in before reporting the live containers to the app.  Technically, it 
would have to wait up to the full NM expiry interval before it could know for 
sure that no more containers are going to be reported by late-heartbeating 
NMs, so one fix would be to hold off AM restarts of container-preserving apps 
after an RM restart until the NM expiry interval has passed since the 
restart.  However, I don't know if apps are willing to wait that long before 
their AM recovers.  If not, then there is always going to be the possibility 
that not all live containers are reported when the AM restarts and registers, 
if an NM ends up heartbeating late.
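
To make the proposed hold-off concrete, here is a minimal sketch of the timing 
check only.  AmRestartGate and its methods are hypothetical names used for 
illustration and are not part of YARN; the sketch assumes the RM records its 
own start time and that the NM expiry interval is the value configured by 
yarn.nm.liveness-monitor.expiry-interval-ms (600000 ms by default, as far as 
I know).

```java
/**
 * Hypothetical helper (not a YARN class): decides whether enough time has
 * passed since an RM restart to launch a new AM attempt for an app that
 * keeps containers across attempts.
 */
final class AmRestartGate {
    private final long rmStartMillis;   // wall-clock time the RM came back up
    private final long nmExpiryMillis;  // assumed NM expiry interval

    AmRestartGate(long rmStartMillis, long nmExpiryMillis) {
        if (nmExpiryMillis < 0) {
            throw new IllegalArgumentException("nmExpiryMillis must be >= 0");
        }
        this.rmStartMillis = rmStartMillis;
        this.nmExpiryMillis = nmExpiryMillis;
    }

    /** Milliseconds still to wait before every live NM must have reported. */
    long remainingDelayMillis(long nowMillis) {
        // Clamp so a clock that appears to run backwards never extends the wait.
        long elapsed = Math.max(0L, nowMillis - rmStartMillis);
        return Math.max(0L, nmExpiryMillis - elapsed);
    }

    /** True once a full NM expiry interval has passed since the RM restart. */
    boolean canLaunchAttempt(long nowMillis) {
        return remainingDelayMillis(nowMillis) == 0L;
    }
}
```

With the default interval, an AM attempt would be held for up to ten minutes 
after the RM restart, which is the latency tradeoff raised above.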




> Restarted RM may not inform AM about all existing containers
> ------------------------------------------------------------
>
>                 Key: YARN-6168
>                 URL: https://issues.apache.org/jira/browse/YARN-6168
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Billie Rinaldi
>
> There appears to be a race condition when an RM is restarted. I had a 
> situation where the RMs and AM were down, but NMs and app containers were 
> still running. When I restarted the RM, the AM restarted, registered with the 
> RM, and received its list of existing containers before the NMs had reported 
> all of their containers to the RM. The AM was only told about some of the 
> app's existing containers.


