[ 
https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-2612:
---------------------------
    Attachment:     (was: YARN-2612.patch)

> Some completed containers are not reported to NM
> ------------------------------------------------
>
>                 Key: YARN-2612
>                 URL: https://issues.apache.org/jira/browse/YARN-2612
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>             Fix For: 2.6.0
>
>
> We are testing RM work-preserving restart and found the following logs when 
> we ran a simple MapReduce task "PI". Some completed containers that had 
> already been pulled by the AM were never acked back to the NM, so the NM 
> kept reporting those completed containers even after the AM had finished.
> {code}
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:42,228 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:43,230 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> 2014-09-26 17:00:44,233 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> {code}
> With YARN-1372, NM keeps reporting completed containers to RM until it gets 
> an ACK from RM. If AM does not call allocate, AM does not ack RM, and so RM 
> does not ack NM. We ([~chenchun]) have observed these two cases when 
> running the MapReduce task 'pi':
> 1) RM sends completed containers to AM. After receiving them, AM considers 
> its work done and needs no more resources, so it does not call allocate 
> again.
> 2) When AM finishes, it cannot ack RM, because AM itself has not finished 
> yet.
> To solve this problem, we have two candidate solutions:
> 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so 
> RM could then send this AppAttempt's completed containers to NM.
> 2) In FairScheduler#nodeUpdate, if a completed container sent by NM does 
> not have a corresponding RMContainer, RM just acks it to NM.
> We prefer solution 2 because it is clearer and more concise. However, RM 
> might ack the same completed containers to NM many times.
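
The report/ack loop and solution 2 described above can be sketched with a small toy model. This is not YARN code and not the attached patch: `AckSketch`, `ToyRm`, `amAllocate`, `simulate`, and the container IDs are hypothetical names made up for illustration, and the real `FairScheduler#nodeUpdate` has a different signature. The sketch only assumes what the description states: NM re-sends completed containers until RM acks them, RM normally acks only after AM calls allocate, and solution 2 acks any completed container that no longer has a tracked RMContainer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the NM -> RM completed-container report/ack loop. Not YARN code. */
final class AckSketch {

    /** Hypothetical stand-in for the RM side of a node heartbeat. */
    static final class ToyRm {
        // Containers the scheduler still tracks (stand-in for RMContainer).
        private final Set<String> tracked = new HashSet<>();
        // Completed containers waiting for the AM to pull them via allocate.
        private final List<String> awaitingAm = new ArrayList<>();
        // Containers the AM has pulled; acked to NM on the next heartbeat.
        private final List<String> readyToAck = new ArrayList<>();
        // Solution 2 toggle: ack completed containers the scheduler no longer knows.
        private final boolean ackUnknown;

        ToyRm(boolean ackUnknown, List<String> running) {
            this.ackUnknown = ackUnknown;
            tracked.addAll(running);
        }

        /** AM pulls completed containers, which lets RM ack them to NM later. */
        void amAllocate() {
            readyToAck.addAll(awaitingAm);
            awaitingAm.clear();
        }

        /** Handles one NM heartbeat; returns the IDs NM may stop reporting. */
        List<String> nodeUpdate(List<String> completed) {
            List<String> acks = new ArrayList<>(readyToAck);
            readyToAck.clear();
            for (String id : completed) {
                if (tracked.remove(id)) {
                    // First report: hand over to the AM, no ack yet.
                    awaitingAm.add(id);
                } else if (ackUnknown) {
                    // The "Null container completed..." case: ack right away.
                    // An ID may appear twice in acks, matching the caveat above.
                    acks.add(id);
                }
            }
            return acks;
        }
    }

    /** Returns how many containers NM is still reporting after the heartbeats. */
    static int simulate(boolean ackUnknown, boolean amCallsAllocate, int heartbeats) {
        List<String> containers = Arrays.asList("container_1", "container_2");
        ToyRm rm = new ToyRm(ackUnknown, containers);
        Set<String> nmPending = new LinkedHashSet<>(containers);
        for (int i = 0; i < heartbeats; i++) {
            List<String> acks = rm.nodeUpdate(new ArrayList<>(nmPending));
            nmPending.removeAll(acks);
            if (amCallsAllocate) {
                rm.amAllocate();
            }
        }
        return nmPending.size();
    }

    public static void main(String[] args) {
        System.out.println("AM never allocates, no fix:     " + simulate(false, false, 5));
        System.out.println("AM never allocates, solution 2: " + simulate(true, false, 5));
        System.out.println("AM allocates, no fix:           " + simulate(false, true, 5));
    }
}
```

In this toy model, NM keeps reporting both containers when AM never calls allocate and the fix is off, and stops after the second heartbeat once unknown containers are acked directly.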



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
