[ 
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001725#comment-15001725
 ] 

sandflee commented on YARN-4051:
--------------------------------

Thanks [~jlowe]

Should the value be infinite by default? The concern is that if one container 
has issues recovering (due to log aggregation woes or whatever) then we risk 
expiring all of the containers on this node if we don't re-register with the RM 
within the node expiry interval. I think it makes sense if we have also fixed 
the recovery paths so there aren't potentially long-running procedures (like 
contacting HDFS) during the recovery process. If we haven't then we could 
create as many problems as we're solving by waiting forever.
-- Agreed! I share this concern.

Why does the patch change the check interval? If it's to reduce the logging 
then we can better fix that by only logging when the status changes rather than 
every iteration.
-- Yes, it was to reduce the logging, since recovery is almost always very 
fast. Changing it back.
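
A minimal sketch of the "log only when the status changes" idea suggested 
above. The class and method names (StatusChangeLogger, logIfChanged) are 
hypothetical and are not taken from the YARN-4051 patch; the collected lines 
stand in for a real logger call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the actual NodeManager code.
class StatusChangeLogger {
    private String lastStatus = null;
    // Stand-in for a real logger: collects the lines that would be logged.
    private final List<String> lines = new ArrayList<>();

    /** Logs the status only if it differs from the last status seen. */
    boolean logIfChanged(String status) {
        if (status.equals(lastStatus)) {
            return false; // same status as the previous iteration: stay quiet
        }
        lastStatus = status;
        lines.add("Container recovery status: " + status);
        return true;
    }

    List<String> lines() {
        return lines;
    }
}
```

With this shape the check interval can stay short, because repeated 
iterations in the same state produce no additional log lines.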

Nit: A value of zero should also be treated as a disabled max time.
-- Zero means registering with the RM at once, whether or not the NM has 
completed recovery, yes?
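
For illustration only, a sketch of one possible reading of the reviewer's 
nit, in which a max time of zero or less disables the limit, so the NM waits 
until recovery completes. This is an assumption: the meaning of zero is 
exactly what is being asked above, and the names (RecoveryWait, 
waitForRecovery, maxWaitMs, checkIntervalMs) are hypothetical rather than 
taken from the patch.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the actual NodeManager recovery code.
class RecoveryWait {
    /**
     * Polls until recovery completes or maxWaitMs elapses.
     * Assumption: maxWaitMs <= 0 means "no limit" (wait indefinitely).
     * Returns true if recovery completed, false on timeout or interrupt.
     */
    static boolean waitForRecovery(BooleanSupplier recovered,
                                   long maxWaitMs,
                                   long checkIntervalMs) {
        final boolean unlimited = maxWaitMs <= 0;
        final long maxWaitNanos = TimeUnit.MILLISECONDS.toNanos(Math.max(maxWaitMs, 0));
        final long start = System.nanoTime();
        while (!recovered.getAsBoolean()) {
            if (!unlimited && System.nanoTime() - start >= maxWaitNanos) {
                return false; // give up waiting; caller registers with the RM anyway
            }
            try {
                Thread.sleep(checkIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

Under the other reading, where zero means "register immediately", the 
unlimited flag would instead be tied to negative values only.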

Nit: "Max time to wait NM to complete container recover before register to RM " 
should be "Max time NM will wait to complete container recovery before 
registering with the RM".
-- corrected



> ContainerKillEvent is lost when container is  In New State and is recovering
> ----------------------------------------------------------------------------
>
>                 Key: YARN-4051
>                 URL: https://issues.apache.org/jira/browse/YARN-4051
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: sandflee
>            Assignee: sandflee
>            Priority: Critical
>         Attachments: YARN-4051.01.patch, YARN-4051.02.patch, 
> YARN-4051.03.patch, YARN-4051.04.patch
>
>
> As in YARN-4050, when the NM event dispatcher is blocked and a container is 
> in the New state, finishing the application leaves the container alive even 
> after the NM event dispatcher is unblocked.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
