[
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001309#comment-15001309
]
Jason Lowe commented on YARN-4051:
----------------------------------
Thanks for updating the patch!
Should the value be infinite by default? The concern is that if one container
has issues recovering (due to log aggregation woes or whatever), then we risk
expiring all of the containers on this node if we don't re-register with the RM
within the node expiry interval. I think an infinite default makes sense only
if we have also fixed the recovery paths so there aren't potentially
long-running procedures (like contacting HDFS) during the recovery process. If
we haven't, then we could create as many problems as we're solving by waiting
forever.
Why does the patch change the check interval? If it's to reduce the logging,
then we can better fix that by logging only when the status changes rather
than on every iteration.
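The log-on-change idea could look roughly like this; the class and method
names below are hypothetical, not from the patch:

```java
// Sketch: remember the last observed status and log only when it changes,
// so a fast check interval no longer floods the NM log.
public class RecoveryProgressLogger {
    private String lastStatus = null;

    /** Returns true if the status changed (and would therefore be logged). */
    public boolean maybeLog(String currentStatus) {
        if (currentStatus.equals(lastStatus)) {
            return false;   // same status as last iteration: stay quiet
        }
        lastStatus = currentStatus;
        System.out.println("Container recovery status: " + currentStatus);
        return true;
    }
}
```

With this, the check interval can stay short without producing one log line
per iteration.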
Nit: A value of zero should also be treated as a disabled max time.
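One way the zero-means-disabled interpretation could be expressed (helper and
class names here are invented for illustration, not from the patch):

```java
// Sketch: turn the configured max wait into an absolute deadline, treating
// any non-positive value (including zero) as "no limit".
public class RecoveryWaitPolicy {
    /**
     * Converts the configured max wait (ms) into an absolute deadline.
     * A value <= 0 disables the cap, i.e. wait indefinitely.
     */
    public static long deadlineMillis(long configuredMaxWaitMs, long nowMs) {
        if (configuredMaxWaitMs <= 0) {
            return Long.MAX_VALUE;   // disabled: no deadline
        }
        return nowMs + configuredMaxWaitMs;
    }

    /** True once the recovery wait has passed its deadline. */
    public static boolean expired(long nowMs, long deadlineMs) {
        return nowMs > deadlineMs;
    }
}
```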
Nit: "Max time to wait NM to complete container recover before register to RM "
should be "Max time NM will wait to complete container recovery before
registering with the RM".
> ContainerKillEvent is lost when container is In New State and is recovering
> ----------------------------------------------------------------------------
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: sandflee
> Assignee: sandflee
> Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch,
> YARN-4051.03.patch, YARN-4051.04.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in
> the NEW state; when we finish the application, the container is still alive
> even after the NM event dispatcher is unblocked.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)