[
https://issues.apache.org/jira/browse/YARN-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001309#comment-15001309
]
Jason Lowe commented on YARN-4051:
----------------------------------
Thanks for updating the patch!
Should the value be infinite by default? The concern is that if one container
has issues recovering (due to log aggregation woes or whatever), then we risk
expiring all of the containers on this node if we don't re-register with the RM
within the node expiry interval. I think an infinite default makes sense only
if we have also fixed the recovery paths so there aren't potentially
long-running procedures (like contacting HDFS) during the recovery process. If
we haven't, then we could create as many problems as we're solving by waiting
forever.
Why does the patch change the check interval? If it's to reduce the logging,
then we can better fix that by logging only when the status changes rather
than on every iteration.
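The log-on-change idea could look roughly like this; the class and method
names below are hypothetical, not from the patch:

```java
// Sketch: remember the last observed status and log only when it changes,
// so a fast check interval no longer floods the NM log.
public class RecoveryProgressLogger {
    private String lastStatus = null;

    /** Returns true if the status changed (and would therefore be logged). */
    public boolean maybeLog(String currentStatus) {
        if (currentStatus.equals(lastStatus)) {
            return false;   // same status as last iteration: stay quiet
        }
        lastStatus = currentStatus;
        System.out.println("Container recovery status: " + currentStatus);
        return true;
    }
}
```

With this, the check interval can stay short without producing one log line
per iteration.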
Nit: A value of zero should also be treated as a disabled max time.
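One way the zero-means-disabled interpretation could be expressed (helper and
class names here are invented for illustration, not from the patch):

```java
// Sketch: turn the configured max wait into an absolute deadline, treating
// any non-positive value (including zero) as "no limit".
public class RecoveryWaitPolicy {
    /**
     * Converts the configured max wait (ms) into an absolute deadline.
     * A value <= 0 disables the cap, i.e. wait indefinitely.
     */
    public static long deadlineMillis(long configuredMaxWaitMs, long nowMs) {
        if (configuredMaxWaitMs <= 0) {
            return Long.MAX_VALUE;   // disabled: no deadline
        }
        return nowMs + configuredMaxWaitMs;
    }

    /** True once the recovery wait has passed its deadline. */
    public static boolean expired(long nowMs, long deadlineMs) {
        return nowMs > deadlineMs;
    }
}
```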
Nit: "Max time to wait NM to complete container recover before register to RM "
should be "Max time NM will wait to complete container recovery before
registering with the RM".
> ContainerKillEvent is lost when container is In New State and is recovering
> ----------------------------------------------------------------------------
>
> Key: YARN-4051
> URL: https://issues.apache.org/jira/browse/YARN-4051
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: sandflee
> Assignee: sandflee
> Priority: Critical
> Attachments: YARN-4051.01.patch, YARN-4051.02.patch,
> YARN-4051.03.patch, YARN-4051.04.patch
>
>
> As in YARN-4050, the NM event dispatcher is blocked and the container is in
> the NEW state; when we finish the application, the container is still alive
> even after the NM event dispatcher is unblocked.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)