[ 
https://issues.apache.org/jira/browse/YARN-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108576#comment-17108576
 ] 

Jim Brennan commented on YARN-9809:
-----------------------------------

I would like to revive this discussion. We have this implemented internally.
Our health-check script runs very quickly, so the impact on the time it takes
to register with the RM is minimal (not really noticeable in our case). The
script does a number of checks to validate the health of the host the NM is
running on. We don't have any checks directly related to the success or
failure of container launches since the last check, but even if we did, that
particular check just wouldn't find anything if no containers have been
launched yet.

The cases that we were trying to address with this change involve hardware or
OS issues with the node that may not prevent a container from launching, but
are serious enough to mark the node as unhealthy (memory/disk errors, etc.).
We have seen this during a rolling upgrade. Nodes that had previously been
marked as unhealthy would be brought up as part of the RU, and those nodes
would start running containers only to be marked unhealthy 10 minutes later
when the health-check script ran. This caused a lot of killed task attempts.
On large clusters there can be hundreds of unhealthy nodes, so the number of
failed task attempts can be large.

It seems the main question that [~eyang] is raising is whether we should allow
a synchronous call to run the health-check script during NodeManager
startup/registration. I agree that this can introduce a potential slowdown if
the health-check script is slow. In our case, the delay is not noticeable,
and we think it is worth it to prevent the false start. What do others think?
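
To make the trade-off concrete, here is a minimal sketch of a synchronous,
time-bounded health check that could run once before registration. It is not
the code we run internally and it uses no Hadoop classes: StartupHealthCheck,
HealthReport, and runOnce are hypothetical names, and the rule that an output
line starting with "ERROR" means unhealthy is an assumption of the sketch.
The timeout is what bounds the startup delay if the script is slow.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are Hadoop APIs. */
final class StartupHealthCheck {

    /** Health status that would accompany the registration request. */
    static final class HealthReport {
        final boolean healthy;
        final String reason;

        HealthReport(boolean healthy, String reason) {
            this.healthy = healthy;
            this.reason = reason;
        }
    }

    /**
     * Runs the health-check command once and blocks for at most timeoutMillis.
     * Assumption of this sketch: an output line starting with "ERROR" means
     * the node is unhealthy; a timeout or a launch failure also counts.
     */
    static HealthReport runOnce(List<String> command, long timeoutMillis) {
        Path output = null;
        Process process = null;
        try {
            // Capture output in a temp file so a chatty script cannot block on a full pipe.
            output = Files.createTempFile("nm-health", ".out");
            process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(output.toFile())
                    .start();
            if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return new HealthReport(false, "health script timed out");
            }
            for (String line : Files.readAllLines(output)) {
                if (line.startsWith("ERROR")) {
                    return new HealthReport(false, line);
                }
            }
            return new HealthReport(true, "");
        } catch (IOException e) {
            return new HealthReport(false, "health script could not run: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new HealthReport(false, "interrupted while waiting for health script");
        } finally {
            if (process != null) {
                process.destroyForcibly(); // harmless if the script already exited
            }
            if (output != null) {
                try {
                    Files.deleteIfExists(output);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }
}
```

A caller would invoke runOnce(...) before sending the registration request
and include the resulting status in it, so the RM can avoid scheduling
containers on a node that already reports itself unhealthy.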

cc: [~ebadger], [~eyang], [~ccondit-target], [~shaneku...@gmail.com]

> NMs should supply a health status when registering with RM
> ----------------------------------------------------------
>
>                 Key: YARN-9809
>                 URL: https://issues.apache.org/jira/browse/YARN-9809
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>
> Currently if the NM registers with the RM and it is unhealthy, it can be 
> scheduled many containers before the first heartbeat. After the first 
> heartbeat, the RM will mark the NM as unhealthy and kill all of the 
> containers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
