[
https://issues.apache.org/jira/browse/YARN-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854538#comment-15854538
]
Naganarasimha G R commented on YARN-6147:
-----------------------------------------
[~varun_saxena], there are 2 cases in which the NM becomes inaccessible:
# AMLivelinessMonitor expires the AM when AMLauncher launches the AM
successfully but the NM goes down by the time the AM process is started.
# The AM container is allocated to a node and the NM later becomes unhealthy
(abrupt shutdown), so AMLauncher fails to connect to the NM and launch the
container.
Earlier, allocation to the same NM (before NMLivelinessMonitor expires the NM)
was rarely possible (only in cases where the NM faces a transient failure),
because scheduling to a node happens only on heartbeat. With the advent of
Global Scheduling we will come across it more often.
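To make the event ordering concrete, here is a minimal toy sketch of the two
cases, using the state and event names from the issue description quoted
below. It is not the real RMAppAttemptImpl state machine: the class
AmBlacklistSketch, the simulate method and the STATE_SAVED event are
hypothetical, and it assumes that a node is blacklisted only while handling
CONTAINER_FINISHED in a non-final state.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of the event ordering only; NOT the real RMAppAttemptImpl.
class AmBlacklistSketch {
    enum State { ALLOCATED, LAUNCHED, FINAL_SAVING, FINAL }
    // STATE_SAVED is a hypothetical stand-in for "the final state was stored".
    enum Event { LAUNCH_FAILED, EXPIRE, CONTAINER_FINISHED, STATE_SAVED }

    // Replays events for one attempt whose AM container was placed on `node`
    // and returns the set of nodes that ended up blacklisted.
    static Set<String> simulate(State start, String node, Event... events) {
        Set<String> blacklist = new TreeSet<>();
        State state = start;
        for (Event event : events) {
            switch (state) {
                case ALLOCATED:
                case LAUNCHED:
                    // Assumption: only CONTAINER_FINISHED leads to blacklisting.
                    if (event == Event.CONTAINER_FINISHED) {
                        blacklist.add(node);
                    }
                    if (event != Event.STATE_SAVED) {
                        state = State.FINAL_SAVING;
                    }
                    break;
                case FINAL_SAVING:
                    // The toy does not model late events here; it only waits
                    // for the save to complete.
                    if (event == Event.STATE_SAVED) {
                        state = State.FINAL;
                    }
                    break;
                case FINAL:
                    // Every event, including CONTAINER_FINISHED, is ignored.
                    break;
            }
        }
        return blacklist;
    }

    public static void main(String[] args) {
        // Case 2 of the comment: ALLOCATED + LAUNCH_FAILED, container event arrives late.
        System.out.println(simulate(State.ALLOCATED, "nm1",
            Event.LAUNCH_FAILED, Event.STATE_SAVED, Event.CONTAINER_FINISHED));
        // Case 1 of the comment: LAUNCHED + EXPIRE, container event arrives late.
        System.out.println(simulate(State.LAUNCHED, "nm1",
            Event.EXPIRE, Event.STATE_SAVED, Event.CONTAINER_FINISHED));
        // For contrast: CONTAINER_FINISHED arrives while the attempt is still LAUNCHED.
        System.out.println(simulate(State.LAUNCHED, "nm1",
            Event.CONTAINER_FINISHED, Event.STATE_SAVED));
    }
}
```

In this toy model the first two runs leave the blacklist empty because
CONTAINER_FINISHED arrives only after the attempt has reached FINAL, while the
third run blacklists nm1. A real fix would have to decide where the node
information gets recorded; the sketch takes no position on that.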
> Blacklisting nodes not happening for AM containers
> --------------------------------------------------
>
> Key: YARN-6147
> URL: https://issues.apache.org/jira/browse/YARN-6147
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
>
> Blacklisting of nodes does not happen in the following scenarios:
> 1. RMAppAttempt is in ALLOCATED and a LAUNCH_FAILED event comes when the NM is down.
> 2. RMAppAttempt is in LAUNCHED and an EXPIRE event comes when the NM is down.
> In both these cases the AppAttempt goes to *FINAL_SAVING* and eventually to
> the *FINAL* state before the *CONTAINER_FINISHED* event is triggered by
> {{RMContainerImpl}}, and in the {{FINAL}} state the {{CONTAINER_FINISHED}} event
> is ignored.