[
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862793#comment-16862793
]
Abhishek Modi commented on YARN-9608:
-------------------------------------
Thanks [~tangzhankun] for going through the patch:
{quote} # If there's a long-running Spark shell application A of YARN cluster
mode, only can the timeout cause the decommissioning node 1 (app A's container
ran on it previously, but A's AM running on node 2) to shut down, right?{quote}
Yes, in this case only the timeout or the completion of app A can cause the
decommissioning to complete. This is similar to the behavior when a node is
put into the DECOMMISSIONING state while a container for app A is still
running on it.
{quote} And if node 1 is shut down due to timeout, and when node 1 is
re-registered in the future, will the node 1 still be considered belongs to
running application A?
{quote}
No. If the node was shut down when no container was running on it, it won't
be considered as belonging to app A. However, if work-preserving NodeManager
restart was enabled and a container for app A was recovered on that node, the
node will be considered to be running app A.
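
To make the first answer concrete, here is a minimal sketch of the completion check described above. It is a simplified, self-contained model and not the actual Hadoop code: the class DecommissionSketch, the NodeView type, and the readyToDecommission method are hypothetical names used only for illustration, and application IDs are plain strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the check; not the actual Hadoop classes. */
final class DecommissionSketch {

    /** Stand-in for the per-node state that the issue proposes to read from RMNode. */
    static final class NodeView {
        // Apps that had a container on this node and have not finished yet.
        final Set<String> runningApps = new HashSet<>();
        final long decommissioningStartMillis;
        final long timeoutMillis;

        NodeView(long decommissioningStartMillis, long timeoutMillis) {
            this.decommissioningStartMillis = decommissioningStartMillis;
            this.timeoutMillis = timeoutMillis;
        }
    }

    /** Decommissioning completes on timeout or once no tracked app is still running. */
    static boolean readyToDecommission(NodeView node, long nowMillis) {
        boolean timedOut =
            nowMillis - node.decommissioningStartMillis >= node.timeoutMillis;
        return timedOut || node.runningApps.isEmpty();
    }

    public static void main(String[] args) {
        // Node 1 entered DECOMMISSIONING at t=0 with a 60 s timeout (assumed values).
        NodeView node1 = new NodeView(0L, 60_000L);
        node1.runningApps.add("appA"); // app A ran a container here earlier

        // App A is still running and the timeout has not elapsed: keep waiting.
        System.out.println(readyToDecommission(node1, 10_000L));

        // App A finishes: the node can now be decommissioned before the timeout.
        node1.runningApps.remove("appA");
        System.out.println(readyToDecommission(node1, 10_000L));
    }
}
```

In this model a long-running app A keeps node 1 in the DECOMMISSIONING state until the timeout elapses, even though none of its containers is currently running there.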
> DecommissioningNodesWatcher should get lists of running applications on node
> from RMNode.
> -----------------------------------------------------------------------------------------
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Abhishek Modi
> Assignee: Abhishek Modi
> Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications
> and triggers decommissioning of a node when all the applications that ran on
> the node have completed. This Jira proposes to solve the following problems:
> # DecommissioningNodesWatcher does not track application containers on a
> particular node before the node is in the DECOMMISSIONING state. It only tracks
> containers once the node is in the DECOMMISSIONING state. This can lead to
> shuffle data loss for apps whose containers ran on this node before it was
> moved to the DECOMMISSIONING state.
> # It keeps its own list of running apps. We can get this list directly from
> RMNode.