[ 
https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862793#comment-16862793
 ] 

Abhishek Modi commented on YARN-9608:
-------------------------------------

Thanks [~tangzhankun] for going through the patch.
{quote} # If there's a long-running Spark shell application A of YARN cluster 
mode, only can the timeout cause the decommissioning node 1 (app A's container 
ran on it previously, but A's AM running on node 2) to shut down, right?{quote}
Yes, in this case only the timeout or the application finishing can cause the 
decommissioning to complete. This matches the behavior when the node is put 
in DECOMMISSIONING state while a container for app A is still running on it.
{quote} And if node 1 is shut down due to timeout, and when node 1 is 
re-registered in the future, will the node 1 still be considered belongs to 
running application A?
{quote}
No. If the node was shut down while no container was running on it, it won't 
be considered as belonging to app A. However, if work-preserving NodeManager 
restart was enabled and a container for app A was recovered on that node, the 
node will be considered to be running app A.
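
To make the two answers above concrete, here is a minimal sketch of the rule, not the code in the patch. All names (NodeSketch, DecommissionCheckSketch, readyToDecommission and the method names) are hypothetical and are not the real RMNode or DecommissioningNodesWatcher API. It assumes the node keeps its own set of running applications, that a re-registered node starts with an empty set, and that recovered containers add their application back.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the per-node state; NOT the real RMNode API.
class NodeSketch {
    private final Set<String> runningApps = new HashSet<>();

    // Record the app whenever one of its containers starts or is recovered
    // on this node, whether or not the node is decommissioning yet.
    void containerStartedOrRecovered(String appId) {
        runningApps.add(appId);
    }

    // Drop the app once it has finished.
    void appFinished(String appId) {
        runningApps.remove(appId);
    }

    // Assumption: a node that re-registers after shutdown starts with no apps;
    // only containers recovered afterwards add apps back.
    void reRegister() {
        runningApps.clear();
    }

    // Return a copy so callers cannot mutate the node's state.
    Set<String> getRunningApps() {
        return new HashSet<>(runningApps);
    }
}

// Hypothetical check a watcher could run for a decommissioning node.
class DecommissionCheckSketch {
    // Ready when no tracked app remains on the node or the timeout has elapsed.
    static boolean readyToDecommission(NodeSketch node, long elapsedMs, long timeoutMs) {
        return node.getRunningApps().isEmpty() || elapsedMs >= timeoutMs;
    }
}
```

With app A recorded on the node, the check only passes once the timeout elapses or appFinished is called for A. After reRegister, the node is no longer tied to A unless a container for A is recovered.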

> DecommissioningNodesWatcher should get lists of running applications on node 
> from RMNode.
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-9608
>                 URL: https://issues.apache.org/jira/browse/YARN-9608
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Abhishek Modi
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running 
> applications and triggers decommission of a node when all the applications 
> that ran on the node complete. This Jira proposes to solve the following 
> problems:
>  # DecommissioningNodesWatcher does not track application containers on a 
> node before the node is in DECOMMISSIONING state. It only tracks 
> containers once the node is in DECOMMISSIONING state. This can lead to 
> shuffle data loss for apps whose containers ran on this node before it was 
> moved to DECOMMISSIONING state.
>  # It keeps its own list of running apps, which can instead be obtained 
> directly from RMNode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

