[ https://issues.apache.org/jira/browse/YARN-9608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862789#comment-16862789 ]
Zhankun Tang commented on YARN-9608:
------------------------------------
[~abmodi], thanks. I just read through the whole patch. Two questions:
1. If there is a long-running Spark shell application A in YARN cluster mode,
only the timeout can cause decommissioning node 1 to shut down (app A's
containers ran on it previously, but A's AM is running on node 2), right?
2. If node 1 is shut down due to the timeout and re-registers in the future,
will node 1 still be considered as belonging to running application A?
> DecommissioningNodesWatcher should get lists of running applications on node
> from RMNode.
> -----------------------------------------------------------------------------------------
>
> Key: YARN-9608
> URL: https://issues.apache.org/jira/browse/YARN-9608
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Abhishek Modi
> Assignee: Abhishek Modi
> Priority: Major
> Attachments: YARN-9608.001.patch, YARN-9608.002.patch
>
>
> At present, DecommissioningNodesWatcher tracks the list of running applications
> and triggers decommission of a node when all the applications that ran on the
> node complete. This Jira proposes to solve the following problems:
> # DecommissioningNodesWatcher skips tracking application containers on a
> particular node before the node is in DECOMMISSIONING state. It only tracks
> containers once the node is in DECOMMISSIONING state. This can lead to
> shuffle data loss for apps whose containers ran on this node before it was
> moved to DECOMMISSIONING state.
> # It keeps its own list of running apps. We can get this directly from
> RMNode.
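
The behaviour described above can be sketched roughly as follows. This is a
simplified model, not the actual YARN code: the class DecommissionSketch, the
NodeView type, and all method names are hypothetical stand-ins, and the only
assumptions taken from this thread are that a decommissioning node is released
once every application that ran on it has completed, or once a timeout expires.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the proposal; not the real YARN classes. */
public class DecommissionSketch {

    /** Stand-in for the per-node state the ResourceManager keeps (RMNode in the issue). */
    public static final class NodeView {
        // Apps that ran containers on this node and have not completed. Recorded from
        // the first container, not only once the node is DECOMMISSIONING.
        private final Set<String> runningApps = new HashSet<>();
        // -1 means the node is not in DECOMMISSIONING state.
        private long decommissioningStartMillis = -1;

        public void containerStarted(String appId) { runningApps.add(appId); }
        public void appFinished(String appId) { runningApps.remove(appId); }
        public void startDecommissioning(long nowMillis) { decommissioningStartMillis = nowMillis; }
        public boolean isDecommissioning() { return decommissioningStartMillis >= 0; }
        public Set<String> getRunningApps() { return runningApps; }
    }

    /**
     * The watcher reads the app list from the node view instead of keeping its own.
     * A decommissioning node is ready when no app that ran on it is still running,
     * or when the timeout has elapsed.
     */
    public static boolean readyToDecommission(NodeView node, long nowMillis, long timeoutMillis) {
        if (!node.isDecommissioning()) {
            return false;
        }
        boolean timedOut = nowMillis - node.decommissioningStartMillis >= timeoutMillis;
        return node.getRunningApps().isEmpty() || timedOut;
    }

    public static void main(String[] args) {
        long timeout = 5_000L;
        NodeView node1 = new NodeView();

        // App A runs a container on node 1 BEFORE the node starts decommissioning.
        node1.containerStarted("appA");
        node1.startDecommissioning(0L);

        // App A is long-running, so only the timeout can release node 1.
        System.out.println("before timeout: " + readyToDecommission(node1, 1_000L, timeout));
        System.out.println("after timeout:  " + readyToDecommission(node1, 5_000L, timeout));
    }
}
```

In this model the long-running application from question 1 keeps node 1 in the
decommissioning state until the timeout, because app A is recorded on the node
from its first container. What should happen to that record when node 1
re-registers (question 2) is left open here.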