Junping Du commented on YARN-914:

Thanks for review and comments, [~xgong], [~jlowe] and [~mingma]!

bq. I believe this is about the configuration synchronization between multiple 
RM nodes. Please take a look at 
https://issues.apache.org/jira/browse/YARN-1666, and 
Thanks for pointing this out. Sounds like we already resolve most of the 
problem, good to know it. :)

bq. Do we really need to handle the "LRS containers" and "short-term 
containers" differently? There are lots of different cases we need to take 
care. I think that we can just use the same way to handle both.
I haven't think through this yet. IMO, the benefit of this feature is to 
provide a reasonable time window for running applications get chance to finish 
before nodes get decommissioned. Given the endless live cycle of LRS 
containers, I didn't see the benefit to keep LRS containers running until 
timeout but only delay the decommission process. Or we assume AM can do some 
react to LRS containers when get notified? May be for the first step, we can do 
the same thing for LRS and non-LRS containers to keep it simple. But I think we 
should keep mind open for this.

bq. Maybe we need to track the timeout at RM side and NM side. RM can stop NM 
if the timeout is reached but it does not receive the "decommission complete" 
from NM.
Sounds reasonable given possible broken communication between NM and RM. 
However, as Jason Lowe proposed below, we can only track in RM side. Thoughts?

bq. For transferring knowledge to the standby RM, we could persist the graceful 
decomm node list to the state store.
Yes. Sounds like most of work already be done in YARN-1666 (decommission node 
list) and YARN-1611 (timeout value) like @Xuan mentioned above. The only left 
work here is to keep track of start time of each decommissioning node. Isn't it?

bq. I agree with Xuan that so far I don't see a need to treat LRS and normal 
containers separately. Either a container exits before the decommission timeout 
or it doesn't.
Just like we want decommission happen before timeout if all containers and apps 
are finished, we don't want unnecessary time cost for delay the decommission 
process. Isn't it? However, it could be the other case if we think the delay 
can help to LRS application. Anyway, like mentioned above, it should be fine to 
keep the same behavior for now but I think we may need to keep mind on it.  

bq. Just to be clear, the NM is already tracking which applications are active 
on a node and is reporting these to the RM on heartbeats (see NM context and 
NodeStatusUpdaterImpl appTokenKeepAliveMap). The DecommissionService doesn't 
need to explicitly track the apps itself as this is already being done.
Yes. The diagram not only include the new components but also existing 
components. Thanks for reminding on this though. 

bq. As for doing this RM side or NM side, I think it can simplify things if we 
do this on the RM side. The RM already needs to know about graceful 
decommission to avoid scheduling new apps/containers on the node. Also the NM 
is heartbeating active apps back to the RM, so it's easy for the RM to track 
which apps are still active on a particular node. If the RMNodeImpl state 
machine sees that it's in the decommissioning state and all apps/containers 
have completed then it can transition to the decommissioned state. For timeouts 
the RM can simply set a timer-delivered event to the RMNode when the graceful 
decommission starts, and the RMNode can act accordingly when the timer event 
arrives, killing containers etc. Actually I'm not sure the NM needs to know 
about graceful decommission at all, which IMHO simplifies the design since only 
one daemon needs to participate and be knowledgeable of the feature. The NM 
would simply see the process as a reduction in container assignments until 
eventually containers are killed and the RM tells it that it's decommissioned.
That make sense. In addition, I think even RMNode doesn't have to track time 
themselves (or in worst case, thousands of threads need to access time), and we 
can have something like DecommssionTimeoutMonitor that derived from 
AbstractLivelinessMonitor. When detected timeout, it can send out 
decommssion_timeout event to RMNode to make node shutdown happens. Also, I 
agree that NM may not necessary to aware of this decommission_in_process.

bq. To clarify decomm node list, it appears there are two things, one is the 
decomm request list; another one is the run time state of the decomm nodes. 
From Xuan's comment it appears we want to put the request in HDFS and leverage 
FileSystemBasedConfigurationProvider to read it at run time. Given it is 
considered configuration, that seems a good fit. Jason mentioned the state 
store , that can be used to track the run time state of the decomm. This is 
necessary given we plan to introduce timeout for graceful decommission. 
However, if we assume ResouceOption's overcommitTimeout state is stored in 
state store for RM failover case as part YARN-291, then the new active RM can 
just replay the state transition. If so, it seems we don't need to persist 
decomm run time state to state store.
I think we may assume timeout (value specified in yarn properties) is 
consistent across the RM's failure over and we need to track each node's start 
time of decommissioning. It reminds me that putting timeout value on 
yarn-site.xml may not be a good idea if we want to update it at runtime as RM 
only load yarn-site configuration during its start. However, I think we can 
override this value with new parameters in command line of Refresh node.  

bq. Alternatively we can remove graceful decommission timeout for YARN layer 
and let external decommission script handle that. If the script considers the 
graceful decommission takes too long, it can ask YARN to do the immediate 
Aha~~, that make things much simpler. However, if we don't have that script, it 
could make decommission work looks like an endless process. May be we shouldn't 
assume some external script can be there? No?

bq. BTW, it appears fair scheduler doesn't support ConfigurationProvider.
Not sure. Need to check it later. May be some FairScheduler guys can comment 

bq. Recommission is another scenario. It can happen when node is in 
decommissioned state or decommissioned_in_progress state.
Agree. It should be addressed already in document.

> Support graceful decommission of nodemanager
> --------------------------------------------
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf
> When NMs are decommissioned for non-fault reasons (capacity change etc.), 
> it's desirable to minimize the impact to running applications.
> Currently if a NM is decommissioned, all running containers on the NM need to 
> be rescheduled on other NMs. Further more, for finished map tasks, if their 
> map output are not fetched by the reducers of the job, these map tasks will 
> need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.

This message was sent by Atlassian JIRA

Reply via email to