[
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400947#comment-15400947
]
Daniel Zhi commented on YARN-4676:
----------------------------------
Transcript of recent design discussions conducted over email:
From: Junping Du [mailto:[email protected]]
Sent: Friday, July 29, 2016 10:40 AM
To: Robert Kanter; Zhi, Daniel
Cc: Karthik Kambatla; Ming Ma
Subject: Re: YARN-4676 discussion
The plan sounds reasonable to me. Thanks, Robert, for coordinating on this.
I just committed YARN-5434 to branch-2.8. So Daniel, please start your rebase work
at your earliest convenience - I will try out the functionality and review the
code as well. About the previous questions, it is better to have such discussions
(with technical details) on the JIRA so we won't lose track of them in the future,
and it helps raise the visibility of your work, as more people will sooner or
later watch this.
________________________________________
From: Robert Kanter <[email protected]>
Sent: Friday, July 29, 2016 2:20 AM
To: Zhi, Daniel
Cc: Karthik Kambatla; Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion
Sorry, I used milliseconds instead of seconds, but it sounds like my example
was clear enough despite that.
In the interest of moving things along, let's try to get YARN-4676 committed as
soon as possible, and then take care of the remaining items as followup JIRAs.
It sounds like some of these require more discussion, and it might be easier
this way instead of holding up YARN-4676 until everything is perfect. So let's
do this:
1. Junping will commit YARN-5434 to add the -client|server arguments
tomorrow (Friday); for 2.8+
2. Daniel will rebase YARN-4676 on top of the latest, and make any minor
changes due to YARN-5434; for 2.9+
o Junping and I can do a final review and hopefully commit early next week
3. We'll open followup JIRAs for the following items where we can
discuss/implement more:
1. Abstract out the host file format to allow the txt format, XML format,
and JSON format (a rough sketch of one possible abstraction follows below)
2. Figure out if we want to change the behavior of subsequent parallel
calls to gracefully decom nodes
Does that sound like a good plan?
Did I miss anything?
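To make followup item 1 a bit more concrete, here is a rough sketch of one possible shape for the host-file abstraction. The interface and class names, the fields, and the idea of parsing into per-host entries are all assumptions for discussion, not a committed design:

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.List;

  /** Hypothetical per-host entry: host name plus an optional decommission timeout in seconds. */
  class HostEntry {
    final String hostName;
    final Integer timeoutSec;   // null means "use the request/default timeout"

    HostEntry(String hostName, Integer timeoutSec) {
      this.hostName = hostName;
      this.timeoutSec = timeoutSec;
    }
  }

  /** Hypothetical abstraction: one implementation per on-disk format (txt, XML, JSON). */
  interface HostsFileFormat {
    List<HostEntry> parse(InputStream in) throws IOException;
  }

NodesListManager could then consume a List<HostEntry> regardless of which format produced it, which is roughly what "abstract out the host file format" would mean in practice.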
On Tue, Jul 26, 2016 at 10:50 PM, Zhi, Daniel <[email protected]> wrote:
Karthik: the timeout is currently always in units of seconds, for both the
command-line arg and the internal variables. Robert's example should be
"-refreshNodes -t 30".
Robert: the timeout overwrite behavior you observed matches the code logic inside
NodesListManager.java. The first "-refreshNodes -t 120" sets 120 on A. The second
"-refreshNodes -t 30" sets 30 on both A and B. In this case, there is no timeout
in the exclude file, so timeoutToUse is the request timeout (120, then 30), and
A's timeout is updated to 30 by lines 287 and 311. To some extent, this is by
design, as both refreshes refer to the set of hosts inside the exclude host file.
Specifically, the second refresh is about both A and B instead of B only, even
though your intention was only about B. In your case, if the timeout were
specified inside the exclude host file, it would work as you expected.
Although the code could be changed to not update an existing node's timeout
unless that timeout comes from the host file (per-node overwrite; see the sketch
after the snippet below), I couldn't tell quickly whether that is better, given
possible other side effects. This is something we can evaluate more.
252 private void handleExcludeNodeList(boolean graceful, Integer timeout) {
276   // Use per node timeout if exist otherwise the request timeout.
277   Integer timeoutToUse = (timeouts.get(n.getHostName()) != null)?
278       timeouts.get(n.getHostName()) : timeout;
283   } else if (s == NodeState.DECOMMISSIONING &&
284       !Objects.equals(n.getDecommissioningTimeout(),
285           timeoutToUse)) {
286     LOG.info("Update " + nodeStr + " timeout to be " + timeoutToUse);
287     nodesToDecom.add(n);
311   e = new RMNodeDecommissioningEvent(n.getNodeID(), timeoutToUse);
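For reference, a minimal sketch of the per-node-overwrite alternative mentioned above, i.e. only updating a node that is already DECOMMISSIONING when its timeout was explicitly set in the exclude host file. It reuses the variable names from the snippet above, but it is a hypothetical variant, not the committed NodesListManager code:

  // Hypothetical variant of the timeout-selection logic above (not the committed code):
  Integer perNodeTimeout = timeouts.get(n.getHostName());
  Integer timeoutToUse = (perNodeTimeout != null) ? perNodeTimeout : timeout;
  // ... unchanged handling of the other node states ...
  } else if (s == NodeState.DECOMMISSIONING
      && perNodeTimeout != null
      && !Objects.equals(n.getDecommissioningTimeout(), perNodeTimeout)) {
    // Update a node that is already decommissioning only when its timeout was
    // explicitly set in the exclude host file; a plain "-refreshNodes -t X"
    // would then leave earlier decommissions (node A in Robert's example) alone.
    LOG.info("Update " + nodeStr + " timeout to be " + perNodeTimeout);
    nodesToDecom.add(n);
  }

The trade-off, as noted above, is that a request-level timeout could then no longer be used to adjust nodes that are already draining.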
From: Karthik Kambatla [mailto:[email protected]]
Sent: Tuesday, July 26, 2016 6:08 PM
To: Robert Kanter; Zhi, Daniel
Cc: Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion
Related, but orthogonal.
Do we see value in using milliseconds for the timeout? Should it be seconds?
On Tue, Jul 26, 2016 at 5:55 PM Robert Kanter <[email protected]> wrote:
I spoke with Junping and Karthik earlier today. We agreed that to simplify
support for client-side tracking and server-side tracking, we should add a
required -client|server argument. We'll add this in 2.8 to be compatible with
the server-side tracking, when it's ready (probably 2.9); in the meantime, the
-server flag will throw an Exception. This will help simplify things a little
in that client-side tracking and server-side tracking are now mutually
exclusive.
For example
yarn rmadmin -refreshNodes -g 1000 -client
will do client-side tracking, while
yarn rmadmin -refreshNodes -g 1000 -server
will do server-side tracking (once committed).
I've created YARN-5434 to implement the -client|server argument (and fix the
timeout argument not actually being optional). @Junping, please review it.
Once we get YARN-5434 committed, @Daniel, can you update YARN-4676 to support
this? I think that should only require minimal changes to the CLI, and there
was that 5 second overhead we probably don't need anymore.
Long term, I agree that being able to specify the nodes on the CLI would be
easier. I think that's something we should revisit later, once we get the
current work done.
One other thing I noticed, and just double-checked, is that the behavior of
subsequent graceful decom commands causes the previous nodes to update their
timeout. An example, to clarify:
Setup: nodeA and nodeB are in the include file
1. Start a long-running job that has containers running on nodeA
2. Add nodeA to the exclude file
3. Run '-refreshNodes -g 120000' (2min) to begin gracefully decommissioning
nodeA. The CLI blocks, waiting.
4. Wait 30 seconds. The above CLI command should have 90 seconds left
5. Add nodeB to the exclude file
6. Run '-refreshNodes -g 30000' (30 sec) in another window
7. Wait 29 seconds. The original CLI command should have 61 seconds left; the new
command should have 1 second left
8. After 1 more second, both nodeA and nodeB shut down and both CLI commands exit
It would be better if the second command had no bearing on the first command.
So in the above example, only nodeB should have shut down at step 8, while nodeA
should have shut down 60 seconds later. Otherwise, you can't run concurrent
graceful decoms. If the user wants to change the timeout of an existing decom,
they can recommission it, and then gracefully decommission it again with the
new timeout.
Does that make sense? Did I do something wrong here?
On Tue, Jul 26, 2016 at 9:55 AM, Zhi, Daniel <[email protected]> wrote:
DecommissioningNodesWatcher only tracks nodes in the DECOMMISSIONING state during
the decommissioning period. It is supposed to be near-zero cost for all other
nodes and at all other times.
Upon RM restart, NodesListManager will decommission all hosts that appear in
the exclude host file. That behavior remains the same with or without YARN-4676.
So if the RM restarts in the middle of "client-side tracking", the nodes will
become decommissioned right away, instead of waiting for the client to
decommission them (the RM does not even know there is such a client). So overall,
supporting the restart scenario has more to do with persisting and recovering
node state than with client-side vs. server-side tracking.
So I don't see strong arguments for "client-side tracking" from the performance
impact and restart perspectives. That said, my real concern is how to support
both on the RM side so that the code is not too complicated, or at least remains
compatible. For example, the following code inside RMNodeImpl, although it
appears to be under construction, is likely how a node becomes decommissioned
under "client-side tracking". It acts on the StatusUpdate event, which is
different from DecommissioningNodesWatcher acting on the NodeManager heartbeat
messages. I don't fully understand the StatusUpdate event mechanism, but I found
that it conflicts with DecommissioningNodesWatcher. Does this code need to be
restored within an if (clientSideTracking) block (see the sketch after the
snippet below)? There could be other state-transition or event-related code that
affects state transitions, counters, resources, etc. that needs to be compatible
and verified. Things could get tricky and complicated, adding to the complexity
of supporting both.
1134 public static class StatusUpdateWhenHealthyTransition implements
1159   if (isNodeDecommissioning) {
1160     List<ApplicationId> runningApps = rmNode.getRunningApps();
1161
1162     List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
1163
1164     // no running (and keeping alive) app on this node, get it
1165     // decommissioned.
1166     // TODO may need to check no container is being scheduled on this node
1167     // as well.
1168     if ((runningApps == null || runningApps.size() == 0)
1169         && (keepAliveApps == null || keepAliveApps.size() == 0)) {
1170       RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
1171       return NodeState.DECOMMISSIONED;
1172     }
1173
1174     // TODO (in YARN-3223) if node in decommissioning, get node resource
1175     // updated if container get finished (keep available resource to be 0)
1176   }
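If the answer to the question above is "yes", a minimal sketch of the gating could look like the following. The clientSideTracking flag (and how it would be configured) is an assumption for illustration, not existing RMNodeImpl code:

  if (isNodeDecommissioning) {
    if (clientSideTracking) {
      // Client-side tracking: the RM deactivates the node itself as soon as no
      // running or keep-alive apps remain, as in the excerpt above.
      List<ApplicationId> runningApps = rmNode.getRunningApps();
      List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
      if ((runningApps == null || runningApps.isEmpty())
          && (keepAliveApps == null || keepAliveApps.isEmpty())) {
        RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
        return NodeState.DECOMMISSIONED;
      }
    }
    // Otherwise (server-side tracking), leave the transition to
    // DecommissioningNodesWatcher, which acts on NodeManager heartbeats
    // rather than on this StatusUpdate event.
  }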
From: Junping Du [mailto:[email protected]]
Sent: Tuesday, July 26, 2016 7:46 AM
To: Zhi, Daniel; Ming Ma; Robert Kanter
Cc: Karthik Kambatla
Subject: Re: YARN-4676 discussion
In my understanding, the values of client-side tracking are:
- We may not want the RM to track things that are not strictly necessary in a
very large-scale YARN cluster where the RM's resources are already the
bottleneck. Although I don't think DecommissioningNodesWatcher would consume too
many resources, I would prefer to play it safe and give users a chance to take
this pressure off the RM.
- As a client-side tracking tool, users won't expect the timeout tracking to be
rock solid in every case, which means they must have a backup plan if the
tracking timeout is essential, i.e. it can also be tracked by cluster management
software such as Ambari or Cloudera Manager. That's why I would like to
emphasize that server-side tracking needs to address the RM restart case,
because users' expectations differ with the different price we pay.
So I would prefer that we have both ways (client-side tracking and server-side
tracking) for YARN users to choose between for different scenarios. Do we have
different opinions here?
I agree with Ming's point that adding the ability to specify a host list on the
command line could be simpler and more convenient in some cases, and I also
agree it could be a next step, as we need to figure out a way to persist these
ad-hoc nodes for decommission/recommission. Ming, would you like to open a JIRA
for this discussion? Thanks!
________________________________________
From: Zhi, Daniel <[email protected]>
Sent: Tuesday, July 26, 2016 5:20 AM
To: Ming Ma; Robert Kanter
Cc: Karthik Kambatla; Junping Du; Zhi, Daniel
Subject: RE: YARN-4676 discussion
The "server-side tracking" (client does not wait) is friendly and necessary for
cluster control software that decommissions/recommissions hosts automatically.
The "client-side tracking" is friendly for an admin who manually issues the
decommission command and waits for completion.
The issue that "-refreshNodes -g timeout" is always blocking could be addressed
by an additional "-wait" or "-nowait" flag, depending on which default behavior
we choose. Whether that is sufficient depends on the purpose of "client-side
tracking": if the core purpose is to have a blocking experience on the client
side and to ensure forceful decommission after the timeout as necessary, then
the code already does that. If there is some other, unspecified server-side
value that I am not aware of, then more changes are needed to bridge the gap
(for example, disengaging DecommissioningNodesWatcher inside the RM, as it will
decommission the node when it is ready). Overall, can anyone clarify the exact
purpose and scope of "client-side tracking"?
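For concreteness, a minimal sketch of how such a -wait/-nowait flag could gate the blocking behavior in the CLI; the flag name, the default, and the helper methods are assumptions, not existing RMAdminCLI options:

  // Hypothetical flag handling (assumed names; default keeps today's blocking behavior).
  boolean blocking = !argList.contains("-nowait");

  // The graceful refreshNodes request with the user timeout is submitted either way
  // (assumed helper wrapping the RefreshNodesRequest described later in this thread).
  submitGracefulRefreshNodes(timeoutSec);

  if (blocking) {
    // "Client-side tracking": wait for the nodes to drain and force-decommission
    // after the timeout, as the current code already does.
    trackDecommissioningNodes(timeoutSec);
  }
  // With -nowait, return immediately and rely on the RM-side tracking.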
Regarding the scenario mentioned by Ming, where the admin prefers to directly
supply hosts in the refresh command for convenience: I see the value in
supporting it; however, the server-side logic would need to merge or consolidate
such hosts with the ones specified in the exclude/include host files. For
example, another -refreshNodes could be invoked the traditional way with a
conflicting setting for the same host inside the exclude/include file. Besides,
I am not sure whether a separate -decommissionNodes command would be better.
Overall, it is an alternative, convenient flavor to support, not necessarily
core to "client-side tracking" vs. "server-side tracking".
From: Ming Ma [mailto:[email protected]]
Sent: Tuesday, July 19, 2016 5:47 PM
To: Robert Kanter
Cc: Zhi, Daniel; Karthik Kambatla; Junping Du
Subject: Re: YARN-4676 discussion
I just went over the discussion. It sort of reminded me of discussions we had
when trying to add timeout support to HDFS datanode maintenance mode,
specifically
https://issues.apache.org/jira/browse/HDFS-7877?focusedCommentId=14739108&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14739108.
We considered a couple of approaches:
• Have admin tools outside Hadoop core manage the timeout. There is no Hadoop
API to support the timeout; the external tools need to track state such as
expiration and call Hadoop to change the node to the final state upon timeout.
• Have the Hadoop service manage the timeout without persistence. The admin
issues refreshNodes with a timeout value and the NameNode keeps track of the
expiration, but will lose the data upon restart.
• Have the Hadoop service manage the timeout with persistence. This means the
timeout or expiration timestamp needs to be persisted somewhere.
Unless something else comes up, we plan to go with the last approach to support
the HDFS maintenance state. Admins have to use host files to configure the
expiration timestamp. (We discussed the idea of allowing admins to specify the
host name and timeout as part of dfsadmin -refreshNodes, but that would require
some major changes in terms of how HDFS persists datanode properties.)
For YARN graceful decommission, the approaches we can take are similar.
From the admin API's point of view, there are two approaches:
• Configure everything in the hosts file, including whether it is a regular
decommission or a graceful decommission, at the individual-host level. The
client then no longer needs to send a graceful-decommission flag to the RM
over RPC.
• Configure everything in the refreshNodes command, such as "rmadmin
-refreshNodes -g -t 200 -host hostFoo". Our admins like this. It allows
automation tools to decommission nodes from any machine. However, it introduces
new complexity in terms of how to persist node properties (a rough sketch of the
data involved follows below). This might be the long-term solution for later.
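As referenced in the second bullet, here is a rough sketch of the per-host information either approach ultimately has to carry and persist somewhere; the class and field names are hypothetical, not an existing YARN or HDFS type:

  /**
   * Hypothetical value object describing one host's decommission request, whether
   * it comes from the hosts file or from a future
   * "rmadmin -refreshNodes -g -t 200 -host hostFoo" style command.
   */
  class HostDecommissionSpec {
    final String hostName;
    final boolean graceful;     // graceful vs. regular decommission
    final Integer timeoutSec;   // null = use the cluster default timeout

    HostDecommissionSpec(String hostName, boolean graceful, Integer timeoutSec) {
      this.hostName = hostName;
      this.graceful = graceful;
      this.timeoutSec = timeoutSec;
    }
  }

The open question in both bullets is where this information lives (hosts file vs. service state) so that it survives a restart.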
Maybe we can also get input from the Yahoo and Microsoft folks on admin usage by
continuing the discussion in the JIRA?
Thanks all. Overall this feature is quite useful and we look forward to it.
On Fri, Jul 15, 2016 at 4:22 PM, Robert Kanter <[email protected]> wrote:
+Karthik, who I've also been discussing this with internally.
On Thu, Jul 14, 2016 at 6:41 PM, Robert Kanter <[email protected]> wrote:
Pushing the JSON format to a followup JIRA sounds like a good idea to me. I
agree that we should focus on keeping the scope limited enough for YARN-4676 so
we can move forward.
Besides having a way to disable/enable server-side tracking, I think it would
make sense to have a way to disable/enable client-side tracking. Daniel goes
through the different scenarios, and there's a gap here: what if the user wants
to set a non-default timeout, but use server-side tracking only? If they specify
the timeout on the command line, they'll have to hit Ctrl-C, which isn't the
greatest user experience and isn't script-friendly.
Perhaps we should have it so that client-side tracking is normally done, but if
you specify a new optional argument (e.g. "--no-client-tracking" or something)
it won't. This way, we can support client-side only tracking and server-side
only tracking (I suppose this would allow no tracking, but that wouldn't work
right).
Alternatively, we could make client-side and server-side tracking mutually
exclusive, which might simplify things. The default is client-side tracking
only, but if you specify a new optional argument (e.g. "--server" or
something), it will do server-side tracking only.
Thoughts?
I'm kind of leaning towards making them mutually exclusive because the behavior
is easier to understand and better defined.
- Robert
On Thu, Jul 14, 2016 at 2:49 PM, Zhi, Daniel <[email protected]> wrote:
(added Robert to the thread and updated subject)
Regarding client-side tracking, let me describe my observation of the behavior
to ensure we are all on the same page. I have run the "-refreshNodes -g
[timeout in seconds]" command during the patch iterations inside a real cluster
(except for the last patch on 06/12/2016, where I couldn't build the AMI as
normal due to a Java 8 related conflict with another component in our AMI
rolling process).
I assume the interesting part for this conversation is when a timeout value is
specified. It will hit the refreshNodes(int timeout) function in RMAdminCLI.java,
which I pasted below for convenience. The logic submits a RefreshNodesRequest
with the user timeout, which the server-side code interprets as the default
timeout (overriding the one in the yarn-site.xml file). For an individual node
with an explicit timeout specified in the exclude host file, that specific
timeout will be used.
The server-side tracking supports (is compatible with) multiple overlapping
refreshNodes requests. For example, at time T1 a user might want to gracefully
decommission 10 nodes; while they are still pending decommission at time T2, the
user changes their mind and decommissions 5 nodes instead (5 nodes will be
removed from the exclude file and the RM will bring them back to RUNNING);
later, at time T3, the user is free to gracefully decommission more or fewer
nodes as they wish. The enhanced logic in NodesListManager.java compares the
user's intention with the current state of each RMNode and issues the proper
actions (including dynamically modifying the timeout) through events.
The client-side tracking code force-decommissions the nodes (line 362) upon
timeout. The timeout was previously tracked purely by the client (the client
must keep running to enforce it). If the client is somehow terminated (for
example, Ctrl+C by the user) before the timeout, the timeout is not enforced.
Further, if the same command is run again, the timeout clock resets. Also, the
user might have issued refreshNodes() without an explicit timeout before, in
which case the RM uses the default timeout, so a new refreshNodes(timeout)
request with client-side tracking might not lead to the desired timeout.
The current code (below) tries to balance the client-side tracking and the
server-side logic so that they are mostly compatible. For the simple scenario
where only client-side tracking is used (I assume this is Junping's scenario), I
expect the user experience to be similar. The client code below waits for 5
extra seconds before force decommissioning, which is usually not needed, as the
node will become decommissioned before or upon the timeout on the RM side. The
client will finish earlier should the node become decommissioned earlier. I have
verified this was the behavior before, but not with the 06/12 patch.
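To summarize the behavior described above in code form, here is a minimal sketch of the client-side tracking loop: poll until the excluded nodes are decommissioned, allow a few extra seconds past the timeout, then fall back to a forceful refresh. The helper names and the one-second polling interval are assumptions (java.util.concurrent.TimeUnit import omitted), not the actual RMAdminCLI implementation:

  // Hypothetical sketch of the client-side tracking loop (assumed helper names).
  private int trackAndForceDecommission(int timeoutSec) throws Exception {
    long deadlineMs = System.currentTimeMillis()
        + TimeUnit.SECONDS.toMillis(timeoutSec + 5L);  // the 5-second grace mentioned above

    while (System.currentTimeMillis() < deadlineMs) {
      if (allExcludedNodesDecommissioned()) {  // assumed helper querying the RM
        return 0;            // nodes drained early: finish before the timeout
      }
      Thread.sleep(1000);    // poll once per second (assumed interval)
    }

    // Timeout (plus grace) elapsed with nodes still DECOMMISSIONING:
    // fall back to a regular, forceful refreshNodes.
    refreshNodesForcefully();                  // assumed helper
    return 0;
  }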
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Zhi
> Assignee: Daniel Zhi
> Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf,
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch,
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch,
> YARN-4676.015.patch, YARN-4676.016.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to
> gracefully decommission YARN nodes. After the user issues the refreshNodes
> request, the ResourceManager automatically evaluates the status of all affected
> nodes and kicks off decommission or recommission actions. The RM asynchronously
> tracks container and application status related to DECOMMISSIONING nodes in
> order to decommission the nodes immediately after they are ready to be
> decommissioned. Decommissioning timeouts at individual-node granularity are
> supported and can be dynamically updated. The mechanism naturally supports
> multiple independent graceful decommissioning "sessions" where each one
> involves different sets of nodes with different timeout settings. Such support
> is ideal and necessary for graceful decommission requests issued by external
> cluster management software instead of a human.
> DecommissioningNodeWatcher inside ResourceTrackingService tracks
> DECOMMISSIONING node status automatically and asynchronously after the
> client/admin makes the graceful decommission request. It tracks
> DECOMMISSIONING node status to decide when, after all running containers on
> the node have completed, the node will be transitioned into the DECOMMISSIONED
> state. NodesListManager detects and handles include and exclude list changes
> to kick off decommission or recommission as necessary.