[
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400947#comment-15400947
]
Daniel Zhi commented on YARN-4676:
----------------------------------
Transcript of recent design discussions conducted over email:
From: Junping Du [mailto:[email protected]]
Sent: Friday, July 29, 2016 10:40 AM
To: Robert Kanter; Zhi, Daniel
Cc: Karthik Kambatla; Ming Ma
Subject: Re: YARN-4676 discussion
The plan sounds reasonable to me. Thanks, Robert, for coordinating on this.
I just committed YARN-5434 to branch-2.8. So Daniel, please start your rebase work
at your earliest convenience - I will try out the functionality and review the
code as well. About the previous questions, it is better to have such discussions
(with technical details) on the JIRA so we won't lose track of them in the future,
and it helps raise the visibility of your work, as more people will sooner or
later watch this.
________________________________________
From: Robert Kanter <[email protected]>
Sent: Friday, July 29, 2016 2:20 AM
To: Zhi, Daniel
Cc: Karthik Kambatla; Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion
Sorry, I used milliseconds instead of seconds, but it sounds like my example
was clear enough despite that.
In the interest of moving things along, let's try to get YARN-4676 committed as
soon as possible, and then take care of the remaining items as followup JIRAs.
It sounds like some of these require more discussion, and it might be easier
this way instead of holding up YARN-4676 until everything is perfect. So let's
do this:
1. Junping will commit YARN-5434 to add the -client|server arguments
tomorrow (Friday); for 2.8+
2. Daniel will rebase YARN-4676 on top of the latest, and make any minor
changes due to YARN-5434; for 2.9+
o Junping and I can do a final review and hopefully commit early next week
3. We'll open followup JIRAs for the following items where we can
discuss/implement more:
1. Abstract out the host file format to allow the txt format, XML format,
and JSON format (a rough sketch of one possible abstraction follows below)
2. Figure out if we want to change the behavior of subsequent parallel
calls to gracefully decom nodes
Does that sound like a good plan?
Did I miss anything?
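To make followup item 1 a bit more concrete, here is a rough sketch of one possible shape for the host-file abstraction. The interface and class names, the fields, and the idea of parsing into per-host entries are all assumptions for discussion, not a committed design:

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.List;

  /** Hypothetical per-host entry: host name plus an optional decommission timeout in seconds. */
  class HostEntry {
    final String hostName;
    final Integer timeoutSec;   // null means "use the request/default timeout"

    HostEntry(String hostName, Integer timeoutSec) {
      this.hostName = hostName;
      this.timeoutSec = timeoutSec;
    }
  }

  /** Hypothetical abstraction: one implementation per on-disk format (txt, XML, JSON). */
  interface HostsFileFormat {
    List<HostEntry> parse(InputStream in) throws IOException;
  }

NodesListManager could then consume a List<HostEntry> regardless of which format produced it, which is roughly what "abstract out the host file format" would mean in practice.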
On Tue, Jul 26, 2016 at 10:50 PM, Zhi, Daniel <[email protected]> wrote:
Karthik: the timeout is currently always in units of seconds, for both the
command-line arg and the internal variables. Robert's example should be
"-refreshNodes -t 30".
Robert: the timeout overwrite behavior you observed matches the code logic inside
NodesListManager.java. The first "-refreshNodes -t 120" sets 120 on A. The second
"-refreshNodes -t 30" sets 30 on both A and B. In this case, there is no timeout
in the exclude file, so timeoutToUse is the request timeout (120, then 30), and
A's timeout is updated to 30 by lines 287 and 311. To some extent, this is by
design, as both refreshes refer to the set of hosts inside the exclude host file.
Specifically, the second refresh is about both A and B instead of B only, even
though your intention was only about B. In your case, if the timeout were
specified inside the exclude host file, it would work as you expected.
Although the code could be changed to not update an existing node's timeout
unless that timeout comes from the host file (per-node overwrite; see the sketch
after the snippet below), I couldn't tell quickly whether that is better, given
possible other side effects. This is something we can evaluate more.
252 private void handleExcludeNodeList(boolean graceful, Integer timeout) {
276   // Use per node timeout if exist otherwise the request timeout.
277   Integer timeoutToUse = (timeouts.get(n.getHostName()) != null)?
278       timeouts.get(n.getHostName()) : timeout;
283   } else if (s == NodeState.DECOMMISSIONING &&
284       !Objects.equals(n.getDecommissioningTimeout(),
285           timeoutToUse)) {
286     LOG.info("Update " + nodeStr + " timeout to be " + timeoutToUse);
287     nodesToDecom.add(n);
311   e = new RMNodeDecommissioningEvent(n.getNodeID(), timeoutToUse);
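For reference, a minimal sketch of the per-node-overwrite alternative mentioned above, i.e. only updating a node that is already DECOMMISSIONING when its timeout was explicitly set in the exclude host file. It reuses the variable names from the snippet above, but it is a hypothetical variant, not the committed NodesListManager code:

  // Hypothetical variant of the timeout-selection logic above (not the committed code):
  Integer perNodeTimeout = timeouts.get(n.getHostName());
  Integer timeoutToUse = (perNodeTimeout != null) ? perNodeTimeout : timeout;
  // ... unchanged handling of the other node states ...
  } else if (s == NodeState.DECOMMISSIONING
      && perNodeTimeout != null
      && !Objects.equals(n.getDecommissioningTimeout(), perNodeTimeout)) {
    // Update a node that is already decommissioning only when its timeout was
    // explicitly set in the exclude host file; a plain "-refreshNodes -t X"
    // would then leave earlier decommissions (node A in Robert's example) alone.
    LOG.info("Update " + nodeStr + " timeout to be " + perNodeTimeout);
    nodesToDecom.add(n);
  }

The trade-off, as noted above, is that a request-level timeout could then no longer be used to adjust nodes that are already draining.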
From: Karthik Kambatla [mailto:[email protected]]
Sent: Tuesday, July 26, 2016 6:08 PM
To: Robert Kanter; Zhi, Daniel
Cc: Junping Du; Ming Ma
Subject: Re: YARN-4676 discussion
Related, but orthogonal.
Do we see value in using milliseconds for the timeout? Should it be seconds?
On Tue, Jul 26, 2016 at 5:55 PM Robert Kanter <[email protected]> wrote:
I spoke with Junping and Karthik earlier today. We agreed that to simplify
support for client-side tracking and server-side tracking, we should add a
required -client|server argument. We'll add this in 2.8 to be compatible with
the server-side tracking, when it's ready (probably 2.9); in the meantime, the
-server flag will throw an Exception. This will help simplify things a little
in that client-side tracking and server-side tracking are now mutually
exclusive.
For example
yarn rmadmin -refreshNodes -g 1000 -client
will do client-side tracking, while
yarn rmadmin -refreshNodes -g 1000 -server
will do server-side tracking (once committed).
I've created YARN-5434 to implement the -client|server argument (and fix the
timeout argument not actually being optional). @Junping, please review it.
Once we get YARN-5434 committed, @Daniel, can you update YARN-4676 to support
this? I think that should only require minimal changes to the CLI, and there
was that 5 second overhead we probably don't need anymore.
Long term, I agree that being able to specify the nodes on the CLI would be
easier. I think that's something we should revisit later, once we get the
current work done.
One other thing I noticed, and just double-checked, is that the behavior of
subsequent graceful decom commands causes the previous nodes to update their
timeout. An example, to clarify:
Setup: nodeA and nodeB are in the include file
1. Start a long-running job that has containers running on nodeA
2. Add nodeA to the exclude file
3. Run '-refreshNodes -g 120000' (2min) to begin gracefully decommissioning
nodeA. The CLI blocks, waiting.
4. Wait 30 seconds. The above CLI command should have 90 seconds left
5. Add nodeB to the exclude file
6. Run '-refreshNodes -g 30000' (30 sec) in another window
7. Wait 29 seconds. The original CLI command should have 61 seconds left; the new
command should have 1 second left
8. After 1 more second, both nodeA and nodeB shut down and both CLI commands exit
It would be better if the second command had no bearing on the first command.
So in the above example, only nodeB should have shut down at step 8, while nodeA
should have shut down 60 seconds later. Otherwise, you can't run concurrent
graceful decoms. If the user wants to change the timeout of an existing decom,
they can recommission it, and then gracefully decommission it again with the
new timeout.
Does that make sense? Did I do something wrong here?
On Tue, Jul 26, 2016 at 9:55 AM, Zhi, Daniel <[email protected]> wrote:
DecommissioningNodesWatcher only tracks nodes in the DECOMMISSIONING state during
the decommissioning period. It is supposed to be near-zero cost for all other
nodes and at all other times.
Upon RM restart, NodesListManager will decommission all hosts that appear in
the exclude host file. That behavior remains the same with or without YARN-4676.
So if the RM restarts in the middle of "client-side tracking", the nodes will
become decommissioned right away, instead of waiting for the client to
decommission them (the RM does not even know there is such a client). So overall,
supporting the restart scenario has more to do with persisting and recovering
node state than with client-side vs. server-side tracking.
So I don't see strong arguments for "client-side tracking" from the performance
impact and restart perspectives. That said, my real concern is how to support
both on the RM side so that the code is not too complicated, or at least remains
compatible. For example, the following code inside RMNodeImpl, although it
appears to be under construction, is likely how a node becomes decommissioned
under "client-side tracking". It acts on the StatusUpdate event, which is
different from DecommissioningNodesWatcher acting on the NodeManager heartbeat
messages. I don't fully understand the StatusUpdate event mechanism, but I found
that it conflicts with DecommissioningNodesWatcher. Does this code need to be
restored within an if (clientSideTracking) block (see the sketch after the
snippet below)? There could be other state-transition or event-related code that
affects state transitions, counters, resources, etc. that needs to be compatible
and verified. Things could get tricky and complicated, adding to the complexity
of supporting both.
1134 public static class StatusUpdateWhenHealthyTransition implements
1159   if (isNodeDecommissioning) {
1160     List<ApplicationId> runningApps = rmNode.getRunningApps();
1161
1162     List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
1163
1164     // no running (and keeping alive) app on this node, get it
1165     // decommissioned.
1166     // TODO may need to check no container is being scheduled on this node
1167     // as well.
1168     if ((runningApps == null || runningApps.size() == 0)
1169         && (keepAliveApps == null || keepAliveApps.size() == 0)) {
1170       RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
1171       return NodeState.DECOMMISSIONED;
1172     }
1173
1174     // TODO (in YARN-3223) if node in decommissioning, get node resource
1175     // updated if container get finished (keep available resource to be 0)
1176   }
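If the answer to the question above is "yes", a minimal sketch of the gating could look like the following. The clientSideTracking flag (and how it would be configured) is an assumption for illustration, not existing RMNodeImpl code:

  if (isNodeDecommissioning) {
    if (clientSideTracking) {
      // Client-side tracking: the RM deactivates the node itself as soon as no
      // running or keep-alive apps remain, as in the excerpt above.
      List<ApplicationId> runningApps = rmNode.getRunningApps();
      List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
      if ((runningApps == null || runningApps.isEmpty())
          && (keepAliveApps == null || keepAliveApps.isEmpty())) {
        RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
        return NodeState.DECOMMISSIONED;
      }
    }
    // Otherwise (server-side tracking), leave the transition to
    // DecommissioningNodesWatcher, which acts on NodeManager heartbeats
    // rather than on this StatusUpdate event.
  }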
From: Junping Du [mailto:[email protected]]
Sent: Tuesday, July 26, 2016 7:46 AM
To: Zhi, Daniel; Ming Ma; Robert Kanter
Cc: Karthik Kambatla
Subject: Re: YARN-4676 discussion
In my understanding, the values of client-side tracking are:
- We may not want the RM to track things that are not strictly necessary in a
very large-scale YARN cluster where the RM's resources are already the
bottleneck. Although I don't think DecommissioningNodesWatcher would consume too
many resources, I would prefer to play it safe and give users a chance to take
this pressure off the RM.
- As a client-side tracking tool, users won't expect the timeout tracking to be
rock solid in every case, which means they must have a backup plan if the
tracking timeout is essential, i.e. it can also be tracked by cluster management
software such as Ambari or Cloudera Manager. That's why I would like to
emphasize that server-side tracking needs to address the RM restart case,
because users' expectations differ with the different price we pay.
So I would prefer that we have both ways (client-side tracking and server-side
tracking) for YARN users to choose between for different scenarios. Do we have
different opinions here?
I agree with Ming's point that adding the ability to specify a host list on the
command line could be simpler and more convenient in some cases, and I also
agree it could be a next step, as we need to figure out a way to persist these
ad-hoc nodes for decommission/recommission. Ming, would you like to open a JIRA
for this discussion? Thanks!
________________________________________
From: Zhi, Daniel <[email protected]>
Sent: Tuesday, July 26, 2016 5:20 AM
To: Ming Ma; Robert Kanter
Cc: Karthik Kambatla; Junping Du; Zhi, Daniel
Subject: RE: YARN-4676 discussion
The "server-side tracking" (client does not wait) is friendly and necessary for
cluster control software that decommissions/recommissions hosts automatically.
The "client-side tracking" is friendly for an admin who manually issues the
decommission command and waits for completion.
The issue that "-refreshNodes -g timeout" is always blocking could be addressed
by an additional "-wait" or "-nowait" flag, depending on which default behavior
we choose. Whether that is sufficient depends on the purpose of "client-side
tracking": if the core purpose is to have a blocking experience on the client
side and to ensure forceful decommission after the timeout as necessary, then
the code already does that. If there is some other, unspecified server-side
value that I am not aware of, then more changes are needed to bridge the gap
(for example, disengaging DecommissioningNodesWatcher inside the RM, as it will
decommission the node when it is ready). Overall, can anyone clarify the exact
purpose and scope of "client-side tracking"?
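For concreteness, a minimal sketch of how such a -wait/-nowait flag could gate the blocking behavior in the CLI; the flag name, the default, and the helper methods are assumptions, not existing RMAdminCLI options:

  // Hypothetical flag handling (assumed names; default keeps today's blocking behavior).
  boolean blocking = !argList.contains("-nowait");

  // The graceful refreshNodes request with the user timeout is submitted either way
  // (assumed helper wrapping the RefreshNodesRequest described later in this thread).
  submitGracefulRefreshNodes(timeoutSec);

  if (blocking) {
    // "Client-side tracking": wait for the nodes to drain and force-decommission
    // after the timeout, as the current code already does.
    trackDecommissioningNodes(timeoutSec);
  }
  // With -nowait, return immediately and rely on the RM-side tracking.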
Regarding the scenario mentioned by Ming, where the admin prefers to directly
supply hosts in the refresh command for convenience: I see the value in
supporting it; however, the server-side logic would need to merge or consolidate
such hosts with the ones specified in the exclude/include host files. For
example, another -refreshNodes could be invoked the traditional way with a
conflicting setting for the same host inside the exclude/include file. Besides,
I am not sure whether a separate -decommissionNodes command would be better.
Overall, it is an alternative, convenient flavor to support, not necessarily
core to "client-side tracking" vs. "server-side tracking".
From: Ming Ma [mailto:[email protected]]
Sent: Tuesday, July 19, 2016 5:47 PM
To: Robert Kanter
Cc: Zhi, Daniel; Karthik Kambatla; Junping Du
Subject: Re: YARN-4676 discussion
I just went over the discussion. It sort of reminded me of discussions we had
when trying to add timeout support to HDFS datanode maintenance mode,
specifically
https://issues.apache.org/jira/browse/HDFS-7877?focusedCommentId=14739108&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14739108.
We considered a couple of approaches:
• Have admin tools outside Hadoop core manage the timeout. There is no Hadoop
API to support the timeout; the external tools need to track state such as
expiration and call Hadoop to change the node to the final state upon timeout.
• Have the Hadoop service manage the timeout without persistence. The admin
issues refreshNodes with a timeout value and the NameNode keeps track of the
expiration, but will lose the data upon restart.
• Have the Hadoop service manage the timeout with persistence. This means the
timeout or expiration timestamp needs to be persisted somewhere.
Unless something else comes up, we plan to go with the last approach to support
the HDFS maintenance state. Admins have to use host files to configure the
expiration timestamp. (We discussed the idea of allowing admins to specify the
host name and timeout as part of dfsadmin -refreshNodes, but that would require
some major changes in terms of how HDFS persists datanode properties.)
For YARN graceful decommission, the approaches we can take are similar.
From the admin API's point of view, there are two approaches:
• Configure everything in the hosts file, including whether it is a regular
decommission or a graceful decommission, at the individual-host level. The
client then no longer needs to send a graceful-decommission flag to the RM
over RPC.
• Configure everything in the refreshNodes command, such as "rmadmin
-refreshNodes -g -t 200 -host hostFoo". Our admins like this. It allows
automation tools to decommission nodes from any machine. However, it introduces
new complexity in terms of how to persist node properties (a rough sketch of the
data involved follows below). This might be the long-term solution for later.
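As referenced in the second bullet, here is a rough sketch of the per-host information either approach ultimately has to carry and persist somewhere; the class and field names are hypothetical, not an existing YARN or HDFS type:

  /**
   * Hypothetical value object describing one host's decommission request, whether
   * it comes from the hosts file or from a future
   * "rmadmin -refreshNodes -g -t 200 -host hostFoo" style command.
   */
  class HostDecommissionSpec {
    final String hostName;
    final boolean graceful;     // graceful vs. regular decommission
    final Integer timeoutSec;   // null = use the cluster default timeout

    HostDecommissionSpec(String hostName, boolean graceful, Integer timeoutSec) {
      this.hostName = hostName;
      this.graceful = graceful;
      this.timeoutSec = timeoutSec;
    }
  }

The open question in both bullets is where this information lives (hosts file vs. service state) so that it survives a restart.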
Maybe we can also get input from the Yahoo and Microsoft folks on admin usage by
continuing the discussion in the JIRA?
Thanks all. Overall this feature is quite useful and we look forward to it.
On Fri, Jul 15, 2016 at 4:22 PM, Robert Kanter <[email protected]> wrote:
+Karthik, who I've also been discussing this with internally.
On Thu, Jul 14, 2016 at 6:41 PM, Robert Kanter <[email protected]> wrote:
Pushing the JSON format to a followup JIRA sounds like a good idea to me. I
agree that we should focus on keeping the scope limited enough for YARN-4676 so
we can move forward.
Besides having a way to disable/enable server-side tracking, I think it would
make sense to have a way to disable/enable client-side tracking. Daniel goes
through the different scenarios, and there's a gap here: what if the user wants
to set a non-default timeout, but use server-side tracking only? If they specify
the timeout on the command line, they'll have to hit Ctrl-C, which isn't the
greatest user experience and isn't script-friendly.
Perhaps we should have it so that client-side tracking is normally done, but if
you specify a new optional argument (e.g. "--no-client-tracking" or something)
it won't. This way, we can support client-side only tracking and server-side
only tracking (I suppose this would allow no tracking, but that wouldn't work
right).
Alternatively, we could make client-side and server-side tracking mutually
exclusive, which might simplify things. The default is client-side tracking
only, but if you specify a new optional argument (e.g. "--server" or
something), it will do server-side tracking only.
Thoughts?
I'm kind of leaning towards making them mutually exclusive because the behavior
is easier to understand and better defined.
- Robert
On Thu, Jul 14, 2016 at 2:49 PM, Zhi, Daniel <[email protected]> wrote:
(added Robert to the thread and updated subject)
Regarding client-side tracking, let me describe my observation of the behavior
to ensure we are all on the same page. I have run the "-refreshNodes -g
[timeout in seconds]" command during the patch iterations inside a real cluster
(except for the last patch on 06/12/2016, where I couldn't build the AMI as
normal due to a Java 8 related conflict with another component in our AMI
rolling process).
I assume the interesting part for this conversation is when a timeout value is
specified. It will hit the refreshNodes(int timeout) function in RMAdminCLI.java,
which I pasted below for convenience. The logic submits a RefreshNodesRequest
with the user timeout, which the server-side code interprets as the default
timeout (overriding the one in the yarn-site.xml file). For an individual node
with an explicit timeout specified in the exclude host file, that specific
timeout will be used.
The server-side tracking supports (is compatible with) multiple overlapping
refreshNodes requests. For example, at time T1 a user might want to gracefully
decommission 10 nodes; while they are still pending decommission at time T2, the
user changes their mind and decommissions 5 nodes instead (5 nodes will be
removed from the exclude file and the RM will bring them back to RUNNING);
later, at time T3, the user is free to gracefully decommission more or fewer
nodes as they wish. The enhanced logic in NodesListManager.java compares the
user's intention with the current state of each RMNode and issues the proper
actions (including dynamically modifying the timeout) through events.
The client-side tracking code force-decommissions the nodes (line 362) upon
timeout. The timeout was previously tracked purely by the client (the client
must keep running to enforce it). If the client is somehow terminated (for
example, Ctrl+C by the user) before the timeout, the timeout is not enforced.
Further, if the same command is run again, the timeout clock resets. Also, the
user might have issued refreshNodes() without an explicit timeout before, in
which case the RM uses the default timeout, so a new refreshNodes(timeout)
request with client-side tracking might not lead to the desired timeout.
The current code (below) tries to balance the client-side tracking and the
server-side logic so that they are mostly compatible. For the simple scenario
where only client-side tracking is used (I assume this is Junping's scenario), I
expect the user experience to be similar. The client code below waits for 5
extra seconds before force decommissioning, which is usually not needed, as the
node will become decommissioned before or upon the timeout on the RM side. The
client will finish earlier should the node become decommissioned earlier. I have
verified this was the behavior before, but not with the 06/12 patch.
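To summarize the behavior described above in code form, here is a minimal sketch of the client-side tracking loop: poll until the excluded nodes are decommissioned, allow a few extra seconds past the timeout, then fall back to a forceful refresh. The helper names and the one-second polling interval are assumptions (java.util.concurrent.TimeUnit import omitted), not the actual RMAdminCLI implementation:

  // Hypothetical sketch of the client-side tracking loop (assumed helper names).
  private int trackAndForceDecommission(int timeoutSec) throws Exception {
    long deadlineMs = System.currentTimeMillis()
        + TimeUnit.SECONDS.toMillis(timeoutSec + 5L);  // the 5-second grace mentioned above

    while (System.currentTimeMillis() < deadlineMs) {
      if (allExcludedNodesDecommissioned()) {  // assumed helper querying the RM
        return 0;            // nodes drained early: finish before the timeout
      }
      Thread.sleep(1000);    // poll once per second (assumed interval)
    }

    // Timeout (plus grace) elapsed with nodes still DECOMMISSIONING:
    // fall back to a regular, forceful refreshNodes.
    refreshNodesForcefully();                  // assumed helper
    return 0;
  }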
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Zhi
> Assignee: Daniel Zhi
> Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf,
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch,
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch,
> YARN-4676.015.patch, YARN-4676.016.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to
> gracefully decommission YARN nodes. After the user issues the refreshNodes
> request, the ResourceManager automatically evaluates the status of all affected
> nodes and kicks off decommission or recommission actions. The RM asynchronously
> tracks container and application status related to DECOMMISSIONING nodes in
> order to decommission the nodes immediately after they are ready to be
> decommissioned. Decommissioning timeouts at individual-node granularity are
> supported and can be dynamically updated. The mechanism naturally supports
> multiple independent graceful decommissioning "sessions" where each one
> involves different sets of nodes with different timeout settings. Such support
> is ideal and necessary for graceful decommission requests issued by external
> cluster management software instead of a human.
> DecommissioningNodeWatcher inside ResourceTrackingService tracks
> DECOMMISSIONING node status automatically and asynchronously after the
> client/admin makes the graceful decommission request. It tracks
> DECOMMISSIONING node status to decide when, after all running containers on
> the node have completed, the node will be transitioned into the DECOMMISSIONED
> state. NodesListManager detects and handles include and exclude list changes
> to kick off decommission or recommission as necessary.