[ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273911#comment-15273911
 ] 

Varun Vasudev commented on YARN-4676:
-------------------------------------

bq. 1. HostsFileReader currently allows multiple hosts per line. When hosts are 
pure digits, there will be ambiguity with the timeout during interpretation. 
Allowing pure-digit hosts would likely require each pure-digit host to start on 
a new line.

Yep. The requirement for pure-digit hosts to start on a new line doesn't work 
because there might be users out there who are using the exclude feature (with 
numeric hostnames) and you'll end up breaking them. You'll have to come up with 
some other way to convey the timeout information. What you could do is add 
support for either JSON or XML formatted files based on the filename suffix. 
Then you should be able to add the information you need and not worry about the 
numeric hostname issue.
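
Something along these lines is what I had in mind - a rough sketch only; the 
JSON schema, class name, and keys below are made up for illustration and are 
not something HostsFileReader supports today:

{code:java}
// Illustrative only: hypothetical JSON exclude file and parser. Picking the
// parser from the filename suffix means a purely numeric hostname can never
// be confused with a timeout value.
//
// Example excludes.json:
//   { "hosts": [ { "name": "12345",  "timeout": 3600 },
//                { "name": "host-a", "timeout": -1 } ] }
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public final class ExcludeFileParser {
  private ExcludeFileParser() {}

  /** Returns host -> timeout in seconds; null means "use the default". */
  public static Map<String, Integer> parse(String path) throws IOException {
    Map<String, Integer> result = new HashMap<>();
    if (path.endsWith(".json")) {
      JsonNode root = new ObjectMapper().readTree(new File(path));
      for (JsonNode host : root.path("hosts")) {
        JsonNode timeout = host.get("timeout");
        result.put(host.get("name").asText(),
            timeout == null ? null : timeout.asInt());
      }
    } else {
      // Non-JSON suffixes would go through the existing plain-text
      // HostsFileReader path (omitted here).
      throw new IOException("Unsupported exclude file format: " + path);
    }
    return result;
  }
}
{code}

An equivalent XML schema could be dispatched on a .xml suffix in the same way.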

bq. 2. -1 means infinite timeout (wait forever until ready). null means no 
overwrite, use the default timeout.

Got it. Thank you for the explanation.

bq. 3. There could be a large number of hosts to be decommissioned, so the 
single line could be huge. Grepping for a particular host would return a huge 
line in that case. A mix could be to log on a single line for fewer than N 
hosts but on multiple lines otherwise. That said, I am OK with changing to a 
single line.

Yeah, but when we're tracking the changes to a node, it's much easier to grep 
through the RM log.

bq. 7. How about DEFAULT_NM_EXIT_WAIT_MS = 0? That way it could be customized 
in cases where the delay is preferred.

I'm still not convinced. From what I understand, the issue is that you have a 
supervisor script that is constantly restarting the NM if it shuts down. In the 
case of decommissioned nodes on EMR, this leads to NMs constantly coming up, 
connecting to the RM and shutting down. The timeout doesn't fix the problem - 
it just enforces a delayed shutdown for all errors (even if the node was shut 
down due to a config problem, for example). How about, on exit, you write the 
reason for the exit to a well-known location (like stdout/stderr or a named 
file)? That way the supervisor script can look at the reason for the exit and 
make a smarter decision on whether it should restart the NM and how long it 
should wait before restarting it.
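
For example (sketch only; the file name and reason strings here are made up):

{code:java}
// Sketch only: persist the shutdown reason so an external supervisor can
// decide whether to restart immediately, back off, or leave the node down.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ExitReasonWriter {
  private ExitReasonWriter() {}

  // reason could be e.g. "DECOMMISSIONED", "CONFIG_ERROR", "UNKNOWN".
  public static void writeExitReason(String dir, String reason) {
    try {
      Files.write(Paths.get(dir, "nm-exit-reason"),
          reason.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      // Best effort: fall back to stderr so the supervisor can still see it.
      System.err.println("NM exit reason: " + reason);
    }
  }
}
{code}

The supervisor script then reads that file and, for a decommission reason, 
simply doesn't restart the NM - no delayed shutdown needed.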

bq. 8. The grace period is to give the RM server side a chance to DECOMMISSION 
the node should the timeout be reached. A much smaller period like 2 seconds 
would most likely be sufficient, as the NodeManager heartbeats every second, 
during which a DECOMMISSIONING node will be re-evaluated and decommissioned if 
ready or timed out.

Sorry, I'm confused here - from my understanding of the code, if the user has 
asked for a 20 second timeout, you're internally treating that as a 40 second 
timeout. That's not the expected behaviour. Is my understanding wrong?

{quote}
15. I agree that getDecommissioningStatus suggests the call is read-only. Since 
completed apps need to be taken into account when evaluating the readiness of 
the node, getDecommissioningStatus is actually a private method used 
internally, so it could be changed into a private 
checkDecommissioningStatus(nodeId).

22. The call simply returns if it is within 20 seconds of the last call. 
Currently it lives inside ResourceTrackerService and uses rmContext. 
Alternatively, DecommissioningNodesWatcher could be constructed with rmContext 
and internally have its own polling thread. Other than not yet being sure which 
code pattern to use for such an internal thread, it appears to be a valid 
alternative to me.
{quote}

I'm going to address both of these together because they're related in my 
mind. I think it's a cleaner solution to put the poll function in its own 
thread with its own timer than to call it for every node heartbeat. It does 
away with checks like the last run time; you can make 
checkDecommissioningStatus(nodeId) a part of the poll function; it consolidates 
most of the decommissioning logic instead of spreading it across the 
ResourceTrackerService and the DecommissioningWatcher; and it also lets you 
increase/decrease the frequency of the poll by making it configurable (in the 
timer setting) instead of adding hard-coded numbers like 20 seconds in the code.
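
A rough sketch of what I mean (class and member names here are illustrative, 
not from the patch):

{code:java}
// Sketch of a dedicated poller inside DecommissioningNodesWatcher instead of
// piggybacking on every node heartbeat.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DecommissioningNodesPoller {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // pollIntervalMs would come from configuration rather than being a
  // hard-coded 20 seconds.
  void start(long pollIntervalMs) {
    scheduler.scheduleWithFixedDelay(this::poll, pollIntervalMs,
        pollIntervalMs, TimeUnit.MILLISECONDS);
  }

  private void poll() {
    // For each DECOMMISSIONING node: check readiness (all containers and
    // relevant apps done) or timeout, then transition to DECOMMISSIONED.
    // checkDecommissioningStatus(nodeId) would be called from here.
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
{code}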

bq. 9. "yarn rmadmin -refreshNodes -g -1" waits forever until the node is 
ready. "yarn rmadmin -refreshNodes -g" uses default timeout as specified by the 
configuration key.

Thank you for the clarification.

bq. 14. Here is an example of the tabular logging. Keeping a DECOMMISSIONED 
node around a little longer prevents it from suddenly disappearing from the 
list right after it is DECOMMISSIONED.

Sorry if I'm misunderstanding here. The use case is that you're grepping out 
the log lines from the decommissioning nodes watcher to determine when a node 
got decommissioned, and keeping the node around for 20s longer ensures that the 
decommissioned state is logged by the DecommissioningWatcher. Is that right?

bq. 16. readDecommissioningTimeout is there to pick up a new value without 
restarting the RM. It was requested by EMR customers and I do see the user 
scenarios. It is only invoked when there are DECOMMISSIONED nodes and will only 
be invoked once every 20 seconds (the poll period). I would have to maintain a 
private patch or consider other options if the feature is removed.

If this is something that users change often, you can add an option to the 
command to specify the default timeout for that specific run of the command, or 
an option to update the default timeout. Don't create a new YarnConfiguration 
every time.
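
If picking up a new value without an RM restart is the requirement, a 
lighter-weight option (sketch only; the class name and config key below are 
made up) is to cache the value and reload it on the refreshNodes path instead 
of constructing a new YarnConfiguration on every poll:

{code:java}
// Sketch only: cache the configured default timeout and refresh it when an
// admin refresh is processed, rather than building a new YarnConfiguration
// object every 20-second poll. The config key name is hypothetical.
import org.apache.hadoop.conf.Configuration;

class DefaultDecommissionTimeout {
  private volatile int defaultTimeoutSec;

  DefaultDecommissionTimeout(Configuration conf) {
    reload(conf);
  }

  // Called from the refreshNodes handling path, where a freshly loaded
  // Configuration is already available.
  void reload(Configuration conf) {
    defaultTimeoutSec = conf.getInt(
        "yarn.resourcemanager.decommissioning.timeout-secs", 3600);
  }

  int get() {
    return defaultTimeoutSec;
  }
}
{code}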

bq. 18. The method returns the number of seconds to timeout. I don't mind 
changing the name to getTimeoutTimestampInSec() but I don't see the reason 
behind it.

My apologies. You are correct - I misunderstood the function. The current name 
is fine.

bq. 19. See the example in 14. This is once every 20 seconds and was very 
useful during my development and testing of the work. I see more value in 
leaving it as INFO, but as the code becomes mature and stable it may be OK to 
turn it into DEBUG.

I agree that this is very useful when debugging or testing the software, and 
users can turn on debug logging when they need to figure out what's going on. 
We already have an issue with excessive logging in the RM log, so please make 
this DEBUG.

bq. 21. The isValidNode() && isNodeInDecommissioning() condition is just a very 
quick shallow check: for a DECOMMISSIONING node, although nodesListManager 
would return false for isValidNode() as the node appears in the excluded host 
list, such a node will be allowed to continue as it is in the middle of 
DECOMMISSIONING. During the processing of the heartbeat, decommissioningWatcher 
is updated with the latest container status of the node; later, 
decomWatcher.checkReadyToBeDecommissioned(rmNode.getNodeID()) evaluates its 
readiness and DECOMMISSIONs the node if ready (including timeout).

Got it. Thanks for the explanation.

bq. 25. Instead of disallowing and exiting, an alternative is to allow the 
graceful decommission as usual. There will be no difference if there is no RM 
restart during the session. In case of an RM restart, currently all excluded 
nodes are decommissioned right away; enhanced support in the future will resume 
it.

No - this just leads to bad user experiences. If the user runs a graceful 
decommission, they expect the nodes to be decommissioned gracefully, 
irrespective of the failover scenarios. Causing a forceful decommission when 
the user asked for a graceful decommission is going against what the user wants.

One additional review note - can you please rename DecomNodeStatus to 
DecommissioningNodeStatus?

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, 
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch, 
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch, 
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch, 
> YARN-4676.012.patch, YARN-4676.013.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism for 
> gracefully decommissioning
> YARN nodes. After the user issues the refreshNodes request, the 
> ResourceManager automatically evaluates
> the status of all affected nodes and kicks off decommission or recommission 
> actions. The RM asynchronously
> tracks container and application status related to DECOMMISSIONING nodes to 
> decommission the
> nodes immediately after they are ready to be decommissioned. Decommissioning 
> timeout at individual
> node granularity is supported and can be dynamically updated. The 
> mechanism naturally supports multiple
> independent graceful decommissioning “sessions” where each one involves 
> different sets of nodes with
> different timeout settings. Such support is ideal and necessary for graceful 
> decommission requests issued
> by external cluster management software instead of humans.
> DecommissioningNodeWatcher inside ResourceTrackingService tracks 
> DECOMMISSIONING node status automatically and asynchronously after the 
> client/admin makes the graceful decommission request. It tracks 
> DECOMMISSIONING node status to decide when, after all running containers on 
> the node have completed, the node will be transitioned into the 
> DECOMMISSIONED state. NodesListManager detects and handles include and 
> exclude list changes to kick off decommission or recommission as necessary.


