Arjun Mohnot created YARN-11730:
-----------------------------------

Summary: Resourcemanager node reporting enhancement for unregistered hosts
Key: YARN-11730
URL: https://issues.apache.org/jira/browse/YARN-11730
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager, yarn
Affects Versions: 3.4.0
Environment: Tested on multiple environments:
A. Docker Environment:
* Base OS: *Ubuntu 20.04*
* *Java 8* installed from OpenJDK.
* Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
* Verified behavior using a Hadoop snapshot in a containerized environment.
* Performed NameNode formatting and validated service interactions through exposed ports.
* Repo reference: [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]

B. Bare-metal Distributed Setup (RedHat Linux):
* Running *Java 8* in a High-Availability (HA) configuration with *ZooKeeper* as the locking mechanism.
* Two ResourceManagers (RMs) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
* Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.

Reporter: Arjun Mohnot
Fix For: 3.5.0

h3. Issue Overview

When the ResourceManager (RM) starts, nodes listed in the _"include"_ file are not reported until their corresponding NodeManagers (NMs) send their first heartbeat. Nodes in the _"exclude"_ file, however, are instantly reflected in the _"Decommissioned Hosts"_ section with a port value of -1. This design creates several challenges:

* *Untracked NodeManagers*: During a ResourceManager HA failover or a standalone RM restart, some nodes may not report back even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. They end up in an untracked state, which makes their status difficult to monitor. In HDFS, by contrast, a node in the comparable situation is marked as _DEAD_.
* *Monitoring Gaps*: Nodes in the _"include"_ file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, because the ResourceManager's view of the total number of nodes is incomplete until every node has reported.
* *Operational Impact*: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as _LOST_, _UNHEALTHY_, _DECOMMISSIONED_, etc. Nodes that never report require ad hoc workarounds to determine their actual status.

h3. Proposed Solution

To address these issues, we propose automatically assigning the _LOST_ state to any node listed in the _"include"_ file by default at RM startup or HA failover. This can be done by marking the node with a special port value of _-2_, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that {color:#de350b}nodeID{color}, the node transitions from _LOST_ to _RUNNING_, _UNHEALTHY_, or any other appropriate state.

h3. Key implementation points

* Mark Unreported Nodes as LOST: Nodes in the _"include"_ file that are not part of the RM active node context should be automatically marked as _LOST_. This can be achieved by modifying the {color:#de350b}refreshHostsReader{color} method of _NodesListManager_, which is invoked during failover and manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating that the node is untracked.
* For non-HA setups, this process can be triggered during RM service startup to mark nodes as _LOST_ initially; they then transition to their actual state as heartbeats are received.
* Handle Node Heartbeat and Transition: When a node sends its first heartbeat, the RM should check whether the node is listed in {color:#de350b}getInactiveRMNodes(){color}. If the node exists there in the _LOST_ state, the RM should remove it from the inactive list, decrement the _LOST_ node count, and handle the transition back to the active node set.
* This logic can be placed in the state transition method within {color:#de350b}RMNodeImpl.java{color}, so that nodes can transition from _NEW_ to _LOST_ and recover gracefully from _LOST_ upon receiving their first heartbeat.

h3. Benefits

* *Improved Cluster Monitoring*: Automatically assigning the _LOST_ state to nodes that are listed in the _"include"_ file but are not reporting ensures that every node in the cluster has a well-defined state (_ACTIVE_, _LOST_, _DECOMMISSIONED_, _UNHEALTHY_, etc.). This closes the gaps in cluster node visibility and simplifies operational monitoring.
* *Better Recovery Management*: With unreported nodes marked as _LOST_, automation can quickly identify which nodes require attention during recovery efforts to restore cluster health. This prevents confusion between unreachable nodes and untracked nodes, improving recovery accuracy.
* *Enhanced Cluster Stability*: This approach prevents nodes from slipping into an untracked or unknown state. The RM remains aware of all nodes, which reduces issues during RM failover or restart scenarios.

h3. Additional Considerations

* Feature Flag Control: This feature can be enabled or disabled via a configuration flag, allowing users to adjust the behavior to their requirements. It is disabled (_False_) by default.
* Validation: The approach has been tested on non-HA and HA setups, and a dummy Docker-based [setup|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] has been created to replicate the behavior. Unit test cases have been added to validate the code behavior. A demo [video|https://drive.google.com/file/d/1okiPe7uMNVMRUnNYtz-B8Igf8FMGr-SJ/view?usp=sharing] of this change is available.

Any thoughts/suggestions/feedback are welcome!
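
For illustration, the flow described under "Key implementation points" could be sketched roughly as follows. This is a simplified, self-contained model and not the actual patch: the class {{UnregisteredNodeSketch}}, its methods, and the boolean constructor flag are hypothetical stand-ins for the logic in _NodesListManager_, _RMNodeImpl_, and the proposed configuration flag. Only the port value _-2_ and the _LOST_ / _RUNNING_ states are taken from the description above.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the RM's node bookkeeping; not the real YARN classes. */
class UnregisteredNodeSketch {
    /** Port marker from the proposal: include-file host that has not reported yet. */
    static final int UNREGISTERED_PORT = -2;

    enum State { LOST, RUNNING }

    // Hypothetical feature flag; the proposal keeps the feature disabled by default.
    private final boolean featureEnabled;
    // host -> NodeManager port, for nodes that have sent a heartbeat.
    private final Map<String, Integer> activeNodes = new HashMap<>();
    // host -> port, for LOST placeholders (always -2 in this sketch).
    private final Map<String, Integer> inactiveNodes = new HashMap<>();

    UnregisteredNodeSketch(boolean featureEnabled) {
        this.featureEnabled = featureEnabled;
    }

    /** Models a host refresh at RM startup, HA failover, or manual node refresh. */
    void refreshHosts(Collection<String> includeHosts) {
        if (!featureEnabled) {
            return; // legacy behavior: unreported hosts stay invisible
        }
        for (String host : includeHosts) {
            if (!activeNodes.containsKey(host)) {
                // Not registered yet: record it as LOST with the -2 marker.
                inactiveNodes.putIfAbsent(host, UNREGISTERED_PORT);
            }
        }
    }

    /** Models the first heartbeat: drop the LOST placeholder, then activate the node. */
    void onHeartbeat(String host, int port) {
        inactiveNodes.remove(host); // the LOST count shrinks with the map
        activeNodes.put(host, port);
    }

    /** Number of nodes currently counted as LOST. */
    int lostCount() {
        return inactiveNodes.size();
    }

    /** Returns the tracked state, or null if the host is unknown to this sketch. */
    State stateOf(String host) {
        if (activeNodes.containsKey(host)) {
            return State.RUNNING;
        }
        if (inactiveNodes.containsKey(host)) {
            return State.LOST;
        }
        return null;
    }

    /** Returns the recorded port, or null if the host is unknown to this sketch. */
    Integer portOf(String host) {
        Integer port = activeNodes.get(host);
        return port != null ? port : inactiveNodes.get(host);
    }
}
```

In this model, calling {{refreshHosts}} with two include-file hosts leaves both in _LOST_ with port _-2_; a later {{onHeartbeat}} for one of them moves it to _RUNNING_ and lowers the LOST count by one. With the flag off, {{refreshHosts}} is a no-op, which mirrors the current behavior.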