Wilfred Spiegelenburg created YARN-2578:
-------------------------------------------

             Summary: NM does not failover timely if RM node network connection fails
                 Key: YARN-2578
                 URL: https://issues.apache.org/jira/browse/YARN-2578
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.5.1
            Reporter: Wilfred Spiegelenburg


The NM does not fail over correctly when the network cable of the RM is 
unplugged, or when the failure is simulated with "service network stop" or a 
firewall that drops all traffic on the node. The RM fails over to the standby 
node once the failure is detected, as expected. The NM should then re-register 
with the new active RM, but this re-registration takes a long time (15 minutes 
or more). Until then the cluster has no nodes for processing and applications 
are stuck.

Reproduction test case which can be used in any environment:
- create a cluster with 3 nodes
    node 1: ZK, NN, JN, ZKFC, DN, RM, NM
    node 2: ZK, NN, JN, ZKFC, DN, RM, NM
    node 3: ZK, JN, DN, NM
- start all services and make sure they are in good health
- kill the network connection of the active RM using one of the methods 
described above
- observe the NN and RM failover
- the DNs fail over to the new active NN
- the NM does not recover for a long time
- the logs show a long delay and the stack traces show no change at all

The stack traces of the NM all show the same set of threads. The main thread, 
which should be used for the re-registration, is the "Node Status Updater". 
This thread is stuck in:
{code}
"Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in Object.wait() [0x00007f5a51fc1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.ipc.Client.call(Client.java:1395)
        - locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1362)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
{code}

The client connection, which goes through the proxy, can be traced back to the 
ResourceTrackerPBClientImpl. The generated proxy does not time out, so the 
heartbeat call keeps waiting for a response. We should be using a version of 
the proxy which takes the RPC timeout (from the configuration) as a parameter.
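
To illustrate the difference a timeout makes, here is a minimal sketch using 
only the JDK. It is not Hadoop's code: the class TimedCallSketch and its 
methods complete() and await() are hypothetical stand-ins for an in-flight RPC 
call. With a timeout of 0 the caller blocks in Object.wait() indefinitely, 
which is the state seen in the trace above; with a positive timeout the caller 
gets an exception and is free to retry or fail over.

```java
import java.net.SocketTimeoutException;

/** Hypothetical stand-in for an in-flight RPC call; not Hadoop's Client.Call. */
class TimedCallSketch {
    private boolean done = false;
    private String value;

    /** Would be invoked by a receiver thread when a response arrives. */
    public synchronized void complete(String v) {
        value = v;
        done = true;
        notifyAll();
    }

    /**
     * Waits for the response. A timeoutMs of 0 means "wait forever",
     * which is the behaviour that leaves the caller stuck.
     */
    public synchronized String await(long timeoutMs)
            throws InterruptedException, SocketTimeoutException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!done) {
            if (timeoutMs == 0) {
                wait(); // no timeout: blocks until complete() is called
            } else {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    throw new SocketTimeoutException(
                            "no response within " + timeoutMs + " ms");
                }
                wait(remainingMs); // bounded wait, re-checked in the loop
            }
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        // Nobody ever completes this call, as when the peer drops all traffic.
        TimedCallSketch call = new TimedCallSketch();
        try {
            call.await(200);
        } catch (SocketTimeoutException e) {
            System.out.println("timed out, caller is free to retry or fail over");
        }
    }
}
```

In the sketch the timeout is passed in by the caller; in the NM it would come 
from the configuration, as suggested above.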



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
