[ 
https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144004#comment-14144004
 ] 

Hadoop QA commented on YARN-2578:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch
  against trunk revision 23e17ce.

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5071//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5071//console

This message is automatically generated.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>         Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is 
> unplugged or the failure is simulated by a "service network stop" or a 
> firewall that drops all traffic on the node. The RM fails over to the standby 
> node when the failure is detected as expected. The NM should than re-register 
> with the new active RM. This re-register takes a long time (15 minutes or 
> more). Until then the cluster has no nodes for processing and applications 
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>     node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the 
> network kills from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread 
> which should be used in the re-register is the "Node Status Updater" This 
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in 
> Object.wait() [0x00007f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>       at java.lang.Object.wait(Object.java:503)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>       - locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>       at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>       at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the 
> ResourceTrackerPBClientImpl. The generated proxy does not time out and we 
> should be using a version which takes the RPC timeout (from the 
> configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to