Hi,
For troubleshooting, here are a few things you can verify:
1. Check the RM web UI (http://< yarn.resourcemanager.webapp.address>/cluster) to see whether there are any "Active Nodes" in the YARN cluster. Also check for "Lost Nodes", "Unhealthy Nodes", and "Rebooted Nodes". If there are active nodes, cross-verify the "Memory Total"; it should be: Memory Total = Number of Active Nodes * value of { yarn.nodemanager.resource.memory-mb }.
2. The NodeManager logs give more information; check the NM logs as well.
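The check in point 1 can also be scripted against the RM's REST API. A rough sketch, assuming the standard /ws/v1/cluster/metrics endpoint (whose payload includes activeNodes, lostNodes, unhealthyNodes, rebootedNodes, and totalMB) and an example NM memory value of 8192 MB; the RM address is a placeholder you must substitute:

```python
import json
from urllib.request import urlopen  # used in the commented example below

def summarize_cluster(metrics, nm_memory_mb):
    """Given the clusterMetrics payload from the RM REST API, report node
    counts and whether Memory Total equals
    activeNodes * yarn.nodemanager.resource.memory-mb."""
    m = metrics["clusterMetrics"]
    expected_mb = m["activeNodes"] * nm_memory_mb
    return {
        "activeNodes": m["activeNodes"],
        "lostNodes": m["lostNodes"],
        "unhealthyNodes": m["unhealthyNodes"],
        "rebootedNodes": m["rebootedNodes"],
        "totalMB": m["totalMB"],
        "memoryTotalMatches": m["totalMB"] == expected_mb,
    }

# Against a live cluster (substitute your yarn.resourcemanager.webapp.address):
#   payload = json.load(urlopen("http://<rm-host>:<port>/ws/v1/cluster/metrics"))
#   print(summarize_cluster(payload, nm_memory_mb=8192))
```

If memoryTotalMatches is False, some NodeManagers are likely not registered with the RM, which points back to checks 1 and 2 above.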
>>> In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run
This may be because your YARN cluster does not have enough memory to launch containers. Possible reasons:
1. None of the NMs are sending heartbeats to the RM (check the RM web UI for Unhealthy Nodes).
2. All the NMs are lost/unhealthy.
3. The full cluster capacity is in use, so the YARN scheduler is waiting for some container to finish before it can assign the released memory to other containers.
Looking at your DataNode socket timeout exception (480000 ms is a full 8 minutes!), I suspect the Hadoop cluster's network is UNSTABLE. It would be better to debug the network.
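As a first network sanity check, one illustrative approach (my suggestion, not something from the log) is a plain TCP connect test between the two hosts in the exception, using the DataNode transfer port 50010 shown there; repeated failures or long stalls would support the network-instability theory:

```python
import socket

def check_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout
    seconds, False on refusal or timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Addresses taken from the exception in the DataNode log:
#   print(check_port("172.20.5.141", 50010))
```

Run it in a loop from the client side (172.20.5.147 in the log) to see whether connectivity to the DataNode flaps over time.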
Thanks & Regards
Rohith Sharma K S
From: Clay McDonald [mailto:[email protected]]
Sent: 14 March 2014 01:30
To: '[email protected]'
Subject: NodeManager health Question
Hello all, I have laid out my POC in a project plan and have HDP 2.0 installed.
HDFS is running fine and have loaded up about 6TB of data to run my test on. I
have a series of SQL queries that I will run in Hive ver. 0.12.0. I had to
manually install Hue and still have a few issues I'm working on there. But at
the moment, my most pressing issue is with Hive jobs not running. In Yarn, my
Hive queries are "Accepted" but are "Unassigned" and do not run. See attached.
In Ambari, the datanodes all have the following error: NodeManager health CRIT
for 20 days CRITICAL: NodeManager unhealthy
From the datanode logs I found the following:
ERROR datanode.DataNode (DataXceiver.java:run(225)) - dc-bigdata1.bateswhite.com:50010:DataXceiver error processing READ_BLOCK operation src: /172.20.5.147:51299 dest: /172.20.5.141:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.20.5.141:50010 remote=/172.20.5.147:51299]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:662)
Also, in the namenode log I see the following:
2014-03-13 13:50:57,204 WARN security.UserGroupInformation
(UserGroupInformation.java:getGroupNames(1355)) - No groups available for user
dr.who
If anyone can point me in the right direction to troubleshoot this, I would
really appreciate it!
Thanks! Clay