Hi,
For troubleshooting, here are a few things you can verify:
1. Check the RM web UI (http://< yarn.resourcemanager.webapp.address>/cluster) to see whether there are any "Active Nodes" in the YARN cluster. Also check for "Lost Nodes", "Unhealthy Nodes", and "Rebooted Nodes". If there are active nodes, cross-verify the "Memory Total"; it should be: Memory Total = Number of Active Nodes * value of { yarn.nodemanager.resource.memory-mb }.
2. The NodeManager logs give more information; check the NM logs as well.
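The check in point 1 can also be scripted against the RM's REST API. A rough sketch, assuming the standard /ws/v1/cluster/metrics endpoint (whose payload includes activeNodes, lostNodes, unhealthyNodes, rebootedNodes, and totalMB) and an example NM memory value of 8192 MB; the RM address is a placeholder you must substitute:

```python
import json
from urllib.request import urlopen  # used in the commented example below

def summarize_cluster(metrics, nm_memory_mb):
    """Given the clusterMetrics payload from the RM REST API, report node
    counts and whether Memory Total equals
    activeNodes * yarn.nodemanager.resource.memory-mb."""
    m = metrics["clusterMetrics"]
    expected_mb = m["activeNodes"] * nm_memory_mb
    return {
        "activeNodes": m["activeNodes"],
        "lostNodes": m["lostNodes"],
        "unhealthyNodes": m["unhealthyNodes"],
        "rebootedNodes": m["rebootedNodes"],
        "totalMB": m["totalMB"],
        "memoryTotalMatches": m["totalMB"] == expected_mb,
    }

# Against a live cluster (substitute your yarn.resourcemanager.webapp.address):
#   payload = json.load(urlopen("http://<rm-host>:<port>/ws/v1/cluster/metrics"))
#   print(summarize_cluster(payload, nm_memory_mb=8192))
```

If memoryTotalMatches is False, some NodeManagers are likely not registered with the RM, which points back to checks 1 and 2 above.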
>>> In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run
This may be because your YARN cluster does not have enough memory to launch containers. Possible reasons:
1. None of the NMs are sending heartbeats to the RM (check the RM web UI for Unhealthy Nodes).
2. All the NMs are lost/unhealthy.
3. The full cluster capacity is in use, so the YARN scheduler is waiting for some container to finish before it can assign the released memory to other containers.
Looking at your DataNode socket timeout exception (480000 ms is a full 8 minutes!), I suspect the Hadoop cluster's network is UNSTABLE. It would be better to debug the network.
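As a first network sanity check, one illustrative approach (my suggestion, not something from the log) is a plain TCP connect test between the two hosts in the exception, using the DataNode transfer port 50010 shown there; repeated failures or long stalls would support the network-instability theory:

```python
import socket

def check_port(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout
    seconds, False on refusal or timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Addresses taken from the exception in the DataNode log:
#   print(check_port("172.20.5.141", 50010))
```

Run it in a loop from the client side (172.20.5.147 in the log) to see whether connectivity to the DataNode flaps over time.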
Thanks & Regards
Rohith Sharma K S
From: Clay McDonald [mailto:[email protected]]
Sent: 14 March 2014 01:30
To: '[email protected]'
Subject: NodeManager health Question
Hello all, I have laid out my POC in a project plan and have HDP 2.0 installed.
HDFS is running fine and have loaded up about 6TB of data to run my test on. I
have a series of SQL queries that I will run in Hive ver. 0.12.0. I had to
manually install Hue and still have a few issues I'm working on there. But at
the moment, my most pressing issue is with Hive jobs not running. In Yarn, my
Hive queries are "Accepted" but are "Unassigned" and do not run. See attached.
In Ambari, the datanodes all have the following error: NodeManager health CRIT
for 20 days CRITICAL: NodeManager unhealthy
From the datanode logs I found the following:
ERROR datanode.DataNode (DataXceiver.java:run(225)) - dc-bigdata1.bateswhite.com:50010:DataXceiver error processing READ_BLOCK operation src: /172.20.5.147:51299 dest: /172.20.5.141:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.20.5.141:50010 remote=/172.20.5.147:51299]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:662)
Also, in the namenode log I see the following:
2014-03-13 13:50:57,204 WARN security.UserGroupInformation
(UserGroupInformation.java:getGroupNames(1355)) - No groups available for user
dr.who
If anyone can point me in the right direction to troubleshoot this, I would
really appreciate it!
Thanks! Clay