Thanks Rohith, I restarted the datanodes and all is well.
From: Rohith Sharma K S [mailto:[email protected]]
Sent: Thursday, March 13, 2014 10:56 PM
To: [email protected]
Subject: RE: NodeManager health Question

Hi,

A few things you can verify while troubleshooting:

1. Check the RM web UI at http://<yarn.resourcemanager.webapp.address>/cluster for any "Active Nodes" in the YARN cluster, and also look at "Lost Nodes", "Unhealthy Nodes", and "Rebooted Nodes". If there are active nodes, cross-check "Memory Total", which should be:
   Memory Total = Number of Active Nodes * value of yarn.nodemanager.resource.memory-mb
2. The NodeManager logs give more information; check them as well.

>>> In Yarn, my Hive queries are "Accepted" but are "Unassigned" and do not run

This may mean your YARN cluster does not have enough memory to launch containers. Possible reasons:

1. None of the NodeManagers is sending heartbeats to the RM (check the RM web UI for Unhealthy Nodes).
2. All of the NodeManagers are lost/unhealthy.
3. The full cluster capacity is in use, so the YARN scheduler is waiting for some container to finish before it can assign the released memory to other containers.

Looking at your DataNode socket timeout exception (8 minutes!), I suspect the Hadoop cluster network is unstable. It would be better to debug the network.

Thanks & Regards,
Rohith Sharma K S

From: Clay McDonald [mailto:[email protected]]
Sent: 14 March 2014 01:30
To: '[email protected]'
Subject: NodeManager health Question

Hello all, I have laid out my POC in a project plan and have HDP 2.0 installed. HDFS is running fine, and I have loaded about 6 TB of data to run my tests on. I have a series of SQL queries that I will run in Hive 0.12.0. I had to manually install Hue and still have a few issues I'm working on there, but at the moment my most pressing issue is Hive jobs not running. In YARN, my Hive queries are "Accepted" but remain "Unassigned" and do not run. See attached.
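[Editor's note] Rohith's first check (counting active nodes and verifying the Memory Total formula) can be scripted. A minimal sketch, assuming a Hadoop 2.x ResourceManager; the host/port and the two node/memory numbers below are illustrative placeholders, not values from this thread:

```shell
# Illustrative placeholder: your yarn.resourcemanager.webapp.address
RM_HOST="rm.example.com:8088"

# Node states (lost/unhealthy/rebooted nodes) via the RM REST API:
#   curl -s "http://${RM_HOST}/ws/v1/cluster/nodes"
# or, with a YARN client on the PATH:
#   yarn node -list -all

# Cross-check "Memory Total" from the RM web UI against the formula:
#   Memory Total = Active Nodes * yarn.nodemanager.resource.memory-mb
ACTIVE_NODES=5      # example: "Active Nodes" from the RM web UI
NM_MEMORY_MB=8192   # example: yarn.nodemanager.resource.memory-mb
EXPECTED_TOTAL_MB=$((ACTIVE_NODES * NM_MEMORY_MB))
echo "Expected Memory Total: ${EXPECTED_TOTAL_MB} MB"
```

If the total the RM reports is lower than this product, some NodeManagers are not registered (lost or unhealthy), which matches Rohith's possible reasons 1 and 2.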
In Ambari, the datanodes all have the following error:

    NodeManager health CRIT for 20 days
    CRITICAL: NodeManager unhealthy

From the datanode logs I found the following:

    ERROR datanode.DataNode (DataXceiver.java:run(225)) - dc-bigdata1.bateswhite.com:50010:DataXceiver error processing READ_BLOCK operation src: /172.20.5.147:51299 dest: /172.20.5.141:50010
    java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.20.5.141:50010 remote=/172.20.5.147:51299]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:662)

Also, in the namenode log I see the following:

    2014-03-13 13:50:57,204 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1355)) - No groups available for user dr.who

If anyone can point me in the right direction to troubleshoot this, I would really appreciate it! Thanks!

Clay
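[Editor's note] The "480000 millis timeout" in the stack trace is the DataNode's socket write timeout (the standard HDFS property dfs.datanode.socket.write.timeout, whose default is 480000 ms), which is why Rohith reads it as an 8-minute stall. A quick sanity check of the unit math:

```shell
# dfs.datanode.socket.write.timeout defaults to 480000 ms, matching the
# "480000 millis timeout" in the DataXceiver stack trace above:
TIMEOUT_MS=480000
TIMEOUT_MIN=$((TIMEOUT_MS / 1000 / 60))
echo "DataNode write timeout: ${TIMEOUT_MIN} minutes"   # prints 8 minutes
```

On a live cluster, `hdfs getconf -confKey dfs.datanode.socket.write.timeout` shows the effective value. Note that raising the timeout would only mask an unstable network, which is why the advice in this thread is to debug the network itself; the resolution at the top (restarting the datanodes) cleared the unhealthy state.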
